Essentially, I have a 6.4 GB XML file that I'd like to convert to JSON and then save to disk. I'm currently running OS X 10.8.4 with an i7 2700K and 16 GB of RAM, and running 64-bit Python (double-checked). I'm getting an error saying there isn't enough memory to allocate. How do I go about fixing this?
import json
import xmltodict

print 'Opening'
f = open('large.xml', 'r')
data = f.read()
f.close()
print 'Converting'
newJSON = xmltodict.parse(data)
print 'Json Dumping'
newJSON = json.dumps(newJSON)
print 'Saving'
f = open('newjson.json', 'w')
f.write(newJSON)
f.close()
The Error:
Python(2461) malloc: *** mmap(size=140402048315392) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/Users/user/Git/Resources/largexml2json.py", line 10, in <module>
    data = f.read()
MemoryError
Many Python XML libraries support parsing XML sub-elements incrementally, e.g. xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. Such functions are usually called "XML stream parsers".
The xmltodict library you used also has a streaming mode, which I think may solve your problem.
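As a rough illustration of the incremental parsing mentioned above, here is a minimal sketch using xml.etree.ElementTree.iterparse. It assumes the file is a flat list of repeated record elements that are direct children of the root; the tag name `item` and the helper names `iter_records` and `xml_to_json_lines` are hypothetical, and the output is one JSON object per line rather than a single JSON document.

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element without loading the whole file.

    Assumption: records are direct children of the root element and are
    not nested inside each other.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            record = dict(elem.attrib)
            for child in elem:
                record[child.tag] = (child.text or "").strip()
            yield record
            root.clear()  # drop finished records so memory use stays flat


def xml_to_json_lines(source, out, record_tag):
    """Write each record as one JSON object per line; return the record count."""
    count = 0
    for record in iter_records(source, record_tag):
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<items><item id='1'><name>foo</name></item>"
        b"<item id='2'><name>bar</name></item></items>"
    )
    out = io.StringIO()
    xml_to_json_lines(sample, out, "item")
    print(out.getvalue(), end="")
```

For a real file you would pass the file name (or a file opened in binary mode) as `source` and an open output file as `out`. Clearing the root after each record is what keeps memory bounded; without it the tree keeps growing as the parser advances.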
Instead of trying to read the file in one go and then process it, you want to read it in chunks and process each chunk as it's loaded. This is a fairly common situation when processing large XML files and is covered by the Simple API for XML (SAX) standard, which specifies a callback API for parsing XML streams. It's part of the Python standard library as xml.sax.parse, and xml.etree.ElementTree.iterparse offers a similar incremental interface, as mentioned above.
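To make the callback style concrete, here is a small sketch built on xml.sax from the standard library. The parser calls `startElement` for every opening tag, so nothing beyond the current element's attributes is held in memory. The class name `RecordHandler`, the function `sax_attributes_to_json_lines`, and the choice to emit only attributes are assumptions made for the illustration.

```python
import io
import json
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Write the attributes of each <record_tag> element as one JSON line."""

    def __init__(self, record_tag, out):
        super().__init__()
        self.record_tag = record_tag
        self.out = out
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag as the stream is read.
        if name == self.record_tag:
            self.out.write(json.dumps(dict(attrs.items())) + "\n")
            self.count += 1


def sax_attributes_to_json_lines(source, out, record_tag):
    """Stream `source` through the handler and return the number of records."""
    handler = RecordHandler(record_tag, out)
    xml.sax.parse(source, handler)
    return handler.count


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<mylist attr="a"><myitem name="foo"/><myitem name="bar"/></mylist>'
    )
    out = io.StringIO()
    sax_attributes_to_json_lines(sample, out, "myitem")
    print(out.getvalue(), end="")
```

Capturing element text or nested children would additionally need `characters` and `endElement` callbacks, which is where SAX code tends to get longer than the iterparse equivalent.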
Here's a quick XML to JSON converter:
from collections import defaultdict
import json
import xml.etree.ElementTree as ET


def parse_xml(file_name):
    events = ("start", "end")
    context = ET.iterparse(file_name, events=events)
    return pt(context)


def pt(context, cur_elem=None):
    items = defaultdict(list)
    # Compare against None explicitly: an Element's truth value depends on
    # whether it has children, so "if cur_elem:" can skip the attributes.
    if cur_elem is not None:
        items.update(cur_elem.attrib)
    text = ""
    for action, elem in context:
        # print("{0:>6} : {1:20} {2:20} '{3}'".format(action, elem.tag, elem.attrib, str(elem.text).strip()))
        if action == "start":
            # Recurse: the nested call consumes events up to this element's "end"
            items[elem.tag].append(pt(context, elem))
        elif action == "end":
            text = elem.text.strip() if elem.text else ""
            elem.clear()  # free the element once it has been converted
            break
    if len(items) == 0:
        return text
    # Unwrap single-element lists; attribute values (strings) pass through unchanged
    return {k: v[0] if isinstance(v, list) and len(v) == 1 else v
            for k, v in items.items()}


if __name__ == "__main__":
    json_data = parse_xml("large.xml")
    print(json.dumps(json_data, indent=2))
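The converter above still builds the complete result in memory before calling json.dumps, which may be too much for a multi-gigabyte input. One possible way around this, sketched here with a hypothetical helper `write_json_array`, is to write the JSON array piece by piece as records arrive from any generator:

```python
import io
import json


def write_json_array(records, out):
    """Write an iterable of JSON-serialisable records as one JSON array.

    Each record is serialised and written as soon as it is produced, so
    only one record needs to be in memory at a time. Returns the count.
    """
    count = 0
    out.write("[")
    for record in records:
        if count:
            out.write(",")  # separator goes before every record except the first
        out.write("\n  " + json.dumps(record))
        count += 1
    out.write("\n]\n")
    return count


if __name__ == "__main__":
    # Stand-in generator; in practice this would yield parsed XML records.
    fake_records = ({"id": i, "name": "item%d" % i} for i in range(3))
    buf = io.StringIO()
    write_json_array(fake_records, buf)
    print(buf.getvalue(), end="")
```

The result is a single valid JSON document, at the cost of handling the separators by hand instead of leaving them to json.dumps.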
If you're doing a lot of XML processing, check out the lxml library. It has a ton of useful features over and above the standard modules, while also being much easier to use.
Here's a Python 3 script for converting XML of a certain structure to JSON using xmltodict's streaming feature. The script keeps very little in memory, so the size of the input is not limited by available RAM. It makes a lot of assumptions about the input, so your mileage will vary, but it may get you started. Hope this helps.
#!/usr/bin/env python3
"""
Converts an XML file with a single outer list element
and a repeated list member element to JSON on stdout.
Processes large XML files with minimal memory using the
streaming feature of https://github.com/martinblech/xmltodict
which is required ("pip install xmltodict").

Expected input structure (element names are just examples):
  <mylist attr="a">
    <myitem name="foo"></myitem>
    <myitem name="bar"></myitem>
    <myitem name="baz"></myitem>
  </mylist>

Output:
  {
    "mylist": {
      "attr": "a",
      "myitem": [
        {
          "name": "foo"
        },
        {
          "name": "bar"
        },
        {
          "name": "baz"
        }
      ]
    }
  }
"""
import json
import os
import sys
import xmltodict

ROOT_SEEN = False


def handle_item(path, element):
    """
    Called by xmltodict on every item found at the specified depth.
    This requires a depth >= 2.
    """
    # print("path {} -> element: {}".format(path, element))
    global ROOT_SEEN
    if path is None and element is None:
        # after element n
        print(']')  # list of items
        print('}')  # outer list
        print('}')  # root
        return False
    elif ROOT_SEEN:
        # element 2..n
        print(",")
    else:
        # element 1
        ROOT_SEEN = True
        print('{')  # root
        # each path item is a tuple (name, OrderedDict)
        # json.dumps quotes and escapes names and values correctly
        print(json.dumps(path[0][0]) + ': {')  # outer list
        # emit any root element attributes
        if path[0][1] is not None and len(path[0][1]) > 0:
            for key, value in path[0][1].items():
                print('{}: {},'.format(json.dumps(key), json.dumps(value)))
        # use the repeated element name for the JSON list
        print(json.dumps(path[1][0]) + ': [')  # list of items

    # Emit attributes and contents by merging the contents into
    # the ordered dict of attributes so the attr appear first.
    if path[1][1] is not None and len(path[1][1]) > 0:
        ordict = path[1][1]
        ordict.update(element)
    else:
        ordict = element
    print(json.dumps(ordict, indent=2))
    return True


def usage(args, err=None):
    """
    Emits a message and exits.
    """
    if err:
        print("{}: {}".format(args[0], err), file=sys.stderr)
    print("Usage: {} <xml-file-name>".format(args[0]), file=sys.stderr)
    sys.exit(1)  # non-zero status, since usage() is only called on errors


if __name__ == '__main__':
    if len(sys.argv) != 2:
        usage(sys.argv)
    xmlfile = sys.argv[1]
    if not os.path.isfile(xmlfile):
        usage(sys.argv, 'Not found or not a file: {}'.format(xmlfile))
    with open(xmlfile, 'rb') as f:
        # Set item_depth to turn on the streaming feature
        # Do not prefix attribute keys with @
        xmltodict.parse(f, item_depth=2, attr_prefix='', item_callback=handle_item)
    handle_item(None, None)