简体   繁体   中英

Python - Convert Very Large (6.4GB) XML files to JSON

Essentially, I have a 6.4GB XML file that I'd like to convert to JSON then save it to disk. I'm currently running OSX 10.8.4 with an i7 2700k and 16GBs of ram, and running Python 64bit (double checked). I'm getting an error that I don't have enough memory to allocate. How do I go about fixing this?

print 'Opening'
f = open('large.xml', 'r')
data = f.read()
f.close()

print 'Converting'
newJSON = xmltodict.parse(data)

print 'Json Dumping'
newJSON = json.dumps(newJSON)

print 'Saving'
f = open('newjson.json', 'w')
f.write(newJSON)
f.close()

The Error:

Python(2461) malloc: *** mmap(size=140402048315392) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/Users/user/Git/Resources/largexml2json.py", line 10, in <module>
    data = f.read()
MemoryError

Many Python XML libraries support parsing XML sub elements incrementally, eg xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. These functions are usually called "XML Stream Parser".

The xmltodict library you used also has a streaming mode. I think it may solve your problem

https://github.com/martinblech/xmltodict#streaming-mode

Instead of trying to read the file in one go and then process it, you want to read it in chunks and process each chunk as it's loaded. This is a fairly common situation when processing large XML files and is covered by the Simple API for XML (SAX) standard, which specifies a callback API for parsing XML streams - it's part of the Python standard library under xml.sax.parse and xml.etree.ETree as mentioned above.

Here's a quick XML to JSON converter:

from collections import defaultdict
import json
import xml.etree.ElementTree as ET

def parse_xml(file_name):
    events = ("start", "end")
    context = ET.iterparse(file_name, events=events)

    return pt(context)

def pt(context, cur_elem=None):
    items = defaultdict(list)

    if cur_elem:
        items.update(cur_elem.attrib)

    text = ""

    for action, elem in context:
        # print("{0:>6} : {1:20} {2:20} '{3}'".format(action, elem.tag, elem.attrib, str(elem.text).strip()))

        if action == "start":
            items[elem.tag].append(pt(context, elem))
        elif action == "end":
            text = elem.text.strip() if elem.text else ""
            elem.clear()
            break

    if len(items) == 0:
        return text

    return { k: v[0] if len(v) == 1 else v for k, v in items.items() }

if __name__ == "__main__":
    json_data = parse_xml("large.xml")
    print(json.dumps(json_data, indent=2))

If you're looking at a lot of XML processing check out the lxml library, it's got a ton of useful stuff over and above the standard modules, while also being much easier to use.

http://lxml.de/tutorial.html

Here's a Python3 script for converting XML of a certain structure to JSON using xmltodict's streaming feature. The script keeps very little in memory so there is no limit on the size of the input. This makes a lot of assumptions but may get you started, your mileage will vary, hope this helps.

#!/usr/bin/env python3
"""
Converts an XML file with a single outer list element
and a repeated list member element to JSON on stdout.
Processes large XML files with minimal memory using the
streaming feature of https://github.com/martinblech/xmltodict
which is required ("pip install xmltodict").

Expected input structure (element names are just examples):
  <mylist attr="a">
    <myitem name="foo"></myitem>
    <myitem name="bar"></myitem>
    <myitem name="baz"></myitem>
  </mylist>

Output:
  {
    "mylist": {
      "attr": "a",
      "myitem": [
        {
          "name": "foo"
        },
        {
          "name": "bar"
        },
        {
          "name": "baz"
        }
      ]
    }
  }
"""
import json
import os
import sys
import xmltodict


ROOT_SEEN = False


def handle_item(path, element):
    """
    Called by xmltodict on every item found at the specified depth.
    This requires a depth >= 2.
    """
    # print("path {} -> element: {}".format(path, element))
    global ROOT_SEEN
    if path is None and element is None:
        # after element n
        print(']')  # list of items
        print('}')  # outer list
        print('}')  # root
        return False
    elif ROOT_SEEN:
        # element 2..n
        print(",")
    else:
        # element 1
        ROOT_SEEN = True
        print('{')  # root
        # each path item is a tuple (name, OrderedDict)
        print('"{}"'.format(path[0][0]) + ': {')  # outer list
        # emit any root element attributes
        if path[0][1] is not None and len(path[0][1]) > 0:
            for key, value in path[0][1].items():
                print('"{}":"{}",'.format(key, value))
        # use the repeated element name for the JSON list
        print('"{}": ['.format(path[1][0]))  # list of items

    # Emit attributes and contents by merging the contents into
    # the ordered dict of attributes so the attr appear first.
    if path[1][1] is not None and len(path[1][1]) > 0:
        ordict = path[1][1]
        ordict.update(element)
    else:
        ordict = element
    print(json.dumps(ordict, indent=2))
    return True


def usage(args, err=None):
    """
    Emits a message and exits.
    """
    if err:
        print("{}: {}".format(args[0], err), file=sys.stderr)
    print("Usage: {} <xml-file-name>".format(args[0]), file=sys.stderr)
    sys.exit()


if __name__ == '__main__':
    if len(sys.argv) != 2:
        usage(sys.argv)
    xmlfile = sys.argv[1]
    if not os.path.isfile(xmlfile):
        usage(sys.argv, 'Not found or not a file: {}'.format(xmlfile))
    with open(xmlfile, 'rb') as f:
        # Set item_depth to turn on the streaming feature
        # Do not prefix attribute keys with @
        xmltodict.parse(f, item_depth=2, attr_prefix='', item_callback=handle_item)
    handle_item(None, None)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM