
Converting multiple xml files/links to JSON using Python?

I know how to convert a single XML file or link to JSON in Python using xmltodict. However, I was wondering whether there is an efficient way to convert multiple XML files (on the order of hundreds or even thousands) to JSON in Python. Or, instead of Python, is there another tool better suited to this? Please note that I am not a very skilled programmer and have only used Python sporadically.

It depends on the specific case you are working on.

My example case (for background):

For instance, I once had to read data from a large data set (a 1-million-word subcorpus, around 2.6 GB) consisting of 3890 directories, each of which contained an ann_morphosyntax.xml file.

A snippet from one of the ann_morphosyntax.xml files, for reference:

<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xmlns:nkjp="http://www.nkjp.pl/ns/1.0" xmlns:xi="http://www.w3.org/2001/XInclude">
 <xi:include href="NKJP_1M_header.xml"/>
 <TEI>
  <xi:include href="header.xml"/>
  <text>
   <body>
    <p corresp="ann_segmentation.xml#segm_1-p" xml:id="morph_1-p">
     <s corresp="ann_segmentation.xml#segm_1.5-s" xml:id="morph_1.5-s">
      <seg corresp="ann_segmentation.xml#segm_1.1-seg" xml:id="morph_1.1-seg">
       <fs type="morph">
        <f name="orth">
         <string>Jest</string>
        </f>

Each of those ann_morphosyntax.xml files contained one or more objects (let's call them paragraphs for simplicity), and I needed to convert each of them to JSON. Such a paragraph object starts with <p in the XML snippet above.

Additionally, I needed to keep all of those JSON objects in a single file and make that file as small as possible, so I decided to use the JSONL (JSON Lines) format. This format stores each JSON object on its own line, and each object can be written without any extra whitespace, which let me reduce the initial data set to around 450 MB.
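To make the JSONL idea concrete, here is a minimal sketch using only the standard library. The function name to_jsonl and the sample records are made up for illustration; the compact separators are what removes the spaces that json.dumps would otherwise put after commas and colons.

```python
import json

# Hypothetical sample records, shaped like the word objects used later on.
records = [
    {"word": {"id": "1", "text": "This"}},
    {"word": {"id": "2", "text": "is"}},
]


def to_jsonl(items):
    """Serialize each item as one compact JSON document per line."""
    return "".join(
        # separators=(",", ":") drops the default spaces after "," and ":".
        json.dumps(item, separators=(",", ":"), ensure_ascii=False) + "\n"
        for item in items
    )


jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

Each line can later be read back independently with json.loads, so a consumer never has to load the whole file at once.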

I implemented a solution in Python 3.6. What I did was:

  1. I used iglob to iterate through those directories and pick up the ann_morphosyntax.xml file from each of them.
  2. To parse each ann_morphosyntax.xml file, I used the ElementTree library.
  3. I saved the resulting JSON objects in an output.jsonl file.

Solution:

To try this solution yourself, do the following:

  1. Run this script to create two files, example_1.xml and example_2.xml, in the output directory under the root directory of your project:
import os
import xml.etree.ElementTree as ET


def prettify(element, indent='  '):
   queue = [(0, element)]  # (level, element)
   while queue:
       level, element = queue.pop(0)
       children = [(level + 1, child) for child in list(element)]
       if children:
           element.text = '\n' + indent * (level+1)  # for child open
       if queue:
           element.tail = '\n' + indent * queue[0][0]  # for sibling open
       else:
           element.tail = '\n' + indent * (level-1)  # for parent close
       queue[0:0] = children  # prepend so children come before siblings


def _create_word_object(sentence_object, number, word_string):
   word = ET.SubElement(sentence_object, 'word', number=str(number))
   string = ET.SubElement(word, 'string', number=str(number))
   string.text = word_string


def create_two_xml_files():
   xml_doc_1 = ET.Element('paragraph', number='1')
   xml_doc_2 = ET.Element('paragraph', number='1')
   sentence_1 = ET.SubElement(xml_doc_1, 'sentence', number='1')
   sentence_2 = ET.SubElement(xml_doc_2, 'sentence', number='1')
   _create_word_object(sentence_1, 1, 'This')
   _create_word_object(sentence_2, 1, 'This')
   _create_word_object(sentence_1, 2, 'is')
   _create_word_object(sentence_2, 2, 'is')
   _create_word_object(sentence_1, 3, 'first')
   _create_word_object(sentence_2, 3, 'second')
   _create_word_object(sentence_1, 4, 'example')
   _create_word_object(sentence_2, 4, 'example')
   _create_word_object(sentence_1, 5, 'sentence')
   _create_word_object(sentence_2, 5, 'sentence')
   _create_word_object(sentence_1, 6, '.')
   _create_word_object(sentence_2, 6, '.')
   prettify(xml_doc_1)
   prettify(xml_doc_2)
   tree_1 = ET.ElementTree(xml_doc_1)
   tree_2 = ET.ElementTree(xml_doc_2)
   os.makedirs('output', exist_ok=True)  # do not fail if the directory already exists
   tree_1.write('output/example_1.xml', encoding='UTF-8', xml_declaration=True)
   tree_2.write('output/example_2.xml', encoding='UTF-8', xml_declaration=True)


def main():
   create_two_xml_files()


if __name__ == '__main__':
   main()

  2. Then run this script, which iterates through the example_1.xml and example_2.xml files (using iglob) and creates an output.jsonl file (saved in the root directory of your project) with the data from the two XML files created in the first step:
import os
import glob
import errno
import jsonlines
import xml.etree.ElementTree as ET


class Word:
    def __init__(self, word_id, word_text):
        self.word_id = word_id
        self.word_text = word_text

    def create_word_dict(self):
        return {"word": {"id": self.word_id, "text": self.word_text}}


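def parse_xml(file_path):
    # Stream the file and react only when an element has been fully read.
    for event, element in ET.iterparse(file_path, events=("end",)):
        if element.tag == 'word':
            # element[0] is the <string> child of the <word> element.
            yield Word(element[0].get('number'), element[0].text)
            element.clear()  # release the processed subtree to save memory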


def write_dicts_from_xmls_in_directory_to_jsonlines_file(parsing_generator):
    # Match everything inside the "output" directory next to this script.
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'output', '*')
    xml_files = glob.iglob(path)
    with jsonlines.open('output.jsonl', mode='a') as writer:
        for xml_file_name in xml_files:
            try:
                with open(xml_file_name):
                    for next_word in parsing_generator(xml_file_name):
                        writer.write(next_word.create_word_dict())
            except IOError as exc:
                # Skip directories matched by the glob; re-raise anything else.
                if exc.errno != errno.EISDIR:
                    raise


def main():
    write_dicts_from_xmls_in_directory_to_jsonlines_file(parse_xml)


if __name__ == '__main__':
    main()

Each line of the output.jsonl file will contain a JSON object representing one word element from the example_1.xml and example_2.xml files generated in the first step.

You can elaborate on that example and make it more suitable for your needs.
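As one possible elaboration, the sketch below covers the layout from the background case, where a file with the same name sits in each of many subdirectories. It is only an illustration: the function names iter_words and convert_tree are made up here, it writes lines with the standard json module rather than the jsonlines package, and it assumes the word/string structure of the example files. Namespaced documents such as the TEI snippet above would need qualified tag names instead.

```python
import glob
import json
import os
import xml.etree.ElementTree as ET


def iter_words(xml_path):
    """Yield one dict per <word> element, streaming to keep memory use low."""
    for _event, element in ET.iterparse(xml_path, events=("end",)):
        if element.tag == "word":
            string_el = element.find("string")
            if string_el is not None:
                yield {"word": {"id": element.get("number"), "text": string_el.text}}
            element.clear()  # release the subtree that was just handled


def convert_tree(root_dir, file_name, out_path):
    """Write words from every root_dir/**/file_name into one JSONL file.

    Returns the number of lines written.
    """
    # "**" with recursive=True matches any depth of subdirectories.
    pattern = os.path.join(glob.escape(root_dir), "**", file_name)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for xml_path in glob.iglob(pattern, recursive=True):
            for record in iter_words(xml_path):
                line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
                out.write(line + "\n")
                count += 1
    return count
```

A call such as convert_tree("corpus", "ann_morphosyntax.xml", "output.jsonl") would be the intended usage, with "corpus" standing in for your own top-level directory. Note that the output file is opened in write mode, so rerunning replaces the previous result instead of appending to it.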

PS

The first script is based on the post "Pretty printing XML in Python".

I had the same problem.

I used this code in Python to convert many XML files in one directory:

import xmltodict
import os
import json 
path = "/home/bjorn/Nedlastinger/Doffin/1/5/"
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue

    # Everything below must stay inside the loop so that every file is converted.
    fullname = os.path.join(path, filename)

    with open(fullname, 'r') as f:
        xmlString = f.read()

    jsonString = json.dumps(xmltodict.parse(xmlString, process_namespaces=True))

    # Write the JSON next to the source file, replacing the ".xml" extension.
    with open(fullname[:-4] + ".json", 'w') as f:
        f.write(jsonString)

But it doesn't handle data types: every number is converted to a string, so you are left with the job of cleaning up the data afterwards.

I loaded everything into a NoSQL server, Couchbase. The Couchbase statement for converting a string to a number is:

UPDATE doffin SET DOFFIN_ESENDERS.FORM_SECTION.CONTRACT_AWARD.FD_CONTRACT_AWARD.AWARD_OF_CONTRACT.CONTRACT_VALUE_INFORMATION.COSTS_RANGE_AND_CURRENCY_WITH_VAT_RATE.VALUE_COST = TONUMBER(DOFFIN_ESENDERS.FORM_SECTION.CONTRACT_AWARD.FD_CONTRACT_AWARD.AWARD_OF_CONTRACT.CONTRACT_VALUE_INFORMATION.COSTS_RANGE_AND_CURRENCY_WITH_VAT_RATE.VALUE_COST);

Here, doffin is the name of my database.
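As an alternative to fixing the types in the database afterwards, the conversion could also be done in Python before the JSON is written. The helper below is only a sketch: coerce_numbers is a made-up name, and it assumes the parsed result consists of nested dicts, lists, strings and None. It converts every string that looks like a number, so fields that should stay text (for example codes with leading zeros) would need to be excluded by hand.

```python
import math


def coerce_numbers(value):
    """Recursively turn numeric-looking strings into int or float."""
    if isinstance(value, dict):
        return {key: coerce_numbers(item) for key, item in value.items()}
    if isinstance(value, list):
        return [coerce_numbers(item) for item in value]
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            pass
        try:
            number = float(value)
        except ValueError:
            return value  # not a number, keep the original string
        # "nan" and "inf" parse as floats but are not valid JSON numbers.
        return number if math.isfinite(number) else value
    return value


# Hypothetical record, shaped like a fragment of a parsed document.
parsed = {"VALUE_COST": "1250000.50", "LOTS": ["1", "2"], "NAME": "Oslo kommune"}
print(coerce_numbers(parsed))
```

In the script above, this would be applied to the parsed dictionary before it is passed to json.dumps.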
