Parse out XML from “unstructured” plain text

Question

I am consuming a large text file from a publishing system. It is structured as follows:

-- File header
-- File Attribute 1
-- File Attribute 2

<xml>File summary</xml>

-- Record header
-- Record attribute 1

<xml>Record1</xml>

-- Record 1 header
-- Record attribute 1

<xml>Record1</xml>

-- Record 2 header
-- Record attribute 1

<xml>Record2</xml>

-- Record n header
-- Record attribute 1

<xml>Recordn</xml>

There can be hundreds of thousands of records in a file, and each XML document is a large structure on a single line. A line can be hundreds of thousands of characters long.

First up, yes, it's bonkers - my first task is to go back to the publishing system and explain how XML works! ;) In the meantime, I need a way of stripping out the XML and building a structured output file:

<xml>
    <header/>
    <listofrecords>
        <record1/>
        <record2/>
        <recordn/>
    </listofrecords>
</xml>

Note that I have no interest in the contents of the text headers.

I'm struggling to understand the quickest and most maintainable way to do this.

My thoughts are to use Java and a BufferedReader to parse the input file line by line. Where I encounter an XML tag, I read to the closing XML tag and add to an output file structure.

Is there a faster way to do this? Could RegEx help identify the text that I need to extract into the new format?

Sorry that this is quite an open-ended question, and I'd understand if it's not quite in scope for Stack Overflow. Any thoughts greatly appreciated, though.

Answer 1

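I would use a Perl script:

#!/usr/bin/perl
use strict;
use warnings;

print "<xml>\n";
# Skip ahead to the first "-- File" line and use its text for the header.
while (my $line = <>) {
    if ($line =~ m!-- File (.*)!) {
        print "    <header $1/>\n";
        print "    <listofrecords>\n";
        last;
    }
}
# Every remaining <xml>...</xml> line becomes one record element.
while (my $line = <>) {
    if ($line =~ m!<xml>(.*)</xml>!) {
        print "        <$1/>\n";
    }
}
print "    </listofrecords>\n";
print "</xml>\n";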
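Since the question mentions Java and a BufferedReader, the same single-pass idea could be sketched in Java roughly as follows. This is only a sketch under a few assumptions that the question does not confirm: every XML fragment sits on one line that starts with `<`, header lines never start with `<`, the fragments carry no XML declaration, and the first fragment is the file summary. The class name `XmlLineExtractor` and the choice to wrap the summary in a `<header>` element are made up for illustration. Under these assumptions a plain prefix check may be enough, so no regex is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper: streams the input once and keeps only the XML lines.
class XmlLineExtractor {

    // Copies every line that starts with '<' into a wrapper document.
    // Assumption: one XML fragment per line; header lines never start with '<'.
    static void extract(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean first = true;
        out.write("<xml>\n");
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("<")) {
                continue; // text header or blank line: ignored
            }
            if (first) {
                // Assumption: the first fragment is the file summary.
                out.write("    <header>" + trimmed + "</header>\n");
                out.write("    <listofrecords>\n");
                first = false;
            } else {
                out.write("        " + trimmed + "\n");
            }
        }
        if (first) {
            // No fragment was found; still emit a well-formed skeleton.
            out.write("    <listofrecords>\n");
        }
        out.write("    </listofrecords>\n</xml>\n");
    }

    public static void main(String[] args) throws IOException {
        String sample = "-- File header\n-- File Attribute 1\n\n"
                + "<summary>File summary</summary>\n\n"
                + "-- Record 1 header\n\n"
                + "<record>1</record>\n";
        StringWriter out = new StringWriter();
        extract(new StringReader(sample), out);
        System.out.print(out);
    }
}
```

For a real file, the `StringReader` and `StringWriter` would be swapped for `Files.newBufferedReader` and `Files.newBufferedWriter`, so that only one line is held in memory at a time.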

Answer 2

You could consider using a DOM parser. If you are dealing with one such large file, surround its contents with a root tag to make it well-formed XML, such as:

    <top>
        ...file contents...
    </top>


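    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    String xmlPath = "C:/test/xml/publishing_file.xml";
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // parse(String) expects a URI, so pass a File for a filesystem path
    Document dom = builder.parse(new File(xmlPath));

    NodeList nl = dom.getDocumentElement().getChildNodes();
    for (int i = 0; i < nl.getLength(); i++) {
        // ...the child nodes are the <xml> elements, with the header text between them as text nodes
    }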

This is a bit easier than parsing each line.
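The wrapping step does not have to modify the file on disk. One possible sketch, with the hypothetical class name `WrappedDomReader`, uses `SequenceInputStream` to present `<top>`, the file contents, and `</top>` as a single stream to the DOM parser. It assumes the header text contains no `<` or `&` characters and that the embedded fragments carry no XML declaration; either would make the wrapped stream fail to parse. A DOM also holds the whole document in memory, which may matter for a file with hundreds of thousands of records.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: wraps a stream in a synthetic <top> root and parses it.
class WrappedDomReader {

    // Presents "<top>" + content + "</top>" as one stream, leaving the source untouched.
    static InputStream wrap(InputStream content) {
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream("<top>".getBytes(StandardCharsets.UTF_8)),
                content,
                new ByteArrayInputStream("</top>".getBytes(StandardCharsets.UTF_8)));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Returns the names of the element children of the synthetic root, in document order.
    static List<String> childElementNames(InputStream content) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(wrap(content));
        NodeList nodes = dom.getDocumentElement().getChildNodes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) { // skip the header text nodes
                names.add(node.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String sample = "-- File header\n\n<summary/>\n\n-- Record header\n\n<record1/>\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(childElementNames(in));
    }
}
```

With a real file, the `ByteArrayInputStream` in `main` would be replaced by something like `Files.newInputStream(path)`.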

2 answers

solution1
1 ACCEPTED 2015-02-03 16:31:47

solution2
0 2015-02-03 16:28:46
