How do I parse a file with XML-like structure, but with self-closing tags next to content (instead of enclosing the content)

Question

I have a file of the following structure. It's not XML, but I need to somehow make a JSON out of it.

So while I would expect the file to look like this:

<chapter>
<line> Some text which I want to grab. </line>
<line> Some more text which I want to grab. </line>
<line> Even more text which I want to grab. </line>
</chapter>

It is in fact structured like this:

<chapter>
<line /> Some text which I want to grab.
<line /> Some more text which I want to grab.
<line /> Even more text which I want to grab.
</chapter>

So the 'lines' of each chapter sit next to the self-closing line tags instead of inside them. Can you recommend a way to grab them, ideally in JavaScript / Node.js?

Answer 1

The format is well-formed XML, so you can use regular XML techniques, e.g. DOMParser, to parse the content.

You just need to be a bit clever about reading the lines: find each line element, then gather up the sibling nodes that immediately follow it and are text nodes (there should be only one, but the code below doesn't assume that).

You didn't specify the output "structure", so here's one method that outputs a nested array: the first level is chapters, and each chapter holds an array of lines.

var xml = `<chapter>
<line /> Some text which I want to grab.
<line /> Some more text which I want to grab.
<line /> Even more text which I want to grab.
</chapter>`;

var parser = new DOMParser();
var content = parser.parseFromString(xml, 'application/xml');
var chapters = content.getElementsByTagName('chapter');
// Outer reduce: one array entry per <chapter>
var obj = [].reduce.call(chapters, function(result, chapter) {
    var lines = chapter.getElementsByTagName('line');
    // Inner reduce: one string per <line />
    result.push([].reduce.call(lines, function(result, line) {
        var text = '';
        // Gather the text nodes (nodeType 3) that directly follow this <line />
        for (var node = line.nextSibling; node && node.nodeType === 3; node = node.nextSibling) {
            text += node.nodeValue;
        }
        result.push(text);
        return result;
    }, []));
    return result;
}, []);
console.log(JSON.stringify(obj));
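
The sibling-walking step can also be pulled out into a small helper, which makes it easy to try without a browser. The sketch below is only an illustration: textAfter and linkSiblings are hypothetical names, the plain objects are stand-ins that carry just the DOM properties the loop reads (nodeType, nodeValue, nextSibling), and the trim() call is an addition that the code above does not make.

```javascript
// Hypothetical helper: concatenate the text nodes that directly follow `node`.
var TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in the DOM

function textAfter(node) {
    var text = '';
    for (var sib = node.nextSibling; sib && sib.nodeType === TEXT_NODE; sib = sib.nextSibling) {
        text += sib.nodeValue;
    }
    // Assumption: surrounding whitespace is unwanted; drop trim() to keep it.
    return text.trim();
}

// Stand-ins for DOM nodes (assumption: only the properties used above are needed).
function linkSiblings(nodes) {
    nodes.forEach(function (n, i) { n.nextSibling = nodes[i + 1] || null; });
    return nodes;
}

var siblings = linkSiblings([
    { nodeType: 1, nodeName: 'line' },
    { nodeType: 3, nodeValue: ' Some text which I want to grab.\n' },
    { nodeType: 1, nodeName: 'line' },
    { nodeType: 3, nodeValue: ' Some more text which I want to grab.\n' }
]);

// Keep the element nodes (nodeType 1) and read the text after each one.
var lines = siblings
    .filter(function (n) { return n.nodeType === 1; })
    .map(textAfter);

console.log(JSON.stringify(lines));
```

With real DOM nodes, the same helper could be called on each element returned by getElementsByTagName('line').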

Addressing the comments, first some documentation:

DOMParser documentation

Array#reduce documentation

Function#call documentation

Now, to explain [].reduce.call(array, fn) in this code:

[].reduce.call is shorthand for Array.prototype.reduce.call

getElementsByTagName returns an HTMLCollection, which behaves like an array but isn't one. There are several ways to make an array out of an HTMLCollection. The most primitive:

var array = [];
for(var i = 0; i < collection.length; i++) {
    array[i] = collection[i];
}

or

var array = Array.prototype.slice.call(collection);

or, in ES2015+ (not available in IE unless you polyfill it; see the documentation):

var array = Array.from(collection);

However, using the .call method on [].reduce allows the first argument (the this argument) to be any array-like object (anything with a length property and indexed elements), not just a real array. So it works just like calling array.reduce(fn) on the array from above: it's a way to treat the HTMLCollection like an array without the need for an intermediate variable.
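
To see this outside the DOM, here is a minimal sketch that borrows reduce for a plain array-like object. The collection object is a made-up stand-in for an HTMLCollection, used only so the snippet runs anywhere, including Node.js.

```javascript
// Made-up stand-in for an HTMLCollection: indexed properties plus length.
var collection = { 0: 'a', 1: 'b', 2: 'c', length: 3 };

// The object has no reduce method of its own, because it is not an Array.
console.log(typeof collection.reduce); // "undefined"

// [].reduce is the very same function as Array.prototype.reduce.
console.log([].reduce === Array.prototype.reduce); // true

// Borrow reduce and run it with `collection` as the this argument.
var joined = [].reduce.call(collection, function (result, item) {
    result.push(item.toUpperCase());
    return result;
}, []);

// Same result by converting to a real array first.
var viaArray = Array.from(collection).reduce(function (result, item) {
    result.push(item.toUpperCase());
    return result;
}, []);

console.log(JSON.stringify(joined));   // ["A","B","C"]
console.log(JSON.stringify(viaArray)); // ["A","B","C"]
```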

1 answer

solution1
2 ACCEPTED 2017-03-23 22:11:57