I have a 20 GB bz2-compressed XML file. The format is like this:
<doc id="1" url="https://www.somepage.com" title="some page">
text text text ....
</doc>
I need to convert it to a TSV file in this format:
id<tab>url<tab>title<tab>processed_texts
What is the most efficient way to do this in Python and in Java, and how do the two differ in memory use and speed? Basically I want to do this:
read bz2 file
read the xml file element by element
for each element
    retrieve id, url, title and text
    print_to_file(id<tab>url<tab>title<tab>process(text))
Thanks for your answers in advance.
UPDATE1 (based on @Andreas's suggestions):
XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
xmlReader.nextTag();
if (! xmlReader.getLocalName().equals("doc")) {
    xmlReader.nextTag();
}
String id = xmlReader.getAttributeValue(null, "id");
String url = xmlReader.getAttributeValue(null, "url");
String title = xmlReader.getAttributeValue(null, "title");
String content = xmlReader.getElementText();
out.println(id + '\t' + content);
The problem is that I only get the first element.
UPDATE2 (I ended up doing it using regex):
if (str.startsWith("<doc")) {
    id = str.split("id")[1].substring(2).split("\"")[0];
    url = str.split("url")[1].substring(2).split("\"")[0];
    title = str.split("title")[1].substring(2).split("\"")[0];
}
else if (str.startsWith("</doc")) {
    out.println(id + '\t' + content);
    content = "";
}
else {
    content = content + " " + str;
}
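The same line-by-line idea can be sketched in Python, which the question also asks about. This is only a sketch under stated assumptions: every opening and closing doc tag sits alone on its own line, the attributes always appear in the order id, url, title, and the file is UTF-8. The names DOC_OPEN, iter_docs_by_line and convert_by_line are made up for illustration. Unlike an XML parser, this approach leaves entities such as &amp;amp; undecoded.

```python
import bz2
import re

# Assumption: the opening tag has exactly these attributes, in this order.
DOC_OPEN = re.compile(r'<doc id="([^"]*)" url="([^"]*)" title="([^"]*)">')


def iter_docs_by_line(lines):
    """Yield (id, url, title, text) from an iterable of text lines."""
    header, body = None, []
    for raw in lines:
        line = raw.strip()
        if header is None:
            match = DOC_OPEN.match(line)
            if match:
                header = match.groups()
                body = []
        elif line == "</doc>":
            # Join once per document instead of concatenating per line.
            yield (*header, " ".join(body))
            header = None
        elif line:
            body.append(line)


def convert_by_line(src_path, dst_path):
    # bz2.open in text mode decompresses lazily, one line at a time.
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as out:
        for doc_id, url, title, text in iter_docs_by_line(src):
            out.write(f"{doc_id}\t{url}\t{title}\t{text}\n")
```

Collecting the lines in a list and joining them once per document avoids the repeated string copying that `content = content + " " + str` causes on long documents.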
Note: The answer below works well for parsing very large BZ2-compressed XML documents. However, OP's XML file is not well-formed, since there is no root element, i.e. it is an XML fragment.
The built-in StAX parser does not support XML fragments. The Woodstox XML processor supposedly does, according to this answer: Parsing multiple XML fragments with STaX.
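The root-element problem is not specific to Java. As a small illustration in Python (the other language the question asks about), the standard library parser rejects the same kind of fragment and accepts it once it is wrapped in a single root. The helper name parses and the root tag name are arbitrary choices for this sketch.

```python
import xml.etree.ElementTree as ET

# Two top-level elements, as in the question's file: a fragment, not a document.
fragment = '<doc id="1">a</doc>\n<doc id="2">b</doc>\n'


def parses(xml_text):
    """Return True if xml_text is accepted as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(fragment))                         # the second top-level element is rejected
print(parses("<root>" + fragment + "</root>"))  # a single synthetic root makes it parse
```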
Java Answer
As answered in this question (Uncompress BZIP2 archive), you need Apache Commons Compress to read BZ2 files, because the JDK has no built-in bzip2 support.
You would then use the built-in StAX parser:
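import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// The code below belongs in a method that declares IOException and XMLStreamException.
File xmlFile = new File("input.xml.bz2");
File textFile = new File("output.txt");
try (InputStream in = new BZip2CompressorInputStream(new FileInputStream(xmlFile));
     PrintWriter out = new PrintWriter(new FileWriter(textFile))) {
    XMLInputFactory factory = XMLInputFactory.newFactory();
    XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
    try {
        xmlReader.nextTag(); // Read root element, ignore it
        if (xmlReader.getLocalName().equals("doc"))
            throw new IllegalArgumentException("Expected root element, found <doc>");
        // Each nextTag() moves to the next <doc>, or to the end tag of the root
        while (xmlReader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if (! xmlReader.getLocalName().equals("doc"))
                throw new IllegalArgumentException("Expected <doc>, found <" + xmlReader.getLocalName() + ">");
            String id = xmlReader.getAttributeValue(null, "id");
            String url = xmlReader.getAttributeValue(null, "url");
            String title = xmlReader.getAttributeValue(null, "title");
            String content = xmlReader.getElementText();
            // process content value (e.g. replace tabs and newlines so each record stays on one line)
            out.println(id + '\t' + url + '\t' + title + '\t' + content);
        }
    } finally {
        xmlReader.close();
    }
}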
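This is fast and has a low memory footprint, because both decompression and parsing are streamed: only the current <doc> element is held in memory at a time.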
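For the Python half of the question, a comparable streaming approach can be sketched with the standard library alone. This is a hedged sketch, not a benchmarked solution: it assumes the file is UTF-8, has no XML declaration, and that the text inside each doc element is properly escaped (a raw & or < would raise a ParseError). Because the file has no root element, the sketch feeds a synthetic root tag to the parser before the real data. The names process, iter_docs, convert and chunk_size are made up for illustration, and process is only a placeholder for the real text processing.

```python
import bz2
import xml.etree.ElementTree as ET


def process(text):
    # Hypothetical placeholder: collapse whitespace so a record stays on one TSV line.
    return " ".join(text.split())


def iter_docs(path, chunk_size=1 << 20):
    """Yield (id, url, title, text) for each <doc> in a bz2-compressed fragment file."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(b"<root>")  # synthetic root, since the file itself has none
    root = None
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)  # decompressed bytes, one chunk at a time
            if not chunk:
                break
            parser.feed(chunk)
            for event, elem in parser.read_events():
                if event == "start" and root is None:
                    root = elem  # the first start event belongs to the synthetic root
                elif event == "end" and elem.tag == "doc":
                    yield (elem.get("id"), elem.get("url"),
                           elem.get("title"), elem.text or "")
                    root.clear()  # detach finished elements so memory stays flat
    parser.feed(b"</root>")
    parser.close()  # raises ParseError if the input ended inside an element


def convert(src_path, dst_path):
    with open(dst_path, "w", encoding="utf-8", newline="\n") as out:
        for doc_id, url, title, text in iter_docs(src_path):
            out.write(f"{doc_id}\t{url}\t{title}\t{process(text)}\n")
```

Calling convert("input.xml.bz2", "output.tsv") would write one line per doc element. As in the Java version, only the current element is kept in memory. Which of the two is faster on a 20 GB file is something to measure rather than assume.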