I'm breaking the 4 GB Wiktionary XML data dump into smaller files, with no overlap, processing it with Python and saving each distinct page (...).
The same information, split across different files, balloons to 18+ GB.
Why might this be? And is there a way to avoid it?
import os

os.makedirs('WIKTIONARY_WORDS_DUMP', exist_ok=True)

# English Wiktionary (which nonetheless contains many foreign words!)
f = open('enwiktionary-20151020-pages-articles.xml', 'r')
page = False
number = 1
for l in f:
    if '<page>' in l:
        word_file = open(os.path.join('WIKTIONARY_WORDS_DUMP', str(number) + '.xml'), 'w')
        word_file.write(l)
        page = True
        number += 1
    elif '</page>' in l:
        word_file.write(l)
        word_file.close()
        page = False
    elif page:
        word_file.write(l)
f.close()
Are the smaller files also saved as XML, with the same tag hierarchy? If so, you're bound to have some tag repetition. For example, if you were to split this file:
<root>
<item>abc</item>
<item>def</item>
<item>ghi</item>
</root>
into three separate files:
<root>
<item>abc</item>
</root>
<root>
<item>def</item>
</root>
<root>
<item>ghi</item>
</root>
The <root> tag is repeated in each smaller file.
It gets worse if your data scheme is more complex:
<root>
<level1>
<level2>
<level3>
<item>abc</item>
</level3>
</level2>
</level1>
</root>
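You can see the repetition directly by splitting a toy document the naive way and comparing byte counts. This is a minimal sketch (the variable names and the toy document are made up for illustration): each per-item file pays again for the wrapper tags that appeared only once in the original.

```python
import xml.etree.ElementTree as ET

# Toy document: one root wrapping three items.
doc = "<root><item>abc</item><item>def</item><item>ghi</item></root>"

# Naive split: each item goes into its own string,
# re-wrapped in a fresh copy of the <root> tag.
root = ET.fromstring(doc)
pieces = ["<root>%s</root>" % ET.tostring(item, encoding="unicode")
          for item in root]

original_size = len(doc)
split_size = sum(len(p) for p in pieces)
# The split copies pay for "<root></root>" once per piece instead of once total.
print(original_size, split_size)
```

With deeper nesting, every enclosing tag from <root> down to the item's parent is duplicated in every piece, so the overhead grows with both the number of pieces and the depth of the hierarchy. (With millions of tiny files, filesystem allocation granularity can add further on-disk overhead on top of this.)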