
Splitting a file dramatically increases its size

I'm breaking up the 4 GB Wiktionary XML data dump into smaller files, with no overlap, processing it with Python and saving distinct pages (...).

The same information, split across separate files, balloons to 18+ GB.

Why might this be? And is there a way to avoid this?

import os

# Create the output directory (os.makedirs is portable, unlike shelling out to mkdir)
os.makedirs('WIKTIONARY_WORDS_DUMP', exist_ok=True)

# English Wiktionary (which nonetheless contains many foreign words!)
f = open('enwiktionary-20151020-pages-articles.xml', 'r', encoding='utf-8')

page = False
number = 1
for l in f:

    if '<page>' in l:
        # Open in write mode so re-running the script doesn't append duplicates
        word_file = open(os.path.join('WIKTIONARY_WORDS_DUMP', str(number) + '.xml'),
                         'w', encoding='utf-8')
        word_file.write(l)
        page = True
        number += 1

    elif '</page>' in l:
        word_file.write(l)
        word_file.close()
        page = False

    elif page:
        word_file.write(l)

# Close the last file only if the dump ended mid-page
if page:
    word_file.close()
f.close()

Are the smaller files also saved as XML, with the same tag hierarchy? If so, you're bound to have some tag repetition.

i.e., if you were to split this file:

<root>
    <item>abc</item>
    <item>def</item>
    <item>ghi</item>
</root>

into three separate files:

<root>
    <item>abc</item>
</root>

<root>
    <item>def</item>
</root>

<root>
    <item>ghi</item>
</root>

The <root> tag is repeated in each smaller file.
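
To get a sense of how that repetition scales, here's a back-of-envelope sketch; the wrapper string and file count below are hypothetical, not taken from the question:

# Hypothetical wrapper tags repeated in every split file
wrapper = '<root>\n</root>\n'
num_files = 5_000_000  # assumed: roughly one output file per Wiktionary page

# Every output file pays for the wrapper again, so the overhead
# grows linearly with the number of files
overhead = len(wrapper.encode('utf-8')) * num_files
print(f'repeated wrapper tags alone: ~{overhead / 1024**2:.0f} MiB')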

It gets worse if your data schema is more complex:

<root>
    <level1>
        <level2>
            <level3>
                <item>abc</item>
            </level3>
        </level2>
    </level1>
</root>
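
With deep nesting like that, every split file has to repeat the entire tag hierarchy. One way to keep the overhead down (a sketch of my own, not part of the original code; the <pages> wrapper and batch size are assumptions) is to batch many pages per output file, so the repeated tags are paid once per batch instead of once per page:

import os

os.makedirs('WIKTIONARY_WORDS_DUMP', exist_ok=True)

BATCH_SIZE = 1000  # assumed batch size; tune for your filesystem
number = 1
count = 0
out = None
in_page = False

with open('enwiktionary-20151020-pages-articles.xml', 'r', encoding='utf-8') as f:
    for l in f:
        if '<page>' in l:
            if out is None:
                # One shared wrapper per batch instead of one per page
                out = open(os.path.join('WIKTIONARY_WORDS_DUMP', str(number) + '.xml'),
                           'w', encoding='utf-8')
                out.write('<pages>\n')
            in_page = True

        if in_page:
            out.write(l)

        if '</page>' in l:
            in_page = False
            count += 1
            if count >= BATCH_SIZE:
                out.write('</pages>\n')
                out.close()
                out = None
                count = 0
                number += 1

if out is not None:
    # Finish a trailing, partially filled batch
    out.write('</pages>\n')
    out.close()

As a side benefit, writing fewer, larger files also wastes less space on per-file filesystem allocation, which adds up quickly with millions of tiny files.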
