
Splitting a file dramatically increases its size

I'm breaking up the 4 GB Wiktionary XML data dump into smaller files, with no overlap, processing it with Python and saving distinct pages (...).

The same information, split across separate files, balloons to 18+ GB.

Why might this be? And is there a way to avoid this?

import os

# Create the output directory (os.makedirs is portable, unlike shelling out to mkdir)
os.makedirs('WIKTIONARY_WORDS_DUMP', exist_ok=True)

# English Wiktionary (which nonetheless contains many foreign words!)
f = open('enwiktionary-20151020-pages-articles.xml', 'r', encoding='utf-8')

page = False
number = 1
for l in f:

    if '<page>' in l:
        # Open in write mode so re-running the script doesn't append duplicates
        word_file = open(os.path.join('WIKTIONARY_WORDS_DUMP', str(number) + '.xml'),
                         'w', encoding='utf-8')
        word_file.write(l)
        page = True
        number += 1

    elif '</page>' in l:
        word_file.write(l)
        word_file.close()
        page = False

    elif page:
        word_file.write(l)

# Close the last file only if the dump ended mid-page
if page:
    word_file.close()
f.close()

Are the smaller files also saved as XML, with the same tag hierarchy? If so, you're bound to have some tag repetition.

i.e., if you were to split this file:

<root>
    <item>abc</item>
    <item>def</item>
    <item>ghi</item>
</root>

into three separate files:

<root>
    <item>abc</item>
</root>

<root>
    <item>def</item>
</root>

<root>
    <item>ghi</item>
</root>

The <root> tag is repeated in each smaller file.
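
To get a sense of how that repetition scales, here's a back-of-envelope sketch; the wrapper string and file count below are hypothetical, not taken from the question:

# Hypothetical wrapper tags repeated in every split file
wrapper = '<root>\n</root>\n'
num_files = 5_000_000  # assumed: roughly one output file per Wiktionary page

# Every output file pays for the wrapper again, so the overhead
# grows linearly with the number of files
overhead = len(wrapper.encode('utf-8')) * num_files
print(f'repeated wrapper tags alone: ~{overhead / 1024**2:.0f} MiB')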

It gets worse if your data schema is more complex:

<root>
    <level1>
        <level2>
            <level3>
                <item>abc</item>
            </level3>
        </level2>
    </level1>
</root>
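
With deep nesting like that, every split file has to repeat the entire tag hierarchy. One way to keep the overhead down (a sketch of my own, not part of the original code; the <pages> wrapper and batch size are assumptions) is to batch many pages per output file, so the repeated tags are paid once per batch instead of once per page:

import os

os.makedirs('WIKTIONARY_WORDS_DUMP', exist_ok=True)

BATCH_SIZE = 1000  # assumed batch size; tune for your filesystem
number = 1
count = 0
out = None
in_page = False

with open('enwiktionary-20151020-pages-articles.xml', 'r', encoding='utf-8') as f:
    for l in f:
        if '<page>' in l:
            if out is None:
                # One shared wrapper per batch instead of one per page
                out = open(os.path.join('WIKTIONARY_WORDS_DUMP', str(number) + '.xml'),
                           'w', encoding='utf-8')
                out.write('<pages>\n')
            in_page = True

        if in_page:
            out.write(l)

        if '</page>' in l:
            in_page = False
            count += 1
            if count >= BATCH_SIZE:
                out.write('</pages>\n')
                out.close()
                out = None
                count = 0
                number += 1

if out is not None:
    # Finish a trailing, partially filled batch
    out.write('</pages>\n')
    out.close()

As a side benefit, writing fewer, larger files also wastes less space on per-file filesystem allocation, which adds up quickly with millions of tiny files.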
