
Can I speed up loading XML bz2 files into memory?

I'm trying to pull the English Wikipedia corpus ( https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ) into Python to perform some deep learning. I'm using gensim.

It's 16GB and I've got it sitting on a large EC2 machine in AWS. I load it with

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")

I run this in a Jupyter notebook, but it's basically hung trying to load the file. I'm watching memory consumption and it's loading extremely slowly (12+ hours and only ~2 GB). Any way I can speed this up?

In the past I have processed this exact same file on different servers and it never caused any considerable delay; the only difference is that I never used a Jupyter notebook for it. I would therefore be inclined to blame the notebook. Try running it from the command shell (or IPython) instead.
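
If it helps, here is a minimal sketch of the same load as a standalone script run from the shell, with gensim's INFO-level logging enabled so you can see whether parsing is actually advancing. The script name and the saved dictionary file are placeholders for illustration, not part of your setup.

# wiki_load.py -- minimal sketch: the same load as a plain script,
# run with `python wiki_load.py` from the shell instead of the notebook.
import logging

from gensim.corpora.wikicorpus import WikiCorpus

# Turn on gensim's INFO-level logging so its periodic progress messages
# (documents added to the dictionary, articles processed) are visible.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

if __name__ == "__main__":
    # Building the corpus scans the entire 16 GB dump once.
    wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
    # Save the vocabulary so a later run can reuse it instead of rescanning.
    wiki.dictionary.save("wiki.dict")

With logging on, a stalled run is easy to tell apart from one that is simply slow.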
