I'm trying to load the English Wikipedia corpus ( https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ) into Python to do some deep learning. I'm using gensim.
It's 16GB and I've got it sitting on a large EC2 machine in AWS. I load it with
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
I run this in a Jupyter notebook, but it basically hangs trying to load the file. I'm watching memory consumption, and it's loading extremely slowly (12+ hours and only ~2 GB so far). Is there any way I can speed this up?
In the past I have processed this exact same file on different servers and it never caused any considerable delay; the only difference is that I never used a Jupyter notebook for it. I would therefore dare to blame the notebook. Try running it from the command shell (or IPython) instead.
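For reference, here is roughly how I would run it as a standalone script, with gensim's INFO-level logging enabled so you can actually see parsing progress instead of only watching memory. The processes and dictionary={} arguments below are optional extras that may help (parallel tokenization, and skipping the vocabulary-building pass when you only need the streamed article texts), not something your setup requires:

# build_wiki.py -- minimal sketch, run with: python build_wiki.py
import logging
import multiprocessing

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import TaggedDocument

# INFO-level logging makes gensim report progress while it parses the dump
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

def tagged_docs(corpus):
    # stream tokenized articles as TaggedDocuments for Doc2Vec,
    # without materializing the whole corpus in memory
    for i, tokens in enumerate(corpus.get_texts()):
        yield TaggedDocument(words=list(tokens), tags=[i])

if __name__ == "__main__":
    wiki = WikiCorpus(
        "enwiki-latest-pages-articles.xml.bz2",
        processes=max(1, multiprocessing.cpu_count() - 1),  # tokenize articles in parallel
        dictionary={},  # skip building a Dictionary if you only need the article texts
    )
    for i, doc in enumerate(tagged_docs(wiki)):
        if i % 10000 == 0:
            logging.info("streamed %d articles", i)

Parsing the full dump will still take a while, but with logging on you can tell the difference between "slow" and "hung".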