I'm trying to load the English Wikipedia corpus ( https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ) into Python to do some deep learning. I'm using gensim.
It's 16GB and I've got it sitting on a large EC2 machine in AWS. I load it with
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
I run this in a Jupyter notebook, but it basically hangs trying to load the file. I'm watching memory consumption, and it's loading extremely slowly (12+ hours and only ~2 GB so far). Is there any way I can speed this up?
In the past I have processed this exact same file on different servers and it never caused any considerable delay; the only difference is that I never used a Jupyter notebook for it. I would therefore dare to blame the notebook. Try running it from the command shell (or IPython) instead.
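For reference, here is roughly how I would run it as a standalone script, with gensim's INFO-level logging enabled so you can actually see parsing progress instead of only watching memory. The processes and dictionary={} arguments below are optional extras that may help (parallel tokenization, and skipping the vocabulary-building pass when you only need the streamed article texts), not something your setup requires:

# build_wiki.py -- minimal sketch, run with: python build_wiki.py
import logging
import multiprocessing

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import TaggedDocument

# INFO-level logging makes gensim report progress while it parses the dump
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

def tagged_docs(corpus):
    # stream tokenized articles as TaggedDocuments for Doc2Vec,
    # without materializing the whole corpus in memory
    for i, tokens in enumerate(corpus.get_texts()):
        yield TaggedDocument(words=list(tokens), tags=[i])

if __name__ == "__main__":
    wiki = WikiCorpus(
        "enwiki-latest-pages-articles.xml.bz2",
        processes=max(1, multiprocessing.cpu_count() - 1),  # tokenize articles in parallel
        dictionary={},  # skip building a Dictionary if you only need the article texts
    )
    for i, doc in enumerate(tagged_docs(wiki)):
        if i % 10000 == 0:
            logging.info("streamed %d articles", i)

Parsing the full dump will still take a while, but with logging on you can tell the difference between "slow" and "hung".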