Can I speed up loading xml bz2 files into memory?

Original post: 2017-06-12 19:05:05 6 1 python/ deep-learning/ gensim

I am trying to get the English Wikipedia corpus ( https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ) into Python for some deep learning. I am using gensim.

It's 16GB, and I have it sitting on a large EC2 machine in AWS. I load it with

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")

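I'm running this in a Jupyter notebook, but it basically hangs when it tries to load the corpus. I'm watching memory consumption and it is loading very slowly (12+ hours and only ~2 GB). Is there any way I can speed this up?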
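One way to narrow down a slowdown like this is to check how fast the machine can decompress the dump at all, independent of gensim and of the notebook. The following is a minimal sketch using only the standard library; the function name `measure_bz2_throughput` and its parameters are hypothetical and not part of the original post.

```python
import bz2
import time


def measure_bz2_throughput(path, max_bytes=256 * 1024 * 1024, chunk_size=1024 * 1024):
    """Decompress up to max_bytes from a .bz2 file.

    Returns (uncompressed_bytes_read, elapsed_seconds).
    """
    bytes_read = 0
    start = time.perf_counter()
    with bz2.open(path, "rb") as f:
        while bytes_read < max_bytes:
            # Never ask for more than the remaining budget
            chunk = f.read(min(chunk_size, max_bytes - bytes_read))
            if not chunk:  # reached end of file
                break
            bytes_read += len(chunk)
    elapsed = time.perf_counter() - start
    return bytes_read, elapsed


# Example usage (assumes the dump sits in the current directory):
# n, secs = measure_bz2_throughput("enwiki-latest-pages-articles.xml.bz2")
# print(f"{n / 1e6:.1f} MB decompressed in {secs:.1f} s")
```

If the sample decompresses at a reasonable rate, plain bz2 reading is probably not what is stalling, and the time is more likely being spent in how the corpus object is built or in the environment it runs in.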

1 Answer

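I have processed this exact same file on different servers in the past and it never caused any noticeable delay. The only difference is that I never used a Jupyter notebook for it, so I would dare to blame the notebook. Maybe try it from a command shell (or IPython).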
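To try this outside the notebook, something like the sketch below could be saved to a file and called from a plain Python or IPython session. It assumes gensim is installed; `stream_wiki_articles` and `max_articles` are hypothetical names made up for illustration. It also passes `dictionary={}`, which is not in the original question: as far as I understand the gensim documentation, `WikiCorpus` scans the entire dump in its constructor to build a vocabulary when no dictionary is given, and that pass alone can take a very long time. Whether skipping it is acceptable depends on whether the dictionary is needed later.

```python
import itertools
import time


def stream_wiki_articles(dump_path, max_articles=1000):
    """Tokenize up to max_articles articles from a Wikipedia dump.

    Returns the number of articles actually read.
    """
    # Imported here so this file can be loaded even where gensim is missing
    from gensim.corpora.wikicorpus import WikiCorpus

    # Assumption: an explicit empty dictionary makes the constructor return
    # immediately instead of scanning the whole dump to build a vocabulary.
    wiki = WikiCorpus(dump_path, dictionary={})

    start = time.perf_counter()
    count = 0
    # get_texts() yields one token list per article, streamed from the dump
    for _tokens in itertools.islice(wiki.get_texts(), max_articles):
        count += 1
    elapsed = time.perf_counter() - start
    print(f"read {count} articles in {elapsed:.1f} s")
    return count


# Example usage from a shell session:
# stream_wiki_articles("enwiki-latest-pages-articles.xml.bz2", max_articles=1000)
```

Timing a small sample this way in the shell and again in the notebook gives a direct comparison of the two environments without waiting for the full 16GB file.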


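Notice: Technical posts on this site follow the CC BY-SA 4.0 license. If you need to republish, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.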
