
Processing a huge amount of text data in memory

I am trying to process ~20 GB of data on an Ubuntu system with 64 GB of RAM.

This step is part of the preprocessing pipeline that generates feature vectors for training an ML algorithm.

The original implementation (written by someone on my team) kept everything in a list. It does not scale well as we add more training data. It looks something like this:

from glob import glob
from tqdm import tqdm

all_files = glob("./Data/*.*")
file_ls = []

for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        file_ls.append(f.read())

This runs into a memory error (the process gets killed). So I thought I should try replacing the list-based approach with a trie:

trie_root = {}

def insert(word):
    # walk/extend the nested-dict trie one character at a time
    cur_node = trie_root
    for letter in word:
        if letter in cur_node:
            cur_node = cur_node[letter]
        else:
            cur_node[letter] = {}
            cur_node = cur_node[letter]
    cur_node[None] = None  # mark the end of a word

for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        for word in f.read().split():  # insert individual words, not the whole list
            insert(word)

This too gets killed. The above is demo code I wrote to capture the memory footprint of these objects. The worse part is that the list demo runs fine standalone while the trie demo gets killed, which leads me to believe that this implementation uses even more memory than the list-based one.
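For anyone reproducing this, a rough way to measure the peak memory of such a loading step is tracemalloc (a minimal sketch, not my exact measurement script; the data directory is the same as above):

import tracemalloc
from glob import glob

tracemalloc.start()

file_ls = []
for fi in glob("./Data/*.*"):
    with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
        file_ls.append(f.read())

current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()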

My goal is to write memory-efficient Python code to get around this issue.

Kindly help me solve this problem.请帮我解决这个问题。

EDIT: Responding to @Paul Hankin, the data processing involves first going over each file and substituting a generic placeholder for terms whose normalized term frequency is greater than 0.01; after that, each file is split into a list of tokens, and a vocabulary is computed taking all the processed files into consideration.
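A rough per-file sketch of that step done in a streaming fashion (assumptions: the 0.01 threshold described above, a hypothetical <COMMON> placeholder token, and collections.Counter for the term counts; this is an illustration, not the team's actual code):

from collections import Counter
from glob import glob
from tqdm import tqdm

PLACEHOLDER = "<COMMON>"  # hypothetical placeholder token
THRESHOLD = 0.01          # normalized term-frequency threshold from the question

vocabulary = set()

for fi in tqdm(glob("./Data/*.*")):
    with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
        tokens = f.read().split()

    counts = Counter(tokens)
    total = len(tokens) or 1
    processed = [PLACEHOLDER if counts[t] / total > THRESHOLD else t for t in tokens]

    # only the vocabulary is kept across files, not the file contents themselves
    vocabulary.update(processed)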

One of the simpler solutions to this problem might be to NOT store the data in a list or any other in-memory data structure. You can try writing the processed data out to a file as you read, so that only one file's contents need to be in memory at a time.
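A minimal sketch of that idea (process_text and the output path are placeholders for whatever the real preprocessing does):

from glob import glob
from tqdm import tqdm

def process_text(text):
    # placeholder for the real per-file preprocessing
    return " ".join(text.split())

with open("processed_corpus.txt", "w", encoding="utf-8") as out:
    for fi in tqdm(glob("./Data/*.*")):
        with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
            out.write(process_text(f.read()))
            out.write("\n")

This way the memory used stays roughly constant no matter how many training files you add.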
