
Loading / Streaming 8GB txt file? And tokenize

I have a pretty large file (about 8 GB). I have already read these posts: How to read a large file line by line, Tokenizing large (>70MB) TXT file using Python NLTK, and Concatenation & write data to stream errors.

But this still doesn't do the job: when I run my code, my PC gets stuck. Am I doing something wrong?

I want to get all the words into a list (i.e. tokenize them). Also, doesn't the code read each line and tokenize only that line? Might that prevent the tokenizer from tokenizing words properly, since some words (and sentences) do not end after just one line?

I considered splitting it up into smaller files, but wouldn't that still exhaust my RAM, since I only have 8 GB of RAM and the list of words will probably end up about as big (8 GB) as the initial txt file?

import os
import nltk

word_list = []
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'r', encoding="utf-8") as t:
    for line in t.readlines():
        word_list += nltk.word_tokenize(line)
        number = number + 1
        print(number)

By using the following line:

for line in t.readlines():
    # do the things

You are forcing Python to read the whole file with t.readlines(): it returns a list of strings representing the entire file, which brings the whole file into memory at once.

Instead, if you follow the example from the post you linked:

for line in t:
    # do the things

Python will process the file line by line, like you want: the file object acts like a generator, yielding one line at a time.


After looking at your code again, I see that you are constantly appending to the word list with word_list += nltk.word_tokenize(line). This means that even if you do read the file one line at a time, you still retain all of that data in memory after the reader has moved on. You will likely need a better way of doing whatever this is for, because the tokens are never dropped from memory and you will still consume massive amounts of it.


For data this large, you will have to either

  • find a way to store an intermediate version of your tokenized data (see the sketch just below this list), or
  • design your code so that it handles only one, or just a few, tokenized words at a time.
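For the first option, a minimal sketch (my addition, with hypothetical file names) could stream the tokens straight into an intermediate file instead of a list, so that only the current line's tokens are ever held in memory:

import os
import nltk

def tokenize_to_file(in_path, out_path):
    # Write one token per line to out_path without keeping tokens in memory.
    with open(in_path, 'r', encoding="utf-8") as src, \
         open(out_path, 'w', encoding="utf-8") as dst:
        for line in src:                        # read one line at a time
            for word in nltk.word_tokenize(line):
                dst.write(word + "\n")          # token goes to disk, not to a list

# 'tokens.txt' is just a placeholder output name for illustration
tokenize_to_file(os.path.join(save_path, 'alldata.txt'),
                 os.path.join(save_path, 'tokens.txt'))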

For the second option, something like this might do the trick:

import os
import nltk

def enumerated_tokens(filepath):
    # Yield (index, word) pairs one at a time instead of building a list.
    index = 0
    with open(filepath, 'r', encoding="utf-8") as t:
        for line in t:
            for word in nltk.word_tokenize(line):
                yield (index, word)
                index += 1

for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    print(index, word)
    # Do the thing with your word.

Notice how this never actually stores the words anywhere. That doesn't mean you can't temporarily store anything, but if you're memory constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.
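As a usage sketch on top of the generator (my addition, not part of the original answer), a bounded aggregate such as a word-frequency count only grows with the number of distinct tokens, not with the size of the file:

from collections import Counter

word_counts = Counter()
for _, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    word_counts[word] += 1      # memory grows with distinct words only

print(word_counts.most_common(10))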
