
How to build a dataset from a large text file without getting a memory error?

I have a text file larger than 7.02 GB. I have already built a tokenizer based on this text file. I want to build a dataset like so:

from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data.txt", block_size=128,)

Since my data is very large, a memory error occurs. This is the source code:

with open(file_path, encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
print(batch_encoding)
self.examples = batch_encoding["input_ids"]
self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]

Supposing that my text file has only 4 lines, the following will be printed:

{'input_ids': [[49, 93, 1136, 1685, 973, 363, 72, 3130, 16502, 18], [44, 73, 1685, 279, 7982, 18, 225], [56, 13005, 1685, 4511, 3450, 18], [56, 19030, 1685, 7544, 18]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

I have changed the source code as follows so that the memory error doesn't appear:

for line in open(file_path, encoding="utf-8"):
    if (len(line) > 0 and not line.isspace()):
        new_line = line.split()

        batch_encoding = tokenizer(new_line, add_special_tokens=True, truncation=True, max_length=block_size)
        print(batch_encoding)
        print(type(batch_encoding))
        self.examples = batch_encoding["input_ids"]
        self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]
print(batch_encoding)

However, the following will be printed instead (each line is now split into words, so every word is tokenized as a separate sequence):

{'input_ids': [[49, 93], [3074], [329], [2451, 363, 72, 3130, 16502, 18]], 'token_type_ids': [[0, 0], [0], [0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1], [1], [1], [1, 1, 1, 1, 1, 1]]}
<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [[44, 73], [329], [69], [23788, 18]], 'token_type_ids': [[0, 0], [0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1], [1, 1]]}
<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [[56, 13005], [329], [7522], [7958, 18]], 'token_type_ids': [[0, 0], [0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1], [1, 1]]}
<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [[56, 19030], [329], [11639, 18]], 'token_type_ids': [[0, 0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1, 1]]}
{'input_ids': [[56, 19030], [329], [11639, 18]], 'token_type_ids': [[0, 0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1, 1]]}

How can I change the source code so that I can read the large text file line by line, but still get the same output as before without a memory error?

You can create a dictionary storing the byte offset of each line of the .txt file:

offset_dict = {}

with open(large_file_path, 'rb') as f:
    f.readline()  # move past the header line
    for line in range(number_of_lines):
        offset = f.tell()
        offset_dict[line] = offset
        f.readline()  # read the line so f.tell() advances to the next one
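
Note that number_of_lines is not defined above; assuming the file really starts with a single header line (which the readline() call skips), it can be computed in one streaming pass, for example:

# Sketch: count the data lines once so that `number_of_lines` is defined
# for the offset-building loop above. Assumes exactly one header line.
with open(large_file_path, 'rb') as f:
    number_of_lines = sum(1 for _ in f) - 1  # subtract the header line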

and then implement your own __getitem__ method in a PyTorch Dataset, looking each line up by its offset (the dataset can then be accessed through a DataLoader):

from torch.utils.data import Dataset


class ExampleDataset(Dataset):
    def __init__(self, large_file_path, offset_dict):
        self.large_file_path = large_file_path
        self.offset_dict = offset_dict

    def __len__(self):
        return len(self.offset_dict)

    def __getitem__(self, idx):
        # Jump straight to the requested line via its precomputed byte offset,
        # so only one line is ever held in memory.
        offset = self.offset_dict[idx]
        with open(self.large_file_path, 'r', encoding='utf-8') as f:
            f.seek(offset)
            line = f.readline()
            return line
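
As a minimal sketch of how this ties back to the question (not spelled out above), the dataset can be wrapped in a DataLoader whose collate function tokenizes only the raw lines of the current mini-batch, so the whole file is never tokenized at once. It assumes the tokenizer and block_size from the question, and that the tokenizer defines a padding token:

from torch.utils.data import DataLoader

def collate_fn(lines):
    # Tokenize just this mini-batch of raw lines, padding to the longest one.
    return tokenizer(lines, add_special_tokens=True, truncation=True,
                     max_length=block_size, padding=True, return_tensors="pt")

dataset = ExampleDataset("data.txt", offset_dict)
loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)

for batch in loader:
    input_ids = batch["input_ids"]  # shape: (batch_size, padded_sequence_length)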
