
How to make the lines of a large text file unique

I have a text file with 34,686,770 lines. All lines are between 50 and 250 characters long, and some of them appear more than once. I want to make all of these lines unique.

I can't store all these lines in a list to make them unique. How can I do this? For example, the file contains:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.

I have to turn this into a file with only unique lines:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.

How can I do this?

Without storing all the text in memory:

with open('text.txt') as text, open('unique.txt', 'w') as output:
    seen = set()
    for line in text:
        # store only the hash of each line, not the line itself
        line_hash = hash(line)
        if line_hash not in seen:
            output.write(line)
            seen.add(line_hash)

Instead of the lines themselves, we store a hash of each line, which is much smaller. Of course, there is a chance of a hash collision, in which case this code would skip a unique line that should be included.
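
If that risk is unacceptable, a longer cryptographic digest makes collisions practically impossible. Here is a minimal sketch of the same idea using hashlib from the standard library (the file names are carried over from the snippet above):

import hashlib

# same approach, but keyed on a 16-byte MD5 digest of each line
# instead of the built-in hash(); a collision is astronomically unlikely
with open('text.txt', 'rb') as text, open('unique.txt', 'wb') as output:
    seen = set()
    for line in text:
        digest = hashlib.md5(line).digest()
        if digest not in seen:
            output.write(line)
            seen.add(digest)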

Use shell tools:

$ cat in.txt 
Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.
$ sort < in.txt | uniq
I thought the author should have used more dialogue. It reads like a history book.
Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
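
Note that GNU sort can do both steps at once with sort -u in.txt, and it handles files larger than memory by spilling to temporary files on disk. The trade-off is that the output comes out sorted rather than in the original input order.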

If you can't load the file into memory, why not split it smartly into smaller files and work on those? You only need to guarantee that identical lines end up in the same file, and you want some collisions so that you don't end up with a huge number of files.

Here is a script that takes the prefix of each sentence (this can obviously be changed) and puts the sentence in the file corresponding to that prefix.

This is actually much like a hash map, only not in memory, since your RAM cannot handle the amount of data you're trying to process.

The result is many smaller files (buckets, if you will) in which all occurrences of a line are grouped together (same prefix). The files can be de-duplicated individually and then merged into the result file.

Here is how it's done:

Initialize the program to read from the file input.txt and write to output.txt, using a prefix size of 2 to hash/split:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

Create the folder holding the split files containing similar and identical lines:

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)

A line-distributing function that appends a line to a specified file:

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

A hash function that guarantees some collisions (which is good here) and that identical lines land in the same file:

def prefix_hash(line):
    return line[:prefix_size]

Now we distribute the lines into their smaller files (like hash "buckets"):

with open(input_file_name) as f:
    # convenience method
    def putter(line):
        put_in_file(prefix_hash(line), line)

    for line in f:
        # text mode normalizes line endings to '\n'; only the very last
        # line of the file may be missing one
        putter(line if line.endswith('\n') else line + '\n')

Generate a list of the created file names:

# build the list up front; in Python 3, map() returns a single-use
# iterator, and this list is iterated over twice below
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

De-duplicate the lines in the smaller files:

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

Join the smaller files into the result file:

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())

The whole thing together:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

def prefix_hash(line):
    return line[:prefix_size]

with open(input_file_name) as f:
    # convenience method
    def putter(line):
        put_in_file(prefix_hash(line), line)

    for line in f:
        # text mode normalizes line endings to '\n'; only the very last
        # line of the file may be missing one
        putter(line if line.endswith('\n') else line + '\n')

# build the list up front; in Python 3, map() returns a single-use
# iterator, and this list is iterated over twice below
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())
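
Note that the result contains each distinct line exactly once, but not in the original input order: the buckets are concatenated in os.listdir order, and set() does not preserve the order of the lines within a bucket.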

Note: To make this a lot faster, you should keep the file handles open at all times, and probably use some threads with a queue to pass lines among them (this avoids waiting for I/O, as well as repeatedly opening and closing the files). I can add this later if anyone wants it.
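
As a rough illustration of the first point only (no threads), here is a minimal sketch, not part of the original answer, that keeps one open handle per bucket in a dict instead of reopening a file for every line; with a prefix size of 2 the number of buckets should stay well below typical open-file limits:

import os

split_folder = 'splits'
handles = {}  # one lazily created, long-lived handle per bucket file

def put_in_file(file_name, line):
    # open each bucket file once and reuse the handle afterwards
    if file_name not in handles:
        handles[file_name] = open(os.path.join(split_folder, file_name), 'a')
    handles[file_name].write(line)

# ... distribute the lines exactly as before, then close everything:
for handle in handles.values():
    handle.close()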
