How to process 100 million+ text lines at once

Question

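I have this code that reads and processes a text file line by line. The problem is that my text file has 1.5 to 2 billion lines and it takes forever. Is it possible to process more than 100 million lines at the same time?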

from cryptotools.BTC.HD import check, WORDS



with open("input.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                print(mnemonic)
                with open("print.txt", "a") as i:
                    i.write(mnemonic)
                    i.write("\n")

The input file has lines like the following:

gloom document {x} stomach uncover peasant sock minor decide special roast rural
happy seven {x} gown rally tennis yard patrol confirm actress pledge luggage
tattoo time {x} other horn motor symbol dice update outer fiction sign
govern wire {x} pill valid matter tomato scheme girl garbage action pulp
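
To make the inner loop concrete: each input line is a template with an `{x}` placeholder, and `str.format(x=word)` produces one candidate phrase per word. A minimal sketch, using a three-word stand-in list (`SAMPLE_WORDS` and `expand` are hypothetical names; the real `WORDS` comes from `cryptotools`):

```python
# Hypothetical stand-in for cryptotools' WORDS, kept tiny for illustration.
SAMPLE_WORDS = ["abandon", "ability", "able"]

def expand(template, words):
    """Return one candidate phrase per word by filling the {x} placeholder."""
    return [template.format(x=word) for word in words]

template = "gloom document {x} stomach uncover peasant sock minor decide special roast rural"
candidates = expand(template, SAMPLE_WORDS)
print(len(candidates))  # one candidate per word
print(candidates[0])
```

The loop therefore calls `check` once per (line, word) pair, so the total work grows with the number of lines multiplied by the length of the word list.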

Answer 1

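To process 100 million lines at once you would need 100 million threads. Another way to speed up your code is to split the work among several threads (fewer than 100 million).
Because file reads and writes are not asynchronous, it is better to read the whole file at the start of the program and write out the processed data at the end. In the code below I assume you don't care about the order in which the lines are written. If the order does matter, you can use a dictionary keyed by the position of the line each thread is working on, and sort by that key at the end.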

import concurrent.futures as cf

N_THREADS = 20
result = []

def doWork(data):
    for line in data:
        # do what you have to do with the line here;
        # as a placeholder, the stripped line is kept unchanged
        mnemonic = line.strip()
        result.append(mnemonic)

# read the whole file up front
with open("input.txt", "r") as m_input:
    lines = [line for line in m_input]

# the data for the threads will be here
# as a list of rows for each thread
m_data = {i: [] for i in range(N_THREADS)}
for n, l in enumerate(lines):
    m_data[n % N_THREADS].append(l)
'''
If you have to trim the number of threads uncomment these lines
m_data = {k: v for k, v in m_data.items() if len(v) != 0}
N_THREADS = min(N_THREADS, len(m_data))
if N_THREADS == 0:
    exit()
'''
# leaving the with block waits for all submitted tasks to finish
with cf.ThreadPoolExecutor(max_workers=N_THREADS) as tp:
    for d in m_data.keys():
        tp.submit(doWork, m_data[d])

# work done
with open("print.txt", "w") as output:
    for item in result:
        output.write(f"{item}\n")

Change the number of threads to whatever you find most efficient.
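
If the output order matters, the dictionary idea mentioned above can be sketched as follows. Each worker stores its result under the line's index, and the results are read back in index order at the end. The names `process_line`, `do_work_indexed` and `run_ordered` are hypothetical, and `process_line` is only a placeholder for the real work.

```python
import concurrent.futures as cf

def process_line(line):
    # Placeholder for the real per-line work (hypothetical).
    return line.strip().upper()

def do_work_indexed(indexed_lines, results):
    # Store each result under its original line index.
    for index, line in indexed_lines:
        results[index] = process_line(line)

def run_ordered(lines, n_threads=4):
    results = {}
    indexed = list(enumerate(lines))
    # Round-robin split of the lines across the threads.
    chunks = [indexed[i::n_threads] for i in range(n_threads)]
    with cf.ThreadPoolExecutor(max_workers=n_threads) as tp:
        futures = [tp.submit(do_work_indexed, chunk, results) for chunk in chunks]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    # Sort by index to restore the input order.
    return [results[i] for i in sorted(results)]

print(run_ordered(["b\n", "a\n", "c\n"], n_threads=2))
```

Calling `f.result()` on each future also surfaces exceptions raised inside a worker, which `tp.submit` alone would silently swallow.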

Edit (with memory optimization):

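The code above, although very fast, uses a lot of memory, because it loads the whole file into memory and then works on it.

You then have two options:

Split your file into several smaller files. In my tests (see below) with a test file of about 10 million lines, the program ran very fast but used up to 1.3 GB of memory.
Use the code here, where I load one line at a time, hand that line to a thread that works on it, and then push the data to a thread whose only job is writing to the file. This way memory usage drops significantly, but execution time goes up.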
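
For the first option, splitting the input into smaller files could be sketched like this. Only one line is held in memory at a time. The function name `split_file` and the `part_00000.txt` naming scheme are hypothetical choices for illustration.

```python
import os

def split_file(path, lines_per_part, out_dir):
    """Split a text file into parts of at most lines_per_part lines.

    Returns the list of part file paths, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    part = None
    with open(path, "r") as src:
        for n, line in enumerate(src):
            if n % lines_per_part == 0:
                # Start a new part file every lines_per_part lines.
                if part is not None:
                    part.close()
                part_path = os.path.join(out_dir, f"part_{len(part_paths):05d}.txt")
                part_paths.append(part_path)
                part = open(part_path, "w")
            part.write(line)
    if part is not None:
        part.close()
    return part_paths
```

Each part can then be fed to the in-memory version of the program in turn, with `lines_per_part` chosen so that one part fits comfortably in RAM.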

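The code below reads one line at a time from the file (10 million lines, about 500 MB) and sends that data to a class that manages a fixed number of threads. Currently I spawn a new thread every time one finishes; it could be made more efficient by always using the same threads, with a queue for each thread. I then spawn a writer thread whose only job is to write the results to the out.txt file. In my tests I only read the text file and wrote the same lines to another file.
This is what I found (using the 10-million-line file):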

Original code: took 14.20630669593811 seconds, using 1.301 GB of RAM (average usage) and 10% CPU.
Updated code: took 1230.4356942176819 seconds, using 4.3 MB of RAM (average usage) and 10% CPU, with the internal parameters shown in the code below.

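Both timings were obtained with the same number of threads.
From these results it is clear that the memory-optimized code runs much slower while using far less memory. You can tune the internal parameters, such as the number of threads or the maximum queue size, to improve performance, keeping in mind that this affects memory usage. After a lot of testing, I suggest splitting the file into several sub-files that fit in your memory and running the original version of the code (see above), because in my opinion the trade-off between execution time and memory is simply not justified.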
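
Both runs report about 10% CPU usage. One possible explanation, which is an assumption and not something these measurements establish, is that CPython's global interpreter lock lets only one thread execute Python bytecode at a time, so threads help little when the per-line work is CPU-bound. A hedged sketch of the same idea with processes, using `multiprocessing.Pool.imap`, is shown here; `SAMPLE_WORDS`, `check_stub`, `expand_line` and `run_pool` are hypothetical names, and `check_stub` merely stands in for the real `check`.

```python
import multiprocessing as mp

# Hypothetical stand-ins for cryptotools' WORDS and check.
SAMPLE_WORDS = ["abandon", "ability", "able"]

def check_stub(mnemonic):
    # Placeholder predicate; the real check validates the mnemonic.
    return "able" in mnemonic.split()

def expand_line(line):
    """Return every candidate from one template line that passes the check."""
    template = line.strip()
    hits = []
    for word in SAMPLE_WORDS:
        mnemonic = template.format(x=word)
        if check_stub(mnemonic):
            hits.append(mnemonic)
    return hits

def run_pool(in_path, out_path, workers=4, chunksize=1000):
    # imap yields results in input order; chunksize batches several lines
    # into one task to reduce inter-process overhead.
    with mp.Pool(processes=workers) as pool, \
         open(in_path, "r") as src, open(out_path, "w") as dst:
        for hits in pool.imap(expand_line, src, chunksize=chunksize):
            for mnemonic in hits:
                dst.write(mnemonic + "\n")
```

`run_pool("input.txt", "print.txt")` should be called from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start workers by spawning a fresh interpreter. Whether this actually beats the threaded version depends on how expensive `check` is, so it is worth timing on a small sample first.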
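Here is the code I used to optimize memory consumption (yes, it is a lot more complex than the code above XD, and probably more complex than it needs to be). Keep in mind that it is not optimized in any significant way as far as thread management goes; one suggestion is to always use the same threads and pass data to them through multiple queues.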


from threading import Thread
import time
import queue

MAX_Q_SIZE = 100000
m_queue = queue.Queue(maxsize=MAX_Q_SIZE)
end_thread = object()  # sentinel that tells the writer to stop
written_lines = 0      # only modified by the writer thread

def doWork(data):
    # do your work here, then hand the result to the writer;
    # put() blocks while the queue is full, so no polling is needed
    m_queue.put(data)

def writer():
    global written_lines
    # "w" mode creates out.txt if it does not exist
    with open("out.txt", "w") as out:
        while True:
            item = m_queue.get()  # blocks until an item is available
            if item is end_thread:
                break
            written_lines += 1
            out.write(item)


class Spawner:
    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.current_threads = [None] * max_threads
        self.active_threads = 0
        self.writer = Thread(target=writer)
        self.writer.start()

    def sendWork(self, data):
        m_thread = Thread(target=doWork, args=(data,))
        replace_at = -1
        if self.active_threads >= self.max_threads:
            # wait for at least 1 thread to finish
            while replace_at == -1:
                for index in range(self.max_threads):
                    if not self.current_threads[index].is_alive():
                        self.current_threads[index] = None
                        self.active_threads -= 1
                        replace_at = index
                        break
                # else: no threads have finished, keep waiting
        if replace_at == -1:
            # only if fewer than max_threads slots are in use
            for i in range(len(self.current_threads)):
                if self.current_threads[i] is None:
                    replace_at = i
                    break
        self.current_threads[replace_at] = m_thread
        self.active_threads += 1
        m_thread.start()

    def waitEnd(self):
        # some slots may never have been used, so skip the empty ones
        for t in self.current_threads:
            if t is not None:
                t.join()
        self.active_threads = 0
        # all workers are done: tell the writer to stop and wait for it
        m_queue.put(end_thread)
        self.writer.join()


start_time = time.time()

spawner = Spawner(50)
with open("input.txt", "r") as infile:
    for line in infile:
        spawner.sendWork(line)

spawner.waitEnd()
print("--- %s seconds ---" % (time.time() - start_time))

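You can remove the prints; I left them in only to show how I measured the time the program takes to run. Below are Task Manager screenshots of both programs running.

Memory-optimized version:
Original version (I forgot to expand the terminal process when I took the screenshot; in any case the memory used by the terminal's subprocesses is negligible compared with what the program uses, and the 1.3 GB of RAM is accurate):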
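
The suggestion above, to reuse the same threads and feed them through queues instead of spawning one thread per line, could look roughly like this. It is a sketch and not the code that produced the timings above; `process_line`, `worker`, `writer_loop` and `run_pipeline` are hypothetical names, and it uses one shared work queue where the suggestion mentions one queue per thread.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of a queue

def process_line(line):
    # Placeholder for the real per-line work (hypothetical).
    return line

def worker(work_q, out_q):
    # Each worker lives for the whole run and pulls lines until it sees STOP.
    while True:
        line = work_q.get()
        if line is STOP:
            break
        out_q.put(process_line(line))

def writer_loop(out_q, out_path):
    # Single writer thread: the only code that touches the output file.
    with open(out_path, "w") as out:
        while True:
            item = out_q.get()
            if item is STOP:
                break
            out.write(item)

def run_pipeline(in_path, out_path, n_workers=8, max_q_size=10000):
    # Bounded queues keep memory use limited: put() blocks when they are full.
    work_q = queue.Queue(maxsize=max_q_size)
    out_q = queue.Queue(maxsize=max_q_size)
    workers = [threading.Thread(target=worker, args=(work_q, out_q))
               for _ in range(n_workers)]
    writer_thread = threading.Thread(target=writer_loop, args=(out_q, out_path))
    writer_thread.start()
    for t in workers:
        t.start()
    with open(in_path, "r") as infile:
        for line in infile:
            work_q.put(line)
    for _ in workers:
        work_q.put(STOP)  # one sentinel per worker
    for t in workers:
        t.join()
    out_q.put(STOP)       # all workers are done, so stop the writer
    writer_thread.join()
```

Because several workers feed the writer concurrently, the output lines are not guaranteed to come out in input order. Thread creation happens only `n_workers` times here, instead of once per line.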

Answer 1
2 Accepted 2021-08-20 15:37:13
