How to count word frequencies in a huge file concurrently?
I need to count the word frequencies in a 3 GB gzipped plain-text file of English sentences, which is about 30 GB when unzipped.
I have a single-threaded script using collections.Counter and gzip.open, and it takes hours to finish.
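A minimal sketch of what that single-threaded script does (assuming whitespace-separated words; the path is a placeholder):

```python
import gzip
from collections import Counter

def count_words(path):
    # One Counter, one pass; the 30 GB is streamed line by line,
    # never loaded into memory at once.
    cnt = Counter()
    with gzip.open(path, 'rt') as f:
        for line in f:
            cnt.update(line.split())
    return cnt
```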
Since reading a file line by line is much faster than splitting and counting, I am thinking about a producer-consumer flow: a file reader produces lines, several consumers do the splitting and counting, and at the end the Counters are merged to get the word occurrences.
However, I cannot find an example of ProcessPoolExecutor that sends a queue to an Executor; the examples just map single items from a list. And there are only single-threaded examples for asyncio.Queue.
It is a huge file, so I cannot read the whole file into a list before counting, and thus I cannot use concurrent.futures.Executor.map. But all the examples I have read start from a fixed list.
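One workaround I am considering: multiprocessing.Pool.imap_unordered accepts any iterable, including a generator, so the input never has to exist as one in-memory list (though imap's internal task buffer is unbounded, so this is not strict backpressure). An untested sketch along those lines, with placeholder names:

```python
import gzip
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(lines):
    # Each task counts one chunk of lines and returns a partial Counter.
    cnt = Counter()
    for line in lines:
        cnt.update(line.split())
    return cnt

def iter_chunks(path, n=10000):
    # Yield lists of up to n lines; the file itself is streamed,
    # never materialized whole.
    with gzip.open(path, 'rt') as f:
        while True:
            block = list(islice(f, n))
            if not block:
                return
            yield block

def count_words_parallel(path, workers=4):
    total = Counter()
    with Pool(workers) as pool:
        # imap_unordered pulls chunks from the generator as it goes and
        # yields partial Counters as workers finish them.
        for part in pool.imap_unordered(count_chunk, iter_chunks(path)):
            total.update(part)
    return total
```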
The time to split and count one sentence is comparable to the time to fork a process, so I have to make each consumer process live longer. I do not think map can merge the Counters, so I cannot use chunksize > 1. Thus I have to give the consumers a queue and make them keep counting until the whole file is finished. But most examples only send one item to a consumer and use chunksize=1000 to reduce the number of forks.
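On the chunksize point: as far as I can tell, chunksize only batches how input items are shipped to the workers; map still returns one result per item, so if each task returns a Counter, the merge can happen afterwards in the parent. An untested sketch of that pattern (names are mine):

```python
from collections import Counter
from multiprocessing import Pool

def count_line(line):
    # One Counter per input line. chunksize only controls how many
    # lines are shipped to a worker per task, not the shape of the
    # result list, which stays one-Counter-per-line.
    return Counter(line.split())

def parallel_count(lines, workers=2):
    total = Counter()
    with Pool(workers) as pool:
        for c in pool.map(count_line, lines, chunksize=1000):
            total.update(c)  # the merge happens here, in the parent
    return total
```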
Could you write an example for me?
I hope the code is backward compatible with Python 3.5.3, since PyPy is faster.
My real case is for a more specific file format:
chr1 10011 141 0 157 4 41 50
chr1 10012 146 1 158 4 42 51
chr1 10013 150 0 163 4 43 53
chr1 10014 164 3 167 4 44 54
I need to build a histogram for each single column from column 3 to column 8, so I took word frequencies as a simpler example.
My code is:
#!/usr/bin/env pypy3
import sys

SamplesList = ('D_Crick', 'D_Watson', 'Normal_Crick', 'Normal_Watson', 'D_WGS', 'Normal_WGS')

def main():
    import math

    if len(sys.argv) < 3 :
        print('Usage:',sys.argv[0],'<samtools.depth.gz> <out.tsv> [verbose=0]',file=sys.stderr,flush=True)
        exit(0)
    try:
        verbose = int(sys.argv[3])
    except: # `except IndexError:` and `except ValueError:`
        verbose = 0

    inDepthFile = sys.argv[1]
    outFile = sys.argv[2]
    print('From:[{}], To:[{}].\nVerbose: [{}].'.format(inDepthFile,outFile,verbose),file=sys.stderr,flush=True)
    RecordCnt,MaxDepth,cDepthCnt,cDepthStat = inStat(inDepthFile,verbose)
    for k in SamplesList:
        cDepthStat[k][2] = cDepthStat[k][0] / RecordCnt # E(X)
        cDepthStat[k][3] = cDepthStat[k][1] / RecordCnt # E(X^2)
        cDepthStat[k][4] = math.sqrt(cDepthStat[k][3] - cDepthStat[k][2]*cDepthStat[k][2]) # E(X^2)-E(X)^2
    tsvout = open(outFile, 'wt')
    print('#{}\t{}'.format('Depth','\t'.join(SamplesList)),file=tsvout)
    #RecordCntLength = len(str(RecordCnt))
    print( '#N={},SD:\t{}'.format(RecordCnt,'\t'.join(str(round(cDepthStat[col][4],1)) for col in SamplesList)),file=tsvout)
    for depth in range(0,MaxDepth+1):
        print( '{}\t{}'.format(depth,'\t'.join(str(cDepthCnt[col][depth]) for col in SamplesList)),file=tsvout)
    tsvout.close()
    pass

def inStat(inDepthFile,verbose):
    import gzip
    import csv
    from collections import Counter
    # Looking up things in global scope takes longer than looking up stuff in local scope. <https://stackoverflow.com/a/54645851/159695>
    cDepthCnt = {key:Counter() for key in SamplesList}
    cDepthStat = {key:[0,0,0,0,0] for key in SamplesList} # x and x^2
    RecordCnt = 0
    MaxDepth = 0
    with gzip.open(inDepthFile, 'rt') as tsvin:
        tsvin = csv.DictReader(tsvin, delimiter='\t', fieldnames=('ChrID','Pos')+SamplesList )
        try:
            for row in tsvin:
                RecordCnt += 1
                for k in SamplesList:
                    theValue = int(row[k])
                    if theValue > MaxDepth:
                        MaxDepth = theValue
                    cDepthCnt[k][theValue] += 1 # PyPy3:29.82 ns, Python3:30.61 ns
                    cDepthStat[k][0] += theValue
                    cDepthStat[k][1] += theValue * theValue
            #print(MaxDepth,DepthCnt)
        except KeyboardInterrupt:
            print('\n[!]Ctrl+C pressed.',file=sys.stderr,flush=True)
            pass
    print('[!]Lines Read:[{}], MaxDepth is [{}].'.format(RecordCnt,MaxDepth),file=sys.stderr,flush=True)
    return RecordCnt,MaxDepth,cDepthCnt,cDepthStat

if __name__ == "__main__":
    main() # time python3 ./samdepthplot.py t.tsv.gz 1
csv.DictReader takes most of the time.
My problem is that although the gzip reader is fast and the csv reader is fast, I need to count billions of lines, and the csv reader is certainly SLOWER than the gzip reader.
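Since my rows are plain tab-separated integers, one thing worth trying before going parallel is to replace csv.DictReader with str.split, which avoids building a dict per row; an untested sketch against the same layout:

```python
import gzip
from collections import Counter

SamplesList = ('D_Crick', 'D_Watson', 'Normal_Crick', 'Normal_Watson', 'D_WGS', 'Normal_WGS')

def inStatSplit(inDepthFile):
    # Columns are: ChrID, Pos, then one depth value per sample,
    # tab-separated. fields[2:] lines up with SamplesList, so no
    # per-row dict is built.
    cDepthCnt = {key: Counter() for key in SamplesList}
    with gzip.open(inDepthFile, 'rt') as tsvin:
        for line in tsvin:
            fields = line.split('\t')
            for k, v in zip(SamplesList, fields[2:]):
                cDepthCnt[k][int(v)] += 1
    return cDepthCnt
```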
So I need to spread lines across different csv-reader worker processes and do the downstream counting separately. It is convenient to use a queue between one producer and many consumers.
Since I am using Python rather than C, is there some abstracted wrapper for multiprocessing and queues? Is it possible to use ProcessPoolExecutor with the Queue class?
I've never tested this code, but it should work.
The first thing is to check the number of lines:
f = 'myfile.txt'

def file_len(f):
    with open(f) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

num_lines = file_len(f)
Split the data into n partitions:
n = 8  # number of processes, for example
split_size = num_lines//n if num_lines//n > 0 else 1
parts = [x for x in range(0, num_lines, split_size)]
And now start the jobs:
from multiprocessing import Process
import linecache

jobs = []
for part in range(len(parts)):
    p = Process(target = function_here, args = ('myfile.txt', parts[part], split_size))
    jobs.append(p)
    p.start()

for p in jobs:
    p.join()
An example of the function:
def function_here(your_file_name, line_number, split_size):
    for current_line in range(line_number, (line_number+split_size)+1):
        print( linecache.getline(your_file_name, current_line))
Still, you will need to check the number of lines before doing any operation.
A 30 GB text file is big enough to put your question into the realm of Big Data. So to tackle this problem I suggest using Big-Data tools like Hadoop and Spark. What you explained as a "producer-consumer flow" is basically what the MapReduce algorithm is designed for. Word-frequency counting is a typical MapReduce problem; look it up and you will find tons of examples.
The idea is to break the huge file into smaller files, invoke many workers that do the counting job and return a Counter, and finally merge the counters.
from itertools import islice
from multiprocessing import Pool
from collections import Counter
import os

NUM_OF_LINES = 3
INPUT_FILE = 'huge.txt'
POOL_SIZE = 10

def slice_huge_file():
    cnt = 0
    with open(INPUT_FILE) as f:
        while True:
            next_n_lines = list(islice(f, NUM_OF_LINES))
            cnt += 1
            if not next_n_lines:
                break
            with open('sub_huge_{}.txt'.format(cnt), 'w') as out:
                out.writelines(next_n_lines)

def count_file_words(input_file):
    # Count whitespace-separated words (not whole lines) in one slice.
    with open(input_file, 'r') as f:
        return Counter(word for line in f for word in line.split())

if __name__ == '__main__':
    slice_huge_file()
    pool = Pool(POOL_SIZE)
    sub_files = [os.path.join('.',f) for f in os.listdir('.') if f.startswith('sub_huge')]
    results = pool.map(count_file_words, sub_files)
    final_counter = Counter()
    for counter in results:
        final_counter += counter
    print(final_counter)
Just some pseudocode:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager
import traceback

WORKER_POOL_SIZE = 10  # you should set this to the number of your processes
QUEUE_SIZE = 100       # 10 times your pool size is good enough

def main():
    with Manager() as manager:
        q = manager.Queue(QUEUE_SIZE)

        # init worker pool
        executor = ProcessPoolExecutor(max_workers=WORKER_POOL_SIZE)
        workers_pool = [executor.submit(worker, i, q) for i in range(WORKER_POOL_SIZE)]

        # start producer
        run_producer(q)

        # wait until done
        for f in workers_pool:
            try:
                f.result()
            except Exception:
                traceback.print_exc()

def run_producer(q):
    try:
        with open("your file path") as fp:
            for line in fp:
                q.put(line)
    except Exception:
        traceback.print_exc()
    finally:
        q.put(None)

def worker(i, q):
    while 1:
        line = q.get()
        if line is None:
            print('worker {} is done'.format(i))  # .format for Python 3.5 compatibility
            q.put(None)
            return
        # do something with this line
        # ...
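To make this pseudocode concrete for the word-count case, the worker could keep a local Counter, re-post the sentinel so the next worker also stops, and return its partial count; main() would then merge f.result() from every future. An untested sketch of just the worker:

```python
from collections import Counter

def worker(i, q):
    # Count words locally until the None sentinel arrives, then re-post
    # the sentinel for the remaining workers and return the partial
    # Counter; main() merges f.result() from every future.
    cnt = Counter()
    while True:
        line = q.get()
        if line is None:
            q.put(None)
            return cnt
        cnt.update(line.split())
```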
I learned the multiprocessing lib over the weekend. Stopping on Ctrl+C and writing the current result still does not work, but the main function is fine now.
#!/usr/bin/env pypy3
import sys
from collections import Counter
from multiprocessing import Pool, Process, Manager, current_process, freeze_support

SamplesList = ('D_Crick', 'D_Watson', 'Normal_Crick', 'Normal_Watson', 'D_WGS', 'Normal_WGS')

ChunkSize = 1024 * 128
verbose = 0
Nworkers = 16

def main():
    import math

    if len(sys.argv) < 3 :
        print('Usage:',sys.argv[0],'<samtools.depth.gz> <out.tsv> [verbose=0]',file=sys.stderr,flush=True)
        exit(0)
    try:
        verbose = int(sys.argv[3])
    except: # `except IndexError:` and `except ValueError:`
        verbose = 0

    inDepthFile = sys.argv[1]
    outFile = sys.argv[2]
    print('From:[{}], To:[{}].\nVerbose: [{}].'.format(inDepthFile,outFile,verbose),file=sys.stderr,flush=True)
    RecordCnt,MaxDepth,cDepthCnt,cDepthStat = CallStat(inDepthFile)
    for k in SamplesList:
        cDepthStat[k][2] = cDepthStat[k][0] / RecordCnt # E(X)
        cDepthStat[k][3] = cDepthStat[k][1] / RecordCnt # E(X^2)
        cDepthStat[k][4] = math.sqrt(cDepthStat[k][3] - cDepthStat[k][2]*cDepthStat[k][2]) # E(X^2)-E(X)^2
    tsvout = open(outFile, 'wt')
    print('#{}\t{}'.format('Depth','\t'.join(SamplesList)),file=tsvout)
    #RecordCntLength = len(str(RecordCnt))
    print( '#N={},SD:\t{}'.format(RecordCnt,'\t'.join(str(round(cDepthStat[col][4],1)) for col in SamplesList)),file=tsvout)
    for depth in range(0,MaxDepth+1):
        #print( '{}\t{}'.format(depth,'\t'.join(str(DepthCnt[col][depth]) for col in SamplesList)) )
        #print( '{}\t{}'.format(depth,'\t'.join(str(yDepthCnt[depth][col]) for col in SamplesList)) )
        print( '{}\t{}'.format(depth,'\t'.join(str(cDepthCnt[col][depth]) for col in SamplesList)),file=tsvout)
        #pass
    #print('#MaxDepth={}'.format(MaxDepth),file=tsvout)
    tsvout.close()
    pass

def CallStat(inDepthFile):
    import gzip
    import itertools

    RecordCnt = 0
    MaxDepth = 0
    cDepthCnt = {key:Counter() for key in SamplesList}
    cDepthStat = {key:[0,0,0,0,0] for key in SamplesList} # x and x^2
    #lines_queue = Queue()
    manager = Manager()
    lines_queue = manager.Queue()
    stater_pool = Pool(Nworkers)
    TASKS = itertools.repeat((lines_queue,SamplesList),Nworkers)
    #ApplyResult = [stater_pool.apply_async(iStator,x) for x in TASKS]
    #MapResult = stater_pool.map_async(iStator,TASKS,1)
    AsyncResult = stater_pool.imap_unordered(iStator,TASKS,1)
    try:
        with gzip.open(inDepthFile, 'rt') as tsvfin:
            while True:
                lines = tsvfin.readlines(ChunkSize)
                lines_queue.put(lines)
                if not lines:
                    for i in range(Nworkers):
                        lines_queue.put(b'\n\n')
                    break
    except KeyboardInterrupt:
        print('\n[!]Ctrl+C pressed.',file=sys.stderr,flush=True)
        for i in range(Nworkers):
            lines_queue.put(b'\n\n')
        pass
    #for results in ApplyResult:
        #(iRecordCnt,iMaxDepth,icDepthCnt,icDepthStat) = results.get()
    #for (iRecordCnt,iMaxDepth,icDepthCnt,icDepthStat) in MapResult.get():
    for (iRecordCnt,iMaxDepth,icDepthCnt,icDepthStat) in AsyncResult:
        RecordCnt += iRecordCnt
        if iMaxDepth > MaxDepth:
            MaxDepth = iMaxDepth
        for k in SamplesList:
            cDepthCnt[k].update(icDepthCnt[k])
            cDepthStat[k][0] += icDepthStat[k][0]
            cDepthStat[k][1] += icDepthStat[k][1]
    return RecordCnt,MaxDepth,cDepthCnt,cDepthStat

#def iStator(inQueue,inSamplesList):
def iStator(args):
    (inQueue,inSamplesList) = args
    import csv
    # Looking up things in global scope takes longer than looking up stuff in local scope. <https://stackoverflow.com/a/54645851/159695>
    cDepthCnt = {key:Counter() for key in inSamplesList}
    cDepthStat = {key:[0,0] for key in inSamplesList} # x and x^2
    RecordCnt = 0
    MaxDepth = 0
    for lines in iter(inQueue.get, b'\n\n'):
        try:
            tsvin = csv.DictReader(lines, delimiter='\t', fieldnames=('ChrID','Pos')+inSamplesList )
            for row in tsvin:
                #print(', '.join(row[col] for col in inSamplesList))
                RecordCnt += 1
                for k in inSamplesList:
                    theValue = int(row[k])
                    if theValue > MaxDepth:
                        MaxDepth = theValue
                    #DepthCnt[k][theValue] += 1 # PyPy3:30.54 ns, Python3:22.23 ns
                    #yDepthCnt[theValue][k] += 1 # PyPy3:30.47 ns, Python3:21.50 ns
                    cDepthCnt[k][theValue] += 1 # PyPy3:29.82 ns, Python3:30.61 ns
                    cDepthStat[k][0] += theValue
                    cDepthStat[k][1] += theValue * theValue
            #print(MaxDepth,DepthCnt)
        except KeyboardInterrupt:
            print('\n[!]Ctrl+C pressed.',file=sys.stderr,flush=True)
            pass
    #print('[!]{} Lines Read:[{}], MaxDepth is [{}].'.format(current_process().name,RecordCnt,MaxDepth),file=sys.stderr,flush=True)
    return RecordCnt,MaxDepth,cDepthCnt,cDepthStat

if __name__ == "__main__":
    main() # time python3 ./samdepthplot.py t.tsv.gz 1