Improving Efficiency of Reading a Large CSV File
I am using RAKE (the Rapid Automatic Keyword Extraction algorithm) to generate keywords. I have around 53 million records, roughly 4.6 GB in total, and I want to know the best way to process them.
I have RAKE nicely wrapped up in a class, and the input is a 4.5 GB file consisting of those 53 million records. Here are a few approaches I tried.
Approach #1:
with open("~inputfile.csv") as fd:
    for line in fd:
        keywords = rake.run(line)
        write(keywords)
This is the basic brute-force way. Since writing to the file takes time, calling it 53 million times will be costly. So I used the following approach instead, writing to the file once every 100K lines.
Approach #2:
with open("~inputfile.csv") as fd:
    temp_string = ''
    counter = 0
    for line in fd:
        keywords = rake.run(line)
        temp_string = temp_string + keywords + '\n'
        counter += 1
        if counter == 100000:
            write(temp_string)
            temp_string = ''
            counter = 0
    # write whatever is left over after the last full batch
    if temp_string:
        write(temp_string)
To my surprise, approach #2 took more time than approach #1. I don't get it! How is that possible? Can you also suggest a better approach?
Approach #3 (thanks to cefstat):
with open("~inputfile.csv") as fd:
    strings = []
    counter = 0
    for line in fd:
        strings.append(rake.run(line))
        counter += 1
        if counter == 100000:
            write("\n".join(strings))
            write("\n")
            strings = []
            counter = 0
    # flush whatever is left over after the last full batch
    if strings:
        write("\n".join(strings))
        write("\n")
Runs faster than approaches #1 and #2.
Thanks in advance!
As mentioned in the comments, Python already buffers writes to files, so implementing your own buffering in Python (as opposed to in C, where it already happens) is only going to make things slower. You can tune the buffer size with an argument to the call to open(), as sketched below.
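A minimal sketch of that tuning, reusing the asker's rake object (the file names and the 4 MB buffer size are arbitrary assumptions, not recommendations):
BUFFER_SIZE = 4 * 1024 * 1024  # assumed 4 MB buffer; tune for your workload

# an explicit buffering argument lets the existing C-level buffering batch the
# writes for you, instead of batching them by hand in Python
with open("keywords.txt", "w", buffering=BUFFER_SIZE) as out, \
        open("inputfile.csv") as fd:
    for line in fd:
        out.write(rake.run(line) + "\n")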
Another approach is to read the file in chunks. The basic algorithm is this:
1. Partition the file into chunks by iterating over it with file.seek(x), where x = current position + desired chunk size
2. Have each worker process parse its own chunk; each process writes its keywords to its own file
3. Coordinate the separate output files afterwards; you have a few options for merging them (see the sketch after the code snippet below)
There are plenty of blog posts and recipes about reading large files in parallel:
https://stackoverflow.com/a/8717312/2615940
http://aamirhussain.com/2013/10/02/parsing-large-csv-files-in-python/
http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/
http://effbot.org/zone/wide-finder.htm
Side note: I once tried to do the same thing and got the same result you did. It also doesn't help to farm the file writes out to another thread (at least it didn't when I tried it).
Here's a code snippet that demonstrates the algorithm:
import functools
import multiprocessing

BYTES_PER_MB = 1048576

# stand-in for whatever processing you need to do on each line
# for demonstration, we'll just grab the first character of every non-empty line
def line_processor(line):
    try:
        return line[0]
    except IndexError:
        return None

# here's your worker function that executes in a worker process
def parser(file_name, start, end):
    with open(file_name) as infile:
        # get to the proper starting position
        infile.seek(start)

        # use read() to force exactly the number of bytes we want
        lines = infile.read(end - start).split("\n")

    return [line_processor(line) for line in lines]

# this function splits the file into chunks and yields the start and end byte
# positions of each chunk
def chunk_file(file_name):
    chunk_start = 0
    chunk_size = 512 * BYTES_PER_MB  # 512 MB chunk size

    with open(file_name) as infile:
        # we can't use the 'for line in infile' construct because infile.tell()
        # is not accurate during that kind of iteration
        while True:
            # move the chunk end to the end of this chunk
            chunk_end = chunk_start + chunk_size
            infile.seek(chunk_end)

            # reading a line advances the file position to the end of the line,
            # so that chunks don't break lines
            line = infile.readline()

            # check to see if we've read past the end of the file
            if line == '':
                yield (chunk_start, chunk_end)
                break

            # adjust the chunk end to ensure it didn't break a line
            chunk_end = infile.tell()
            yield (chunk_start, chunk_end)

            # move the starting point to the beginning of the new chunk
            chunk_start = chunk_end

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    keywords = []

    file_name = "inputfile.csv"  # enter your file name here

    # bind the file name argument to the parsing function so we don't have to
    # explicitly pass it every time
    new_parser = functools.partial(parser, file_name)

    # chunk out the file and launch the subprocesses in one step
    for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):
        # as each list becomes available, extend the keyword list with the new one.
        # there are definitely faster ways to do this - have a look at
        # itertools.chain() for other ways to iterate over or combine your
        # keyword lists
        keywords.extend(keyword_list)

    # now do whatever you need to do with your list of keywords
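If, as in step 2 of the algorithm above, each worker writes its keywords to its own file instead of returning a list, the coordination step can be as simple as concatenating the parts. A minimal sketch, assuming hypothetical per-worker files named keywords_0.txt, keywords_1.txt, and so on:
import glob
import shutil

# concatenate the hypothetical per-worker output files into one final file;
# sorted() gives lexicographic order, which is fine for a sketch
with open("keywords_all.txt", "w") as merged:
    for part in sorted(glob.glob("keywords_*.txt")):
        with open(part) as chunk:
            shutil.copyfileobj(chunk, merged)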
Repeatedly appending to a string in Python is very slow (as jedwards mentioned). You can try the following standard alternative, which is almost certainly faster than #2, and which in my limited testing appeared to be about 30% faster than approach #1 (though perhaps still not fast enough for your needs):
with open("~inputfile.csv") as fd:
    strings = []
    counter = 0
    for line in fd:
        strings.append(rake.run(line))
        counter += 1
        if counter == 100000:
            write("\n".join(strings))
            write("\n")
            strings = []
            counter = 0
    # flush whatever is left over after the last full batch
    if strings:
        write("\n".join(strings))
        write("\n")