
Improving Efficiency of Reading a Large CSV File

I am using RAKE (Rapid Automatic Keyword Extraction algorithm) to generate keywords. I have roughly 53 million records, about 4.6 GB of data, and I want to know the best way to go about this.

I have RAKE wrapped up nicely in a class. I have a 4.5 GB file which consists of those 53 million records. Below are some of the approaches I tried.

Approach #1:

with open("~inputfile.csv") as fd:
   for line in fd:
      keywords = rake.run(line)
      write(keywords)

This is the basic brute-force way. Assuming that writing to the file takes time, calling it 53 million times would be costly, so I used the following approach, which writes to the file once every 100K lines.

Approach #2

with open("~inputfile.csv") as fd:
   string = ''
   counter = 0
   for line in fd:
      keywords = rake.run(line)
      string = string + keywords + '\n'
      counter += 1
      if counter == 100000:
           write(string)
           string = ''
           counter = 0

To my surprise, Approach #2 took more time than Approach #1. I don't get it! How is that possible? Can you also suggest a better approach?

Approach #3 (thanks to cefstat)

with open("~inputfile.csv") as fd:
  strings = []
  counter = 0
  for line in fd:
    strings.append(rake.run(line))
    counter += 1
    if counter == 100000:
      write("\n".join(strings))
      write("\n")
      strings = []
      counter = 0

This runs faster than Approaches #1 and #2.

Thanks in advance!

As mentioned in the comments, Python already buffers writes to files, so implementing your own buffering in Python (as opposed to in C, where it already happens) is only going to make things slower. You can tune the buffer size with an argument to the call to open.
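
For example, a minimal sketch of that tuning (not the answer's own code; the output file name and the 1 MB buffer size are arbitrary examples, and rake.run() is assumed to return a string, as in the question's snippets):

BYTES_PER_MB = 1048576

# keep the plain per-line loop, but give the output file a larger write buffer
# so the batching happens in C instead of in Python
with open("~inputfile.csv") as fd, \
     open("keywords.txt", "w", buffering=BYTES_PER_MB) as out:
    for line in fd:
        out.write(rake.run(line) + "\n")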

A different approach is to read the file in chunks. The basic algorithm is this:

  1. Iterate over the file using file.seek(x), where x = current position + desired chunk size
  2. While iterating, record the start and end byte positions of each chunk
  3. Read the chunks in (using those start and end byte positions) as workers become available (using multiprocessing.Pool())
  4. Have each process write its own keyword file

  5. Reconcile the separate files. You have a few options (see the sketch after this list):

    • Read the keyword files back into memory and collect them into a single list
    • If on *nix, combine the keyword files with the "cat" command
    • If you are on Windows, keep a list of the keyword file paths (instead of one file path) and iterate over those files as needed
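
A minimal sketch of that reconciliation step, assuming the workers wrote files named keywords_0.txt, keywords_1.txt, and so on (hypothetical names; adapt them to however your processes actually name their output):

import glob
import subprocess

part_files = sorted(glob.glob("keywords_*.txt"))  # hypothetical per-process output files

# Option 1: read the keyword files back into a single in-memory list
keywords = []
for path in part_files:
    with open(path) as f:
        keywords.extend(line.rstrip("\n") for line in f)

# Option 2 (*nix only): let cat concatenate them into one combined file
with open("all_keywords.txt", "w") as combined:
    subprocess.run(["cat"] + part_files, stdout=combined, check=True)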

There are lots of blog posts and recipes about reading large files in parallel:

https://stackoverflow.com/a/8717312/2615940
http://aamirhussain.com/2013/10/02/parsing-large-csv-files-in-python/
http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/
http://effbot.org/zone/wide-finder.htm

Side note: I once tried to do the same thing and got the same result. It also doesn't help to hand the file writing off to another thread (at least it didn't when I tried).

Here is a code snippet demonstrating the algorithm:

import functools
import multiprocessing

BYTES_PER_MB = 1048576

# stand-in for whatever processing you need to do on each line
# for demonstration, we'll just grab the first character of every non-empty line
def line_processor(line):
    try:
        return line[0]
    except IndexError:
        return None

# here's your worker function that executes in a worker process
def parser(file_name, start, end):

    with open(file_name) as infile:

        # get to proper starting position
        infile.seek(start)

        # use read() to force exactly the number of bytes we want
        lines = infile.read(end - start).split("\n")

    return [line_processor(line) for line in lines]

# this function splits the file into chunks and returns the start and end byte
# positions of each chunk
def chunk_file(file_name):

    chunk_start = 0
    chunk_size = 512 * BYTES_PER_MB # 512 MB chunk size

    with open(file_name) as infile:

        # we can't use the 'for line in infile' construct because infile.tell()
        # is not accurate during that kind of iteration

        while True:
            # move chunk end to the end of this chunk
            chunk_end = chunk_start + chunk_size
            infile.seek(chunk_end)

            # reading a line will advance the FP to the end of the line so that
            # chunks don't break lines
            line = infile.readline()

            # check to see if we've read past the end of the file
            if line == '':
                yield (chunk_start, chunk_end)
                break

            # adjust chunk end to ensure it didn't break a line
            chunk_end = infile.tell()

            yield (chunk_start, chunk_end)

            # move starting point to the beginning of the new chunk
            chunk_start = chunk_end

    return

if __name__ == "__main__":

    pool = multiprocessing.Pool()

    keywords = []

    file_name = ""  # enter your file name here

    # bind the file name argument to the parsing function so we don't have to
    # explicitly pass it every time
    new_parser = functools.partial(parser, file_name)

    # chunk out the file and launch the subprocesses in one step
    for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):

        # as each list is available, extend the keyword list with the new one
        # there are definitely faster ways to do this - have a look at 
        # itertools.chain() for other ways to iterate over or combine your
        # keyword lists
        keywords.extend(keyword_list) 

    # now do whatever you need to do with your list of keywords
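
As the comment in that loop suggests, itertools.chain() is one way to combine the per-chunk results without growing a single list; here is a small sketch that would replace the extend loop, reusing the pool, new_parser, chunk_file, and file_name names from the snippet above:

import itertools

# flatten the per-chunk keyword lists lazily instead of extending one big list
for keyword in itertools.chain.from_iterable(
        pool.starmap(new_parser, chunk_file(file_name))):
    pass  # handle each keyword here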

Repeatedly appending to strings in Python is very slow (as mentioned by jedwards). You can try the following standard alternative, which is almost certainly faster than #2 and which, in my limited testing, appeared to be about 30% faster than Approach #1 (though possibly still not fast enough for your needs):

with open("~inputfile.csv") as fd:
  strings = []
  counter = 0
  for line in fd:
    strings.append(rake.run(line))
    counter += 1
    if counter == 100000:
      write("\n".join(strings))
      write("\n")
      strings = []
      counter = 0
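
To see roughly why this helps: every s = s + ... builds a brand-new string, while "\n".join() makes a single pass over the pieces. A small self-contained timing sketch (the list size and repeat count are arbitrary, and the exact gap will vary by interpreter and machine):

import timeit

pieces = ["keyword"] * 50000  # arbitrary stand-in for a batch of extracted keywords

def concat():
    s = ""
    for p in pieces:
        s = s + p + "\n"   # builds a new string on every iteration
    return s

def joined():
    return "\n".join(pieces) + "\n"  # single pass over the pieces

print("repeated concatenation:", timeit.timeit(concat, number=3))
print("str.join:              ", timeit.timeit(joined, number=3))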
