Python: fastest way to retrieve comma-separated data from a file

Question

I have a file of a couple hundred thousand lines which looks like this:

01,T,None,Red,Big
02,F,None,Purple,Small
03,T,None,Blue,Big
.......

I want something that will retrieve the nth column from the whole file. For example, the 4th column would be:

Red
Purple
Blue

Since the file is very big, I am interested in knowing the most efficient way to do this.

The obvious solution would be to go through the file line by line, apply split(',') and take the 4th item of the resulting list, but I am wondering if there is anything slightly better.

Answer 1

I don't think you can improve much on just reading the file and using str.split(). However, you haven't shown us all your code. Make sure you aren't reading the entire file into memory before working on it (with file.readlines() or file.read()).

Something like this is probably about as good as you can do:

with open(filename, "rt") as f:
    for line in f:
        x = line.split(',')[3]
        # do something with x

If you want to be able to treat an input file as if it contained only one column, I suggest wrapping the above in a function that uses yield to provide the values.

def get_col3(f):
    for line in f:
        yield line.split(',')[3]

with open(filename, "rt") as f:
    for x in get_col3(f):
        print(x)  # do something with x

Given that the file I/O code is part of the C guts of Python, you probably can't pick up much extra speed by being tricky. But you could try writing a simple C program that reads the file, finds the fourth column, and prints it to standard output, then pipe that into a Python program.

If you will be working with the same input file a lot, it would probably make sense to save it in some sort of binary file format that is faster to load than parsing a text file. I believe scientists who work with really large data sets tend to like HDF5, and Python has good support for it through Pandas.

http://pandas.pydata.org/

http://www.hdfgroup.org/HDF5/

Hmm, now that I think about it: you should try using Pandas to import that text file. I remember the author of Pandas saying he had written some low-level code that greatly accelerated parsing input files.

Oh, found it: http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/

Hmm. Looking in the Pandas documentation, it appears you can call read_csv() with the optional argument usecols to specify the subset of columns you want, and it will throw away everything else.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
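A minimal sketch of the usecols idea, assuming Pandas is installed and the file has no header row (as in the question's sample). The in-memory sample and the variable names are only for illustration; for a real file you would pass its path to read_csv() instead.

```python
import io

import pandas as pd

# Sample data from the question, held in memory for illustration.
sample = io.StringIO(
    "01,T,None,Red,Big\n"
    "02,F,None,Purple,Small\n"
    "03,T,None,Blue,Big\n"
)

# header=None: there is no header row, so columns are labelled 0, 1, 2, ...
# usecols=[3]: keep only the fourth column (0-based index 3).
df = pd.read_csv(sample, header=None, usecols=[3])
fourth_column = df[3].tolist()
print(fourth_column)
```

Whether this beats a plain split() loop on your file is something to measure rather than assume.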

The reason I think Pandas might win on speed: when you call line.split(','), Python builds a string object for each of the columns, plus a list to hold them. You then index the list to grab the one string you need, and Python destroys the list and every other object it created. This "churn" of objects takes some time, and you multiply that time by the number of lines in the file. Pandas can parse the lines and return to Python only the column you need, so it might win.

But all this is mere speculation. The rule for speeding things up is: measure. Run the code, measure how fast it is, then run the other code and measure that, and see if the speedup is worth it.
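One way to do that measuring, sketched with the standard timeit module. The helper names col_by_split and col_by_csv, the temporary file, and the row count are all made up for this illustration; substitute your real file and whichever candidates you want to compare.

```python
import csv
import os
import tempfile
import timeit


def col_by_split(path, n):
    """Return column n (0-based) using str.split on each line."""
    with open(path, "rt") as f:
        return [line.rstrip("\n").split(",")[n] for line in f]


def col_by_csv(path, n):
    """Return column n (0-based) using the csv module."""
    with open(path, "rt", newline="") as f:
        return [row[n] for row in csv.reader(f)]


# Build a throwaway test file by repeating the question's sample rows.
rows = ["01,T,None,Red,Big", "02,F,None,Purple,Small", "03,T,None,Blue,Big"]
with tempfile.NamedTemporaryFile("wt", suffix=".csv", delete=False) as tmp:
    tmp.write("\n".join(rows * 10000) + "\n")
    path = tmp.name

try:
    for func in (col_by_split, col_by_csv):
        seconds = timeit.timeit(lambda: func(path, 3), number=5)
        print(f"{func.__name__}: {seconds:.3f} s for 5 runs")
    # Sanity check: both approaches should extract the same values.
    same_result = col_by_split(path, 3) == col_by_csv(path, 3)
finally:
    os.remove(path)
```

The absolute numbers depend on your machine and file, so only the comparison between candidates on the same data is meaningful.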

Answer 2

The csv module is the right way to read a csv file. A generator can help you get the right balance of speed and memory usage for a large file.

from csv import reader

def getNthCol(filename, n):
  with open(filename) as afile:
    r = reader(afile)
    for line in r:
      yield line[n]  # index the parsed row, not the reader object

You may want to adjust n by -1 if you're dead set on a 1-offset for your column number.

Update

Another way, which is almost certainly less asymptotically efficient but might actually be quite fast, is to transpose the file and grab a certain line.
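For instance, a 1-based variant could look like the sketch below. The name get_nth_col_1_based is hypothetical, and it takes an open file-like object rather than a filename so it can be tried on in-memory data.

```python
import csv
import io


def get_nth_col_1_based(lines, n):
    """Yield column n of each row, where n=1 means the first column."""
    for row in csv.reader(lines):
        yield row[n - 1]


sample = io.StringIO("01,T,None,Red,Big\n02,F,None,Purple,Small\n")
result = list(get_nth_col_1_based(sample, 4))
print(result)  # ['Red', 'Purple']
```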

def getNthCol(filename, n):
  with open(filename) as afile:
    # zip() returns an iterator in Python 3, so build a list before indexing
    return list(zip(*reader(afile)))[n]
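To see what the transpose does, here is a small illustration on already-parsed rows (hard-coded from the question's sample rather than read from a file):

```python
rows = [
    ["01", "T", "None", "Red", "Big"],
    ["02", "F", "None", "Purple", "Small"],
    ["03", "T", "None", "Blue", "Big"],
]

# zip(*rows) pairs up the first items, then the second items, and so on,
# so each tuple it produces is one column of the original table.
columns = list(zip(*rows))
print(columns[3])  # ('Red', 'Purple', 'Blue')
```

Note that this holds every row in memory at once, and zip() stops at the length of the shortest row, so a line with fewer fields causes the later columns to be dropped.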

Answer 3

I think your suggested method is the best way to go:

def nth_column(filepath, n):
    n -= 1  # since indices start at 0
    columns = []
    with open(filepath, 'r') as my_file:
        for line in my_file:
            try: columns.append(line.split(',')[n])
            except IndexError: pass # if the line doesn't have n columns
    return columns
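One thing to watch with any split-based approach: iterating over a file keeps each line's trailing newline, so the last column comes back with '\n' attached unless you strip it first. A small illustration on a hard-coded line:

```python
line = "01,T,None,Red,Big\n"  # what iterating over a file yields

raw_last = line.split(",")[4]
clean_last = line.rstrip("\n").split(",")[4]

print(repr(raw_last))    # 'Big\n'
print(repr(clean_last))  # 'Big'
```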

3 answers

Answer 1: 6, accepted, 2013-10-18 02:09:03
Answer 2: 5, 2013-10-18 02:09:23
Answer 3: 1, 2013-10-18 02:09:20
