Most efficient way to parse a large .csv in python?

I tried to look at other answers but I am still not sure of the right way to do this. I have a number of really large .csv files (each could be a gigabyte), and I want to first get their column labels, because they are not all the same, and then according to user preference extract some of those columns using some criteria. Before I start the extraction part I did a simple test to see what is the fastest way to parse these files, and here is my code:

import csv
import mmap
import time

def mmapUsage():
    start=time.time()
    with open("csvSample.csv", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mapInput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        L=list()
        for s in iter(mapInput.readline, ""):
            L.append(s)
        print "List length: " ,len(L)
        #print "Sample element: ",L[1]
        mapInput.close()
        end=time.time()
        print "Time for completion",end-start

def fileopenUsage():
    start=time.time()
    fileInput=open("csvSample.csv")
    M=list()
    for s in fileInput:
        M.append(s)
    print "List length: ",len(M)
    #print "Sample element: ",M[1]
    fileInput.close()
    end=time.time()
    print "Time for completion",end-start

def readAsCsv():
    X=list()
    start=time.time()
    spamReader = csv.reader(open('csvSample.csv', 'rb'))
    for row in spamReader:
        X.append(row)
    print "List length: ",len(X)
    #print "Sample element: ",X[1]
    end=time.time()
    print "Time for completion",end-start

And my results:

=======================
Populating list from Mmap
List length:  1181220
Time for completion 0.592000007629

=======================
Populating list from Fileopen
List length:  1181220
Time for completion 0.833999872208

=======================
Populating list by csv library
List length:  1181220
Time for completion 5.06700015068

So it seems that the csv library most people use is really a lot slower than the others. Maybe it will prove to be faster later, when I start extracting data from the csv file, but I cannot be sure of that yet. Any suggestions and tips before I start implementing? Thanks a lot!

As pointed out several other times, the first two methods do no actual string parsing; they just read a line at a time without extracting fields. I imagine the majority of the speed difference seen with CSV is due to that.

The CSV module is invaluable if you have any textual data that may include more of the 'standard' CSV syntax than just commas, especially if you're reading from an Excel format.

If you've just got lines like "1,2,3,4" you're probably fine with a simple split, but if you have lines like "1,2,'Hello, my name\'s fred'" you're going to go crazy trying to parse that without errors.

CSV will also transparently handle things like newlines in the middle of a quoted string. A simple for..in without CSV is going to have trouble with that.
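
For example, here is a quick sketch (with a made-up record) of the difference between naive line/comma splitting and csv.reader on a quoted field that contains both a comma and a newline:

import csv

# One logical CSV record whose quoted field contains a comma and a newline.
raw = '1,2,"Hello, my\nname is fred",4\n'

# Naive handling sees two lines, and the comma inside the quotes also splits the field.
print(raw.splitlines())   # ['1,2,"Hello, my', 'name is fred",4']
print(raw.split(','))     # ['1', '2', '"Hello', ' my\nname is fred"', '4\n']

# csv.reader understands the quoting and yields one row of four fields.
print(next(csv.reader(raw.splitlines(True))))
# ['1', '2', 'Hello, my\nname is fred', '4']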

The CSV module has always worked fine for me reading unicode strings if I use it like so:

import csv, codecs
f = csv.reader(codecs.open(filename, 'rU'))

It is plenty robust for importing multi-thousand-line files with unicode, quoted strings, newlines in the middle of quoted strings, lines with fields missing at the end, etc., all with reasonable read times.

I'd try using it first and only look for optimizations on top of it if you really need the extra speed.
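
Since the goal in the question is to first read the column labels and then extract certain columns according to some criteria, here is a minimal sketch of that using csv.DictReader; the column names and the filter condition are hypothetical:

import csv

def extract_columns(path, wanted, criterion):
    # Return the values of the 'wanted' columns for every row that satisfies 'criterion'.
    with open(path, newline='') as f:   # Python 3; use 'rb' under Python 2 as in the question
        reader = csv.DictReader(f)
        print(reader.fieldnames)        # the column labels from the header row
        return [[row[c] for c in wanted] for row in reader if criterion(row)]

# Hypothetical usage: keep two columns for rows whose 'price' field exceeds 100.
rows = extract_columns('csvSample.csv', ['id', 'price'],
                       lambda row: float(row['price']) > 100)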

How much do you care about sanitization?

The csv module is really good at understanding different csv file dialects and ensuring that escaping is happening properly, but it's definitely overkill and can often be way more trouble than it's worth (especially if you have unicode!)

A really naive implementation that properly handles backslash-escaped commas would be:

import re

def read_csv_naive(file_str):
    # Split each line on commas that are not preceded by a backslash.
    with open(file_str, 'r') as file_obj:
        return [re.split(r'(?<!\\),', x) for x in file_obj.read().splitlines()]

If your data is simple this will work great. If you have data that might need more escaping, the csv module is probably your most stable bet.

To read a large csv file, we can create a child process to read chunks of the file. Open the file to get the file resource object. Create a child process, with the resource as an argument. Read a set of lines as a chunk. Repeat the above steps until you reach the end of the file.

from multiprocessing import Process

def child_process(resource):
    # Do the read-and-process work here.
    .....

if __name__ == '__main__':
    # Get the file resource object.
    .....
    p = Process(target=child_process, args=(resource,))
    p.start()
    p.join()

For the code, go to this link; it will help you: http://besttechlab.wordpress.com/2013/12/14/read-csv-file-in-python/
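
A slightly fuller sketch of that idea, using a multiprocessing.Pool so the parsed chunks can be collected back; the chunk size and file name are made-up placeholders, and it assumes no quoted field spans a chunk boundary:

import csv
from multiprocessing import Pool

CHUNK_LINES = 100000   # lines per chunk; a made-up size

def parse_chunk(lines):
    # Parse one chunk of raw lines into rows of fields in a worker process.
    return list(csv.reader(lines))

if __name__ == '__main__':
    pool = Pool()
    pending = []
    chunk = []
    with open('csvSample.csv', newline='') as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= CHUNK_LINES:
                pending.append(pool.apply_async(parse_chunk, (chunk,)))
                chunk = []
        if chunk:
            pending.append(pool.apply_async(parse_chunk, (chunk,)))
    rows = [row for result in pending for row in result.get()]
    pool.close()
    pool.join()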

Your first 2 methods are NOT parsing each line into fields. The csv way is parsing out rows (NOT the same as lines!) of fields.

Do you really need to build a list in memory of all the lines?
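
If not, a sketch of the streaming alternative: iterate the csv reader and fold each row into whatever result you need, so the whole file is never held in memory (the column index and the average computed here are made-up placeholders):

import csv

total = 0.0
count = 0
with open('csvSample.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)       # the column labels
    for row in reader:
        total += float(row[2])  # made-up: accumulate the third column
        count += 1
print(total / count if count else 0)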
