Fastest method to read big data files in Python

I have about 60 huge (>2 GB) CSV files which I want to loop through to make subselections (e.g. each file contains one month of data on various financial products, and I want to build a 60-month time series for each product).

Reading an entire file into memory (e.g. by loading the file in Excel or MATLAB) is unworkable, so my initial search on Stack Overflow led me to try Python. My strategy was to loop through each line iteratively and write it out to some folder. This strategy works fine, but it is extremely slow.

From my understanding there is a trade-off between memory usage and computation speed. Loading the entire file into memory is one end of the spectrum (the computer crashes); loading a single line into memory at a time is obviously the other end (computation time is about 5 hours).

So my main question is: *Is there a way to load multiple lines into memory at once, so as to make this process (100 times?) faster, without losing functionality? And if so, how would I implement it?* Or am I going about this all wrong? Mind you, below is just a simplified version of what I am trying to do (I might want to make subselections in dimensions other than time). Assume that the original data files have no meaningful ordering (other than being split into 60 monthly files).

The particular method I am trying is:

# Creates a time series per bond
import csv
import linecache

# 'allBonds.txt' holds one row of comma-separated bond identifiers per month.
# There are 60 large files named financialData_<month><year>.txt.

filedoc = []
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
years = ['08', '09', '10', '11', '12']

for year in years:
    for month in months:
        filedoc.append('financialData_' + month + year + '.txt')

for x in range(60):
    # linecache numbers lines from 1, so month x lives on line x + 1
    line = linecache.getline('allBonds.txt', x + 1)
    bonds = line.strip().split(',')  # the identifiers for this particular month

    with open(filedoc[x]) as text_file:
        for line in text_file:
            temp = line.split(';')

            if temp[2] in bonds:  # is this bond among those we search for?
                output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                datawriter = csv.writer(output_file, dialect='excel',
                                        delimiter='^', quoting=csv.QUOTE_MINIMAL)
                datawriter.writerow(temp)
                output_file.close()

Thanks in advance.

PS Just to make sure: the code works at the moment (though any suggestions are of course welcome), but the issue is speed.

I would test pandas.read_csv, mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file. It supports reading the file in chunks (the iterator=True option).
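
For example, each chunk can be filtered with a vectorized membership test instead of a line-by-line loop. A minimal sketch, assuming the same semicolon-delimited layout as the question; the chunksize value, the example file name, and the groupby/to_csv step are illustrative assumptions, not part of this answer:

import pandas as pd

# one row of comma-separated bond identifiers for the month in question
with open('allBonds.txt') as f:
    bonds = set(f.readline().strip().split(','))

# passing chunksize turns read_csv into an iterator over DataFrames
for chunk in pd.read_csv('financialData_jan08.txt', sep=';', header=None,
                         chunksize=200000):
    # column 2 holds the bond identifier, as in the question's code
    matches = chunk[chunk[2].astype(str).isin(bonds)]
    for bond, rows in matches.groupby(2):
        rows.to_csv('monthOutput' + str(bond) + 'financialData_jan08.txt',
                    mode='a', sep='^', header=False, index=False)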

I think this part of your code may cause serious performance problems if the condition is matched frequently:

if temp[2] in bonds:  # is this bond among those we search for?
    output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
    datawriter = csv.writer(output_file, dialect='excel', delimiter='^',
                            quoting=csv.QUOTE_MINIMAL)
    datawriter.writerow(temp)
    output_file.close()

It would be better to avoid opening a file, creating a csv.writer() object, and then closing the file inside a loop.
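
One way to follow this advice is to open each output file once, cache its csv.writer in a dict keyed by bond identifier, and close everything after the loop. A minimal sketch reusing bonds, filedoc and x from the question's code (the writer cache itself is an illustration, not code from this answer):

import csv

writers = {}  # bond identifier -> (file handle, csv.writer)

with open(filedoc[x]) as text_file:
    for line in text_file:
        temp = line.split(';')
        if temp[2] in bonds:
            if temp[2] not in writers:
                # first time we see this bond: open its output file and keep it open
                f = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                writers[temp[2]] = (f, csv.writer(f, dialect='excel', delimiter='^',
                                                  quoting=csv.QUOTE_MINIMAL))
            writers[temp[2]][1].writerow(temp)

# close all output files once the month has been processed
for f, _ in writers.values():
    f.close()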
