
Looking for a more efficient way to reorganize a massive CSV in Python

I've been working on a problem where I have data from a large output .txt file, and now have to parse and reorganize certain values in the form of a .csv.

I've already written a script that inputs all the data into a .csv in columns based on what kind of data it is (Flight ID, Latitude, Longitude, etc.), but it's not in the correct order. All values are meant to be grouped based on the same Flight ID, in order from earliest time stamp to latest. Fortunately, my .csv has all values in the correct time order, but not grouped together appropriately according to Flight ID.

To clear my description up, it looks like this right now,

("Time x" is just to illustrate): (“时间x”只是为了说明):

20110117559515, , , , , , , , ,2446,6720,370,42  (Time 0)                               
20110117559572, , , , , , , , ,2390,6274,410,54  (Time 0)                               
20110117559574, , , , , , , , ,2391,6284,390,54  (Time 0)                               
20110117559587, , , , , , , , ,2385,6273,390,54  (Time 0)                               
20110117559588, , , , , , , , ,2816,6847,250,32  (Time 0) 
... 

and it's supposed to be ordered like this:

20110117559515, , , , , , , , ,2446,6720,370,42  (Time 0)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 1)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 2)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 3)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time N)
20110117559572, , , , , , , , ,2390,6274,410,54  (Time 0)
20110117559572, , , , , , , , ,23xx,62xx,4xx,54  (Time 1)
... and so on

There are some 1.3 million rows in the .csv I output to make things easier. I'm 99% confident the logic in the next script I wrote to fix the ordering is correct, but my fear is that it's extremely inefficient. I ended up adding a progress bar just to see if it's making any progress, and unfortunately this is what I see:

(screenshot of the progress bar)

Here's my code handling the crunching (skip down to the problem area if you like):

## a class I wrote to handle the huge .csv's ##
from BIGASSCSVParser import BIGASSCSVParser               
import collections                                                              


x = open('newtrajectory.csv')  #file to be reordered                                                  
linetlist = []                                                                  
tidict = {}               

# To save braincells I stored all the required values
# of each line into a dictionary of tuples.
# Index: Tuple

for line in x:                                                                  
    y = line.replace(',',' ')                                                   
    y = y.split()                                                               
    tup = (y[0],y[1],y[2],y[3],y[4])                                            
    linetlist.append(tup)                                                       
for k,v in enumerate(linetlist):                                                
    tidict[k] = v                                                               
x.close()                                                                       


trj = BIGASSCSVParser('newtrajectory.csv')                                      
uniquelFIDs = []                                                                
z = trj.column(0)   # List of out-of-order Flight IDs
for i in z:         # like in the example above                                                           
    if i in uniquelFIDs:                                                        
        continue                                                                
    else:                                                                       
        uniquelFIDs.append(i)  # Create list of unique FIDs to refer to later

queue = []                                                                              
p = collections.OrderedDict()                                                   
for k,v in enumerate(trj.column(0)):                                            
    p[k] = v  

All good so far, but it's in this next segment my computer either chokes, or my code just sucks:

for k in uniquelFIDs:
    # one full pass over all ~1.3 million entries for each unique Flight ID
    matches = [i for i, x in p.items() if x == k]   # row indices for this FID, in time order
    queue.extend(matches)

The idea was that for every unique Flight ID, in order, I'd iterate over the 1.3 million values and collect, in order, the index of each occurrence, appending those indices to a list. After that I was just going to read off that large list of indexes and write the contents of each of those rows into another .csv file. Ta da! Probably hugely inefficient.
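For context, a minimal sketch of that planned write-out step, assuming the original rows are re-read into a list so they can be looked up by index (the output file name here is just a placeholder, not from my actual script):

with open('newtrajectory.csv') as src:
    rows = src.readlines()               # original lines, still in file order

with open('reordered.csv', 'w') as out:  # 'reordered.csv' is an assumed name
    for idx in queue:                    # queue holds row indices grouped by Flight ID
        out.write(rows[idx])             # emit rows in the new grouped order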

What's wrong here? Is there a more efficient way to solve this problem? Is my code flawed, or am I just being cruel to my laptop?

Update:

I've found that with the amount of data I'm crunching, it'll take 9-10 hours. I had half of it correctly spat out in 4.5 hours. An overnight crunch I can get away with for now, but I will probably look to use a database or another language next time. I would have if I'd known what I was getting into ahead of time, lol.

After adjusting sleep settings for my SSD, it only took 3 hours to crunch.

You can try the UNIX sort utility:

sort -n -s -t, -k1,1 infile.csv > outfile.csv

-t sets the delimiter and -k sets the sort key. -s stabilizes the sort, and -n uses numeric comparison.

If the CSV file would fit into your RAM (e.g. less than 2GB), then you can just read the whole thing and do a sort on it:

import csv
with open('infile.csv', newline='') as fn, open('outfile.csv', 'w', newline='') as outfn:
    data = list(csv.reader(fn))           # read every row into memory
    data.sort(key=lambda line: line[0])   # stable sort on the first column (Flight ID)
    csv.writer(outfn).writerows(data)

That shouldn't take nearly as long if you don't thrash. Note that .sort is a stable sort, so it will preserve the time order of your file when the keys are equal.

If it won't fit into RAM, you will probably want to do something a bit clever. For example, you can store the file offsets of each line, along with the necessary information from the line (timestamp and flight ID), then sort on those, and write the output file using the line offset information.
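A rough sketch of that offset-based idea (not from the original answer; the input name matches the question, the output name is assumed, and it relies on the file already being in time order so a stable sort on Flight ID alone preserves the per-flight ordering):

index = []                                      # (flight_id, byte offset) for every row
with open('newtrajectory.csv', 'rb') as src:
    while True:
        offset = src.tell()                     # remember where this line starts
        line = src.readline()
        if not line:
            break
        # first column is the Flight ID; IDs are fixed-width digits here,
        # so a lexicographic sort matches numeric order
        index.append((line.split(b',', 1)[0], offset))

index.sort(key=lambda rec: rec[0])              # stable: equal IDs keep their time order

with open('newtrajectory.csv', 'rb') as src, open('sorted_trajectory.csv', 'wb') as out:
    for _, offset in index:
        src.seek(offset)                        # jump back to the stored line start
        out.write(src.readline())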
