Python CSV - optimizing CSV Read and Write

I was fiddling around with Python when my boss assigned me a quite daunting task.

He gave me a CSV file, around 14 GB in size, and asked whether I could inflate it into a delimited file of about 4 TB by replicating its contents several times.

For example, take this CSV:

TIME_SK,ACCOUNT_NUMBER,ACCOUNT_TYPE_SK,ACCOUNT_STATUS_SK,CURRENCY_SK,GLACC_BUSINESS_NAME,PRODUCT_SK,PRODUCT_TERM_SK,NORMAL_BAL,SPECIAL_BAL,FINAL_MOV_YTD_BAL,NO_OF_DAYS_MTD,NO_OF_DAYS_YTD,BANK_FLAG,MEASURE_ID,SOURCE_SYSTEM_ID
20150131,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150131,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150131,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150131,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150131,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150131,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1

He wants me to inflate the size by replicating the contents of the CSV while altering the value in the TIME_SK column, like below:

TIME_SK,ACCOUNT_NUMBER,ACCOUNT_TYPE_SK,ACCOUNT_STATUS_SK,CURRENCY_SK,GLACC_BUSINESS_NAME,PRODUCT_SK,PRODUCT_TERM_SK,NORMAL_BAL,SPECIAL_BAL,FINAL_MOV_YTD_BAL,NO_OF_DAYS_MTD,NO_OF_DAYS_YTD,BANK_FLAG,MEASURE_ID,SOURCE_SYSTEM_ID
20150131,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150131,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150131,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150131,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150131,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150131,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1
20150201,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150201,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150201,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150201,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150201,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150201,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1

and so on.

I was able to write a Python script that does the job, but when it was used on the real CSV file, tens of gigabytes in size with hundreds of millions of rows, the task took far too long to complete (there was a time constraint back then; however, he has now asked me to do it again).

I am using Python's built-in CSV writer. After a bit of research, I came up with two different approaches:

1. The Old and Trusted Iterator

This is the first version of my script; it does the job all right, but it takes too long to tackle the humongous CSV.

. . . omitted . . .
with open('../csv/DAILY_DDMAST.csv', 'rb') as csvinput:
    with open('../result/DAILY_DDMAST_result1'+name_interval+'.csv', 'wb') as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')
# This part copies the original CSV to a new file
        for row in reader:
            writer.writerow(row)
        print("Done copying. Time elapsed: %s seconds, Total time: %s seconds" % 
              ((time.time() - start_time), (time.time() - start_time)))
        i = 0
        while i < 5:
# This part replicates the content of CSV, with modifying the TIME_SK value
            counter_time = time.time()
            for row in reader:
                newdate = datetime.datetime.strptime(row[0], "%Y%m%d") + datetime.timedelta(days=i)
                row[0] = newdate.strftime("%Y%m%d")
                writer.writerow(row)
            csvinput.seek(0)
            next(reader, None)
            print("Done processing for i = %d. Time elapsed: %s seconds, Total time: %s seconds" % 
              (i+1, (counter_time - start_time), (time.time() - start_time)))
            i += 1
. . . omitted . . . 

As I understand it, the script iterates over each row in the CSV with for row in reader, then writes each row to the new file with writer.writerow(row). I also found that iterating the source file this way is repetitive and time consuming, so I thought another approach might be more efficient...

2. The Bucket

This was intended as an "upgrade" to the first version of the script.

. . . omitted . . .
with open('../csv/DAILY_DDMAST.csv', 'rb') as csvinput:
    with open('../result/DAILY_DDMAST_result2'+name_interval+'.csv', 'wb') as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')
        csv_buffer = list()
        for row in reader:
# Here, rather than directly writing the iterated row, I stored it in a list.
# If the list reached 1 mio rows, then it writes to the file and empty the "bucket"
            csv_buffer.append(row)
            if len(csv_buffer) > 1000000:
                writer.writerows(csv_buffer)
                del csv_buffer[:]
        writer.writerows(csv_buffer)
        print("Done copying. Time elapsed: %s seconds, Total time: %s seconds" % 
              ((time.time() - start_time), (time.time() - start_time)))
        i = 0
        while i < 5:
            counter_time = time.time()
            del csv_buffer[:]
            for row in reader:
                newdate = datetime.datetime.strptime(row[0], "%Y%m%d") + datetime.timedelta(days=i)
                row[0] = newdate.strftime("%Y%m%d")
# Same goes here
                csv_buffer.append(row)
                if len(csv_buffer) > 1000000:
                    writer.writerows(csv_buffer)
                    del csv_buffer[:]
            writer.writerows(csv_buffer)
            csvinput.seek(0)
            next(reader, None)
            print("Done processing for i = %d. Time elapsed: %s seconds, Total time: %s seconds" % 
                  (i+1, (counter_time - start_time), (time.time() - start_time)))            
            i += 1
. . . omitted . . . 

I thought that by storing the rows in memory and then writing them all at once with writerows, I could save time. However, that was not the case. I found that even if I buffer the rows to be written to the new CSV, writerows still iterates over the list and writes the rows one by one, so it takes nearly as long as the first script...

At this point, I don't know whether I should come up with a better algorithm or whether there is something ready-made I could use - something like writerows, only one that doesn't iterate but writes everything at once.

I don't know whether such a thing is even possible.

Anyway, I need help with this, and if anyone could shed some light, I would be very thankful!

I don't have a 14 GB file to try this with, so memory footprint is a concern. Someone who knows regex better than I do might have some performance-tweaking suggestions.

The main concept is: don't iterate through each line when it's avoidable. Let re do its magic on the whole body of text, then write that body to the file.

import re

newdate = "20150201,"
f = open('sample.csv', 'r')
g = open('result.csv', 'w')

body = f.read()
## keeps the original csv
g.write(body)  
# strip off the header -- we already have one.
header, mainbody = body.split('\n', 1)
# replace all the dates
newbody = re.sub(r"20150131,", newdate, mainbody)
#end of the body didn't have a newline. Adding one back in.
g.write('\n' + newbody)

f.close()
g.close()

Batch-writing your rows isn't really going to be an improvement, because your write IOs are still going to be the same size. Batching up writes only gives you an improvement if you can increase your IO size, which reduces the number of system calls and lets the IO system deal with fewer but larger writes.

Honestly, I wouldn't complicate the code with batch writing for maintainability reasons, but I can certainly understand the desire to experiment with improving the speed, if only for educational reasons.

What you want to do is batch up your writes -- batching up your csv rows doesn't really accomplish this.

[Example using StringIO removed... there's a better way.]

Python's write() uses buffered I/O; by default it buffers at 4 KB (on Linux). If you open the file with a buffering parameter, you can make the buffer bigger:

with open("/tmp/x", "w", 1024*1024) as fd:
    for i in range(0, 1000000):
        fd.write("line %d\n" %i)

Then your writes will be 1 MB. strace output:

write(3, "line 0\nline 1\nline 2\nline 3\nline"..., 1048576) = 1048576
write(3, "ine 96335\nline 96336\nline 96337\n"..., 1048576) = 1048576
write(3, "1\nline 184022\nline 184023\nline 1"..., 1048576) = 1048576
write(3, "ne 271403\nline 271404\nline 27140"..., 1048576) = 1048576
write(3, "58784\nline 358785\nline 358786\nli"..., 1048576) = 1048576
write(3, "5\nline 446166\nline 446167\nline 4"..., 1048576) = 1048576
write(3, "ne 533547\nline 533548\nline 53354"..., 1048576) = 1048576
[...]

Your simpler original code will work; you only need to change the buffer size in the open() calls (I would change it for both source and destination).

My other suggestion is to abandon csv, but that potentially carries some risk. If you have quoted strings with commas in them, you have to create the right kind of parser.

BUT -- since the field you want to modify is fairly regular and is the first field, you may find it much simpler to just have a readline/write loop that replaces the first field and ignores the rest.

#!/usr/bin/python
import datetime
import re

with open("/tmp/out", "w", 1024*1024) as fdout, open("/tmp/in", "r", 1024*1024) as fdin:
    for i in range(0, 6):
        fdin.seek(0)
        for line in fdin:
            if i == 0:
                fdout.write(line)
                continue
            match = re.search(r"^(\d{8}),", line)
            if match:
                date = datetime.datetime.strptime(match.group(1), "%Y%m%d")
                fdout.write(re.sub("^\d{8},", (date + datetime.timedelta(days=i)).strftime("%Y%m%d,"), line))
            else:
                if line.startswith("TIME_SK,"):
                    continue
                raise Exception("Could not find /^\d{8},/ in '%s'" % line)

If order doesn't matter, then don't reread the file over and over:

#!/usr/bin/python
import datetime
import re

with open("/tmp/in", "r", 1024*1024) as fd, open("/tmp/out", "w", 1024*1024) as out:
    for line in fd:
        match = re.search("^(\d{8}),", line)
        if match:
            out.write(line)
            date = datetime.datetime.strptime(match.group(1), "%Y%m%d")
            for days in  range(1, 6):
                out.write(re.sub("^\d{8},", (date + datetime.timedelta(days=days)).strftime("%Y%m%d,"), line))
        else:
            if line.startswith("TIME_SK,"):
                out.write(line)
                continue
            raise Exception("Could not find /^\d{8},/ in %s" % line)

I went ahead and profiled one of these with python -mcProfile and was surprised how much time was spent in strptime. Also try caching your strptime() calls by using this memoized strptime():

_STRPTIME = {}

def strptime(s):
    if s not in _STRPTIME:
        _STRPTIME[s] = datetime.datetime.strptime(s, "%Y%m%d")
    return _STRPTIME[s]
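
For example, in the single-pass version above, the cached helper would simply replace the direct datetime.datetime.strptime() call. A sketch, keeping the same loop structure and file-layout assumptions as before:

        match = re.search(r"^(\d{8}),", line)
        if match:
            out.write(line)
            # strptime() here is the memoized helper above: each distinct date
            # string (and there are only a handful per file) is parsed exactly once.
            date = strptime(match.group(1))
            for days in range(1, 6):
                out.write(re.sub(r"^\d{8},", (date + datetime.timedelta(days=days)).strftime("%Y%m%d,"), line))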

First of all, you're going to be limited by the write speed. Typical write speed for a desktop machine is on the order of 40 seconds per gigabyte. You need to write 4,000 gigabytes, so it's going to take on the order of 160,000 seconds (44.5 hours) just to write the output. The only way to reduce that time is to get a faster drive.

To make a 4 TB file by replicating a 14 GB file, you have to copy the original file 286 (actually 285.71) times. The simplest way to do it is:

open output file
starting_date = date on first transaction
for pass = 1 to 286
    open original file
    while not end of file
        read transaction
        replace date
        write to output
        increment date
    end while
end for
close output file

But with a typical read speed of about 20 seconds per gigabyte, that's 80,000 seconds (22 hours and 15 minutes) just for reading.

You can't do anything about the writing time, but you can probably reduce the reading time by a lot.

If you can buffer the whole 14 GB input file in memory, reading time drops to about five minutes.

If you don't have the memory to hold the 14 GB, consider reading it into a compressed memory stream. That CSV should compress quite well -- to less than half of its current size. Then, rather than opening the input file every time through the loop, you just re-initialize a stream reader from the compressed copy of the file you're holding in memory.

In C#, I'd just use the MemoryStream and GZipStream classes. A quick Google search indicates that similar capabilities exist in Python, but since I'm not a Python programmer I can't tell you exactly how to use them.
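
The rough Python equivalents appear to be io.BytesIO in place of MemoryStream and the gzip module in place of GZipStream. Below is a minimal sketch of the idea, untested against a real 14 GB file; the fixed-width leading date and the /tmp paths are assumptions carried over from the question and the earlier answers:

import datetime
import gzip
import io
import shutil

# Read the source CSV once and keep a gzip-compressed copy of it in memory.
compressed = io.BytesIO()
with open("/tmp/in", "rb") as src, gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
    shutil.copyfileobj(src, gz)

with open("/tmp/out", "w", 1024*1024) as out:
    for i in range(0, 286):
        # Re-initialize a reader over the in-memory compressed copy
        # instead of re-reading the file from disk on every pass.
        compressed.seek(0)
        with gzip.GzipFile(fileobj=compressed, mode="rb") as gz:
            for raw in gz:
                line = raw.decode("ascii")
                if line.startswith("TIME_SK,"):
                    if i == 0:
                        out.write(line)  # keep the header only once
                    continue
                # The memoized strptime() from the earlier answer could be
                # dropped in here as well to avoid re-parsing the same date.
                date = datetime.datetime.strptime(line[:8], "%Y%m%d")
                out.write((date + datetime.timedelta(days=i)).strftime("%Y%m%d") + line[8:])

Passes 2 through 286 then decompress from RAM instead of hitting the disk again; for repetitive text like this, gzip decompression should be much cheaper than re-reading 14 GB from disk on every pass.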
