
Fastest way to write large CSV with Python

I want to write some random sample data in a csv file until it is 1GB big. The following code is working:

import numpy as np
import uuid
import csv
import os
outfile = 'data.csv'
outsize = 1024 # MB
with open(outfile, 'a', newline='') as csvfile:  # text mode for csv.writer on Python 3
    wtr = csv.writer(csvfile)
    while (os.path.getsize(outfile)//1024**2) < outsize:
        wtr.writerow(['%s,%.6f,%.6f,%i' % (uuid.uuid4(), np.random.random()*50, np.random.random()*50, np.random.randint(1000))])    

How can I make it faster?

The problem appears to be mainly IO-bound. You can improve the I/O a bit by writing to the file in larger chunks instead of writing one line at a time:

import numpy as np
import uuid
import os
outfile = 'data-alt.csv'
outsize = 10 # MB
chunksize = 1000
with open(outfile, 'a') as csvfile:  # text mode, since we write str objects
    while (os.path.getsize(outfile)//1024**2) < outsize:
        data = [[uuid.uuid4() for i in range(chunksize)],
                np.random.random(chunksize)*50,
                np.random.random(chunksize)*50,
                np.random.randint(1000, size=(chunksize,))]
        csvfile.writelines(['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)])   

You can experiment with the chunksize (the number of rows written per chunk) to see what works best on your machine.
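
If you want to automate that experiment, a rough sketch along the following lines can time a fixed number of rows for several candidate chunk sizes (the helper name time_chunked_write, the test file path and the candidate sizes are placeholders of mine, not part of the original answer):

import os
import time
import uuid
import numpy as np

def time_chunked_write(chunksize, total_rows=100_000, path='chunk-test.csv'):
    # Write total_rows rows in chunks of `chunksize` and return the elapsed seconds.
    if os.path.exists(path):
        os.remove(path)
    t0 = time.perf_counter()
    with open(path, 'a') as f:
        written = 0
        while written < total_rows:
            n = min(chunksize, total_rows - written)
            data = [[uuid.uuid4() for _ in range(n)],
                    np.random.random(n) * 50,
                    np.random.random(n) * 50,
                    np.random.randint(1000, size=(n,))]
            f.writelines('%s,%.6f,%.6f,%i\n' % row for row in zip(*data))
            written += n
    return time.perf_counter() - t0

for chunksize in (100, 1_000, 10_000, 100_000):
    print(chunksize, round(time_chunked_write(chunksize), 3))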


Here is a benchmark, comparing the above code to your original code, with outsize set to 10 MB:

% time original.py

real    0m5.379s
user    0m4.839s
sys 0m0.538s

% time write_in_chunks.py

real    0m4.205s
user    0m3.850s
sys 0m0.351s

So this is about 25% faster than the original code.


PS. I tried replacing the calls to os.path.getsize with an estimate of the total number of lines needed. Unfortunately, it did not improve the speed. Since the number of bytes needed to represent the final int varies, the estimate is also inexact -- that is, it does not perfectly replicate the behavior of your original code. So I left the os.path.getsize in place.
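
The estimation approach described in this PS could look roughly like the sketch below (the file name, the chunksize value and variable names such as est_row_len and est_rows are mine); the row-length guess is only approximate, which is exactly the inexactness mentioned above:

import uuid
import numpy as np

outfile = 'data-est.csv'
outsize = 10 * 1024 ** 2   # target size in bytes (10 MB here)
chunksize = 1000

# Rough per-row byte count: 36-char uuid + two ~9-char floats + an int of up to
# 3 digits + 3 commas + newline.  Because the float and int widths vary, the
# resulting file size only approximates the target.
est_row_len = 36 + 1 + 9 + 1 + 9 + 1 + 3 + 1
est_rows = outsize // est_row_len

with open(outfile, 'a') as csvfile:
    written = 0
    while written < est_rows:
        n = min(chunksize, est_rows - written)
        data = [[uuid.uuid4() for _ in range(n)],
                np.random.random(n) * 50,
                np.random.random(n) * 50,
                np.random.randint(1000, size=(n,))]
        csvfile.writelines('%s,%.6f,%.6f,%i\n' % row for row in zip(*data))
        written += n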

Removing all the unnecessary stuff should make it faster and easier to understand:

import random
import uuid
outfile = 'data.csv'
outsize = 1024 * 1024 * 1024 # 1GB
with open(outfile, 'a') as csvfile:  # text mode, since txt is a str
    size = 0
    while size < outsize:
        txt = '%s,%.6f,%.6f,%i\n' % (uuid.uuid4(), random.random()*50, random.random()*50, random.randrange(1000))
        size += len(txt)
        csvfile.write(txt)
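
Both ideas combine naturally: keep this answer's in-memory byte counter but write in chunks as in the answer above. A rough sketch (the chunksize value is arbitrary, and the loop can overshoot the target by at most one chunk):

import random
import uuid

outfile = 'data-combined.csv'
outsize = 1024 * 1024 * 1024   # 1 GB target
chunksize = 1000               # rows generated per write call

with open(outfile, 'a') as csvfile:
    size = 0
    while size < outsize:
        rows = ['%s,%.6f,%.6f,%i\n' % (uuid.uuid4(),
                                       random.random() * 50,
                                       random.random() * 50,
                                       random.randrange(1000))
                for _ in range(chunksize)]
        size += sum(len(r) for r in rows)   # count bytes in memory, no filesystem call
        csvfile.writelines(rows)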

This is an update building on unutbu's answer above:

A large percentage of the time was spent generating random numbers and checking the file size.

If you generate the rows ahead of time, you can assess the raw disk I/O performance:

import time
from pathlib import Path
import numpy as np
import uuid
outfile = Path('data-alt.csv')
chunksize = 1_800_000  # number of rows to pre-generate

# Build every row up front so that the timed block below measures only the write.
data = [
    [uuid.uuid4() for i in range(chunksize)],
    np.random.random(chunksize) * 50,
    np.random.random(chunksize) * 50,
    np.random.randint(1000, size=(chunksize,))
]
rows = ['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)]

t0 = time.time()
with open(outfile, 'a') as csvfile:
    csvfile.writelines(rows)
tdelta = time.time() - t0
print(tdelta)

On my standard 860 evo ssd (not nvme), I get 1.43 sec for 1_800_000 rows, so that's 1,258,741 rows/sec (not too shabby imo).
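
As a small follow-on to the snippet above (reusing its chunksize, rows and tdelta variables), something like this prints throughput instead of the raw time; the MB/sec figure is derived from the average length of the generated rows and is only an estimate:

rows_per_sec = chunksize / tdelta
avg_row_bytes = sum(len(r) for r in rows) / len(rows)
mb_per_sec = rows_per_sec * avg_row_bytes / 1024 ** 2
print('%.0f rows/sec, ~%.0f MB/sec' % (rows_per_sec, mb_per_sec))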
