What is the fastest way to output a large DataFrame into a CSV file?

With Python/pandas, I find that df.to_csv(fname) writes about 1 million rows per minute. I can sometimes improve performance by a factor of 7 like this:

import numpy as np

def df2csv(df, fname, myformats=None, sep=','):
  """
    Write df to fname as delimited text. Faster than to_csv:
    7 times faster for numbers if formats are specified,
    2 times faster for strings.
    Note - be careful. It doesn't add quotes and doesn't check
    for quotes or separators inside elements.
    We've seen output time going down from 45 min to 6 min
    on a simple numeric 4-col dataframe with 45 million rows.
  """
  if len(df.columns) <= 0:
    return
  Nd = len(df.columns)
  Nd_1 = Nd - 1
  formats = list(myformats or [])  # take a copy to modify it
  Nf = len(formats)
  # make sure we have formats for all columns
  if Nf < Nd:
    for ii in range(Nf, Nd):
      coltype = df[df.columns[ii]].dtype
      ff = '%s'
      if coltype == np.int64:
        ff = '%d'
      elif coltype == np.float64:
        ff = '%f'
      formats.append(ff)
  with open(fname, 'w') as fh:  # the file is closed even if a write fails
    fh.write(sep.join(df.columns) + '\n')  # header uses sep, not a hard-coded ','
    for row in df.itertuples(index=False):
      ss = ''
      for ii in xrange(Nd):  # Python 2; use range on Python 3
        ss += formats[ii] % row[ii]
        if ii < Nd_1:
          ss += sep
      fh.write(ss + '\n')

from pandas import DataFrame, read_csv

aa = DataFrame({'A': range(1000000)})
aa['B'] = aa.A + 1.0
aa['C'] = aa.A + 2.0
aa['D'] = aa.A + 3.0

timeit -r1 -n1 aa.to_csv('junk1')    # 52.9 sec
timeit -r1 -n1 df2csv(aa,'junk3',myformats=['%d','%.1f','%.1f','%.1f']) #  7.5 sec

Note: the increase in performance depends on dtypes. But in my tests, at least, to_csv() was always much slower than non-optimized Python.

If I have a CSV file with 45 million rows, then:

aa = read_csv(infile)  #  1.5 min
aa.to_csv(outfile)     # 45 min
df2csv(aa,...)         # ~6 min

Questions:

What are the ways to make the output even faster?
What's wrong with to_csv()? Why is it so slow?

Note: my tests were done using pandas 0.9.1 on a local drive on a Linux server.

Lev, pandas has rewritten to_csv to make a big improvement in native speed. The process is now I/O bound, and it accounts for many subtle dtype issues and quoting cases. Here are our performance results vs. 0.10.1 (in the upcoming 0.11 release). Times are in ms; a lower ratio is better.

Results:
                                            t_head  t_baseline      ratio
name                                                                     
frame_to_csv2 (100k) rows                 190.5260   2244.4260     0.0849
write_csv_standard  (10k rows)             38.1940    234.2570     0.1630
frame_to_csv_mixed  (10k rows, mixed)     369.0670   1123.0412     0.3286
frame_to_csv (3k rows, wide)              112.2720    226.7549     0.4951

So throughput for a single dtype (e.g. floats) on a frame that is not too wide is about 20M rows/min. Here is your example from above:

In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')}) 
In [13]: df['B'] = df['A'] + 1.0   
In [14]: df['C'] = df['A'] + 2.0
In [15]: df['D'] = df['A'] + 2.0
In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
1 loops, best of 1: 119 s per loop

In 2019, for cases like this, it may be better to just use numpy. Look at the timings:

aa.to_csv('pandas_to_csv', index=False)
# 6.47 s

df2csv(aa,'code_from_question', myformats=['%d','%.1f','%.1f','%.1f'])
# 4.59 s

from numpy import savetxt

savetxt(
    'numpy_savetxt', aa.values, fmt='%d,%.1f,%.1f,%.1f',
    header=','.join(aa.columns), comments=''
)
# 3.5 s

So you can cut the time by roughly a factor of two using numpy (3.5 s vs. 6.47 s for to_csv). This, of course, comes at the cost of reduced flexibility compared to aa.to_csv.

Benchmarked with Python 3.7, pandas 0.23.4, numpy 1.15.2 (xrange was replaced by range to make the posted function from the question work in Python 3).

P.S. If you need to include the index, savetxt will work fine: just pass df.reset_index().values and adjust the formatting string accordingly.
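A minimal sketch of the index variant, assuming a small made-up frame (the column names, values, and output path here are illustrative, not from the benchmark above). reset_index() moves an unnamed index into a first column called "index", so the format string gets one extra field:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame; names and values are made up for this sketch.
df = pd.DataFrame({'A': np.arange(5)})
df['B'] = df['A'] + 1.0

# reset_index() turns the (unnamed) index into a regular column named "index".
with_index = df.reset_index()

out_path = os.path.join(tempfile.mkdtemp(), 'numpy_savetxt_with_index.csv')
np.savetxt(
    out_path, with_index.values,
    fmt='%d,%d,%.1f',  # one extra %d at the front for the index column
    header=','.join(with_index.columns), comments=''
)

# Read the file back to inspect what was written.
with open(out_path) as fh:
    savetxt_lines = fh.read().splitlines()
```

Note that .values on a frame with mixed int and float columns gives a single float array, which is why the integer columns still need an explicit %d in the format string.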

Use chunksize. I have found that it makes a huge difference. If you have memory to spare, use a large chunksize (number of rows) so that more rows are held in memory and written at once.

Your df2csv function is very nice, except that it makes a lot of assumptions and doesn't work for the general case.

If it works for you, that's good, but be aware that it is not a general solution. CSV fields can contain commas, so what happens if this tuple has to be written? ('a,b','c')
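For reference, a minimal sketch of passing chunksize to to_csv. The frame and the value 250 are made up for illustration; how much it helps likely depends on the pandas version and the data, so it is worth timing a few sizes on your own frame:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame; a real test would use millions of rows.
df = pd.DataFrame({'A': np.arange(1000)})
df['B'] = df['A'] + 1.0

out_path = os.path.join(tempfile.mkdtemp(), 'chunked.csv')

# chunksize is the number of rows to_csv formats and writes per batch.
# 250 is an arbitrary value for this sketch, not a recommendation.
df.to_csv(out_path, index=False, chunksize=250)

# The output is the same file regardless of chunksize; only the batching differs.
with open(out_path) as fh:
    chunked_lines = fh.read().splitlines()
```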

The Python csv module would quote that value so that no confusion arises, and would escape quotes if quotes are present in any of the values. Of course, generating something that works in all cases is much slower. But I suppose you only have a bunch of numbers.

You could try this and see if it is faster:
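To make the quoting point concrete, here is a small sketch using only the standard library. The second row is an extra made-up example to show quote escaping:

```python
import csv
import io

# Naive joining loses the field boundary: this reads back as three fields.
naive_line = ','.join(('a,b', 'c'))

buf = io.StringIO()
writer = csv.writer(buf)  # defaults: quote only when needed, double embedded quotes
writer.writerow(('a,b', 'c'))        # comma inside a field -> field gets quoted
writer.writerow(('say "hi"', 'd'))   # quote inside a field -> quotes are doubled
csv_lines = buf.getvalue().splitlines()

# Reading the quoted output back recovers the original two fields per row.
round_trip = list(csv.reader(io.StringIO(buf.getvalue())))
```

Here csv_lines is ['"a,b",c', '"say ""hi""",d'], while naive_line is just 'a,b,c'.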

# data is a tuple containing tuples of integers
# f is a file object already opened for writing

for row in data:
    for col in range(len(row)):
        f.write('%d' % row[col])
        if col < len(row)-1:
            f.write(',')
    f.write('\n')

I don't know if that would be faster. If not, it's because too many system calls are made, so you might use StringIO instead of direct output and then dump it to a real file every once in a while.
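A rough sketch of the StringIO idea, in case it helps. The helper name write_rows_buffered and the flush_every parameter are made up for this example. Python file objects already buffer writes internally, so whether this beats direct writes is an assumption you would have to measure:

```python
import io
import os
import tempfile


def write_rows_buffered(rows, path, flush_every=10000):
    """Hypothetical helper: format rows into memory, dump to disk in batches."""
    buf = io.StringIO()
    with open(path, 'w') as f:
        for n, row in enumerate(rows, 1):
            buf.write(','.join('%d' % v for v in row))
            buf.write('\n')
            if n % flush_every == 0:
                f.write(buf.getvalue())  # one large write instead of many small ones
                buf = io.StringIO()      # start a fresh in-memory buffer
        f.write(buf.getvalue())          # write whatever is left over


# data is a tuple containing tuples of integers, as in the snippet above.
data = tuple((i, i + 1, i + 2) for i in range(25))
out_path = os.path.join(tempfile.mkdtemp(), 'buffered.csv')
write_rows_buffered(data, out_path, flush_every=10)

with open(out_path) as fh:
    buffered_lines = fh.read().splitlines()
```

Like the loop above, this only handles integers and does no quoting, so it carries the same caveats.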
