What is the most efficient way to read a large CSV file (10M+ records) located on S3 (AWS) with Python?
I've been trying to find the fastest way to read a large CSV file (10+ million records) from S3 and do a couple of simple operations with one of the columns (total number of rows and mean). I have run a couple of tests, and the fastest so far was creating a dask dataframe, but I am wondering if there is any other alternative out there that may make things even faster.
Any suggestions? Thanks!
Test 1. Pandas read_csv: 92.36531567573547 seconds
import time

import boto3
import pandas as pd

start_time = time.time()
s3 = boto3.client('s3')  # not actually used below; pandas reads s3:// paths via s3fs
path = my_csvS3  # placeholder for the S3 path of the CSV
use_column = ['tip_amount']
df = pd.read_csv(path, usecols=use_column)
print(len(df))  # note: print(df.count) only printed the bound method, not the row count
print(df["tip_amount"].mean())
print("%s seconds" % (time.time() - start_time))
Test 2. Pandas read_csv in chunks: 78.15214204788208 seconds
import time

import pandas as pd

start_time = time.time()
tp = pd.read_csv(path, usecols=use_column, iterator=True, chunksize=5000000)  # gives TextFileReader
df = pd.concat(tp, ignore_index=True)
print(len(df))  # note: print(df.count) only printed the bound method, not the row count
print(df["tip_amount"].mean())
print("%s seconds" % (time.time() - start_time))
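For what it's worth, the chunked read in Test 2 can compute the row count and mean without ever concatenating the chunks, so only one chunk is in memory at a time. A minimal sketch of that pattern (the tiny in-memory CSV here just makes the snippet self-contained; in the real test `path` would be the S3 URL above):

```python
import io

import pandas as pd

# Stand-in for the S3 file: a tiny CSV with the same column name.
csv_data = io.StringIO("tip_amount\n1.0\n2.0\n3.0\n4.0\n")

total, count = 0.0, 0
for chunk in pd.read_csv(csv_data, usecols=['tip_amount'], chunksize=2):
    total += chunk['tip_amount'].sum()  # running sum across chunks
    count += len(chunk)                 # running row count across chunks

mean = total / count
print(count, mean)  # prints: 4 2.5
```

This avoids the large `pd.concat`, which is where much of the memory (and some of the time) in Test 2 goes.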
Test 3. dask dataframe: 54.183971881866455 seconds
import time

import dask.dataframe as dd

start_time = time.time()
df = dd.read_csv(path)
df = df['tip_amount']
dfp = df.compute()  # materializes the full column in memory
print(len(dfp))
print(dfp.mean())
print("%s seconds" % (time.time() - start_time))
This line

dfp = df.compute()

is an antipattern for dask. You split up the load, but then you form a single large dataframe in memory by concatenation. You would do better to compute what you want on the original chunks (note that len is special in Python, so this is less tidy than it would otherwise be):

dask.compute(df.shape[0], df.mean())
Also, you might find better performance with the distributed scheduler, even on a single machine, for some workloads. In this case, I believe most of the work is GIL-free, which is the critical consideration. Still, worth measuring the difference.
Additional: if you really only want one column, specify this in your read_csv - this goes for all backends!
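A minimal illustration of that column pruning - the same `usecols` keyword is accepted by both `pd.read_csv` and `dd.read_csv`, so only the requested column is parsed:

```python
import io

import pandas as pd

csv_text = "tip_amount,fare_amount\n1.0,10.0\n2.0,20.0\n"

# Only tip_amount is parsed; fare_amount is skipped entirely.
pruned = pd.read_csv(io.StringIO(csv_text), usecols=['tip_amount'])
print(list(pruned.columns))  # prints: ['tip_amount']
```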