What is the most efficient way to read a large CSV file (10M+ records) located on S3 (AWS) with Python?

I've been trying to find the fastest way to read a large CSV file (10+ million records) from S3 and do a couple of simple operations on one of the columns (total number of rows and mean). I have run a couple of tests, and the fastest so far was creating a dask dataframe, but I am wondering if there is any other alternative out there that may make things even faster.

Any suggestions? Thanks!

Test 1. Pandas read csv: 92.36531567573547 seconds

import time
import boto3
import pandas as pd

start_time = time.time()
s3 = boto3.client('s3')           # not used below; pandas reads s3:// paths via s3fs
path = my_csvS3                   # S3 URI of the CSV file
use_column = ['tip_amount']
df = pd.read_csv(path, usecols=use_column)
print(df.count())                 # df.count without parentheses only prints the method
print(df["tip_amount"].mean())
print("%s seconds" % (time.time() - start_time))

Test 2. Pandas read csv in chunks: 78.15214204788208 seconds

import time
import pandas as pd

start_time = time.time()
# iterator/chunksize return a TextFileReader that yields DataFrame chunks
tp = pd.read_csv(path, usecols=use_column, iterator=True, chunksize=5000000)
df = pd.concat(tp, ignore_index=True)   # re-assembles everything in memory
print(df.count())                       # df.count without parentheses only prints the method
print(df["tip_amount"].mean())
print("%s seconds" % (time.time() - start_time))

Test 3. dask dataframe: 54.183971881866455 seconds


import time
import boto3
import dask.dataframe as dd

start_time = time.time()
s3 = boto3.client('s3')           # not used below; dask reads s3:// paths via s3fs
df = dd.read_csv(path)
df = df['tip_amount']             # keep only the column of interest (lazily)
dfp = df.compute()                # pulls the whole column into one in-memory pandas object
print(len(dfp))
print(dfp.mean())
print("%s seconds" % (time.time() - start_time))

This line

dfp = df.compute()

is an antipattern for dask. You split up the load, but then you form a single large dataframe in memory by concatenation. You would do better to compute what you want on the original chunks (note that len is special in Python, so this is less tidy than it would otherwise be):

dask.compute(df.shape[0], df.mean())
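
Something along these lines should do it (a rough sketch, untested, assuming the same path and tip_amount column as in the tests above):

import time
import dask
import dask.dataframe as dd

start_time = time.time()
# Read only the needed column; dask keeps the data split into partitions
df = dd.read_csv(path, usecols=['tip_amount'])
# One pass over the partitions, no concatenation into a single big frame
n_rows, mean_tip = dask.compute(df.shape[0], df['tip_amount'].mean())
print(n_rows)
print(mean_tip)
print("%s seconds" % (time.time() - start_time))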

Also, you might find better performance with the distributed scheduler, even on a single machine, for some workloads. In this case, I believe most of the work is GIL-free, which is the critical consideration. Still, it is worth measuring the difference.
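
On a single machine that just means creating a dask.distributed client before calling compute (a minimal sketch; Client() with no arguments starts a local cluster with default worker settings, which you may want to tune):

from dask.distributed import Client

# With no arguments this starts a local cluster and registers it as the
# default scheduler, so the dask.compute() call above runs on it unchanged
client = Client()

# ... run the dd.read_csv / dask.compute code from above ...

client.close()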

Additional: if you really only want one column, specify this in your read_csv - this goes for all backends!
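
For example (a sketch using the same path as above), the single-column read looks essentially the same in pandas and dask:

import pandas as pd
import dask.dataframe as dd

# Only 'tip_amount' is kept after parsing; the file still has to be
# downloaded from S3, but the other columns are never materialised
pdf = pd.read_csv(path, usecols=['tip_amount'])
ddf = dd.read_csv(path, usecols=['tip_amount'])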
