Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public.

df = dd.read_csv('s3://mybucket/some-big.csv',  storage_options = {'anon':False})

This is not recommended for obvious reasons. How do I load the data from S3 securely?

The backend which loads the data from s3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.

The short answer is, there are a number of ways of providing S3 credentials, some of which are automatic (a credentials file in the right place, environment variables that are accessible to all workers, or the cluster metadata service).
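As a minimal sketch of the environment-variable route (the variable names below are boto3's standard ones; on a distributed cluster they must be set in every worker's environment, not just on the client):

import os
import dask.dataframe as dd

# Standard boto3/botocore environment variables; s3fs picks these up automatically.
# On a distributed cluster, set them in each worker's environment as well.
os.environ['AWS_ACCESS_KEY_ID'] = '<your access key>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your secret key>'

df = dd.read_csv('s3://mybucket/some-big.csv')  # no storage_options needed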

Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

df = dd.read_csv('s3://mybucket/some-big.csv',  storage_options = {'key': mykey, 'secret': mysecret})

The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
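For example, a sketch of a few commonly used entries (the parameter names are s3fs's; confirm them against your installed version's API docs):

storage_options = {
    'key': mykey,
    'secret': mysecret,
    'token': mytoken,                                # session token for temporary (STS) credentials
    'client_kwargs': {'region_name': 'us-east-1'},   # passed through to the underlying boto3 client
}
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options=storage_options)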

General reference: http://docs.dask.org/en/latest/remote-data-services.html

If you're within your virtual private cloud (VPC), S3 will likely already be credentialed and you can read the file in without a key:

import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')

If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):

import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)

Full documentation from dask can be found here

Under the hood, Dask uses boto3, so you can set up your keys in pretty much all the ways boto3 supports, e.g. role-based access via export AWS_PROFILE=xxxx, or explicitly exporting the access key and secret via environment variables. I would advise against hard-coding your keys, lest you accidentally expose your code to the public.

$ export AWS_PROFILE=your_aws_cli_profile_name

or

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
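For instance, a sketch assuming the profile has already been configured with the AWS CLI (the profile entry in storage_options exists in recent s3fs releases; older versions may not accept it):

import dask.dataframe as dd

# With AWS_PROFILE exported (or a [default] profile in ~/.aws/credentials),
# no explicit keys are needed; s3fs/boto3 resolves the credentials automatically.
df = dd.read_csv('s3://mybucket/some-big.csv')

# Alternatively, recent s3fs releases let you name the profile explicitly:
# df = dd.read_csv('s3://mybucket/some-big.csv',
#                  storage_options={'profile': 'your_aws_cli_profile_name'})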

For S3, you can use wildcard matching to read multiple chunked files:

import dask.dataframe as dd

# Given N csv files located in S3, read them all and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

print(df.head())
print(len(df))
