Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public.

df = dd.read_csv('s3://mybucket/some-big.csv',  storage_options = {'anon':False})

This is not recommended for obvious reasons. How do I load the data from S3 securely?

The backend which loads the data from s3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.

The short answer is, there are a number of ways of providing S3 credentials, some of which are automatic (a credentials file in the right place, environment variables that are accessible to all workers, or the cluster metadata service).
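As a minimal sketch of the environment-variable route (the variable names below are boto3's standard ones; on a distributed cluster they must be set in every worker's environment, not just on the client):

import os
import dask.dataframe as dd

# Standard boto3/botocore environment variables; s3fs picks these up automatically.
# On a distributed cluster, set them in each worker's environment as well.
os.environ['AWS_ACCESS_KEY_ID'] = '<your access key>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your secret key>'

df = dd.read_csv('s3://mybucket/some-big.csv')  # no storage_options needed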

Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

df = dd.read_csv('s3://mybucket/some-big.csv',  storage_options = {'key': mykey, 'secret': mysecret})

The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
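For example, a sketch of a few commonly used entries (the parameter names are s3fs's; confirm them against your installed version's API docs):

storage_options = {
    'key': mykey,
    'secret': mysecret,
    'token': mytoken,                                # session token for temporary (STS) credentials
    'client_kwargs': {'region_name': 'us-east-1'},   # passed through to the underlying boto3 client
}
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options=storage_options)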

General reference: http://docs.dask.org/en/latest/remote-data-services.html

If you're within your virtual private cloud (VPC), S3 will likely already be credentialed and you can read the file in without a key:

import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')

If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):

import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)

Full documentation from dask can be found here

Under the hood, Dask uses boto3, so you can set up your keys in pretty much all the ways boto3 supports, e.g. role-based access via export AWS_PROFILE=xxxx, or explicitly exporting the access key and secret via environment variables. I would advise against hard-coding your keys, lest you accidentally expose your code to the public.

$ export AWS_PROFILE=your_aws_cli_profile_name

or

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
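For instance, a sketch assuming the profile has already been configured with the AWS CLI (the profile entry in storage_options exists in recent s3fs releases; older versions may not accept it):

import dask.dataframe as dd

# With AWS_PROFILE exported (or a [default] profile in ~/.aws/credentials),
# no explicit keys are needed; s3fs/boto3 resolves the credentials automatically.
df = dd.read_csv('s3://mybucket/some-big.csv')

# Alternatively, recent s3fs releases let you name the profile explicitly:
# df = dd.read_csv('s3://mybucket/some-big.csv',
#                  storage_options={'profile': 'your_aws_cli_profile_name'})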

For S3, you can use wildcard matching to read multiple chunked files:

import dask.dataframe as dd

# Given N csv files located in S3, read them all and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

print(df.head())
print(len(df))
