Load Pandas Dataframe to S3 passing s3_additional_kwargs

Please excuse my ignorance / lack of knowledge in this area!

I'm looking to upload a dataframe to S3, but I need to pass 'ACL': 'bucket-owner-full-control'.

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False, s3_additional_kwargs={'ACL': 'bucket-owner-full-control'})
df = pd.DataFrame()
df['test'] = [1,2,3]
df.head()

df.to_parquet('s3://path/to/file/df.parquet', compression='gzip')

I have managed to get around this by converting the dataframe to a PyArrow table and then writing it like this:

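import pyarrow as pa
import pyarrow.parquet as pq

# Convert the dataframe to an Arrow table, then write it through the s3fs filesystem
table = pa.Table.from_pandas(df)

pq.write_to_dataset(table=table,
                    root_path='s3://path/to/file/',
                    filesystem=fs)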

But this feels hacky, and I feel there must be a way to pass the ACL in the first example.

You can do this:

df.to_parquet('name.parquet', storage_options={"key": xxxxx, "secret": gcp_secret_access_key, 'xxxxx': {'ACL': 'bucket-owner-full-control'}})

With Pandas 1.2.0 and later, there is a storage_options argument, as mentioned here.

If you are stuck with Pandas < 1.2.0 (1.1.3 in my case), this trick did help:

storage_options = dict(anon=False, s3_additional_kwargs=dict(ACL="bucket-owner-full-control"))

import s3fs
fs = s3fs.S3FileSystem(**storage_options)
df.to_parquet('s3://foo/bar.parquet', filesystem=fs)
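For writers that do not accept a filesystem argument, one possible variation on the same trick is to open the target through the s3fs filesystem yourself and hand the file object to Pandas. This is only a sketch: the helper name write_csv_via_filesystem is hypothetical, and it assumes fs is an s3fs.S3FileSystem created with the s3_additional_kwargs shown above.

```python
def write_csv_via_filesystem(df, fs, path, **to_csv_kwargs):
    """Write df as CSV through an already-configured filesystem object.

    Hypothetical helper: `fs` is assumed to be an s3fs.S3FileSystem built
    with s3_additional_kwargs, so the upload is made with those settings.
    """
    # Open the object in text mode and let pandas write into the handle.
    with fs.open(path, "w") as handle:
        df.to_csv(handle, **to_csv_kwargs)


# Example usage (needs pandas, s3fs and valid AWS credentials):
# import s3fs
# fs = s3fs.S3FileSystem(anon=False,
#                        s3_additional_kwargs={"ACL": "bucket-owner-full-control"})
# write_csv_via_filesystem(df, fs, "s3://foo/bar.csv", index=False)
```

Because Pandas only sees an open file handle here, this does not depend on storage_options being available.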

As mentioned before, with Pandas 1.2.0 there is a storage_options argument to most writer functions (to_csv, to_parquet, etc.). To set the ACL when writing to S3 (in this case the file system backend that is used is s3fs) you can use this example:

ACL = dict(storage_options=dict(s3_additional_kwargs=dict(ACL='bucket-owner-full-control')))

import pandas as pd
df = pd.DataFrame({"column": [1,2,3,4]})
df.to_parquet("s3://bucket/file.parquet", **ACL)
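If the same ACL is needed in several places, the storage_options dictionary can be built by a small helper. The following is a minimal sketch, assuming Pandas 1.2.0 or later with s3fs installed; the names acl_storage_options and write_parquet_with_acl are hypothetical and not part of Pandas or s3fs.

```python
def acl_storage_options(acl="bucket-owner-full-control", **s3fs_kwargs):
    """Build a storage_options dict that asks s3fs to send the given ACL.

    Any extra keyword arguments (e.g. anon=False) are passed through
    unchanged; an existing s3_additional_kwargs entry is merged, not replaced.
    """
    options = dict(s3fs_kwargs)
    extra = dict(options.get("s3_additional_kwargs", {}))
    extra["ACL"] = acl
    options["s3_additional_kwargs"] = extra
    return options


def write_parquet_with_acl(df, url, acl="bucket-owner-full-control"):
    # Hypothetical wrapper: pandas forwards storage_options to the
    # filesystem it creates for the URL (s3fs for s3:// URLs).
    df.to_parquet(url, storage_options=acl_storage_options(acl))


# Example usage (needs pandas >= 1.2.0, s3fs and valid AWS credentials):
# import pandas as pd
# df = pd.DataFrame({"column": [1, 2, 3, 4]})
# write_parquet_with_acl(df, "s3://bucket/file.parquet")
```

The same dictionary should work with the other writers that accept storage_options, such as to_csv.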
