
Using fsspec with the pandas.DataFrame.to_csv command

I want to write a CSV file from a pandas DataFrame to a remote machine, connecting via SFTP/SSH. Does anybody know how to add the "storage_options" parameter correctly?

The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly which keys it should contain.

hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})

Every time I get ValueError: storage_options passed with file object or non-fsspec file path

What am I doing wrong?
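
That error is raised whenever storage_options is passed together with a plain local path or an already-open file object: 'hits20.tsv' has no protocol prefix, so pandas never routes it through fsspec. The path needs a URL scheme such as sftp://. A minimal sketch, assuming an SFTP server at a placeholder host with placeholder credentials (fsspec's SFTP backend forwards these keyword arguments to paramiko):

import pandas as pd

hits_df = pd.DataFrame({"hit": [1, 2]})  # stand-in for the real frame

hits_df.to_csv(
    "sftp://example.com/tmp/hits20.tsv",  # the sftp:// prefix routes the write through fsspec
    compression="gzip",
    index=False,
    chunksize=1000000,
    storage_options={"username": "user", "password": "secret"},
)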

If you do not have cloud storage access, you can access public data by specifying an anonymous connection like this:

pd.read_csv('name',<other fields>, storage_options={"anon": True})

Otherwise, pass storage_options in dict format; you will get the name and key from your cloud provider (Amazon S3, Google Cloud, Azure, etc.):

pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})

You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
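
For instance, a sketch of that experiment with placeholder host and credentials; whatever keyword set makes the connection work here is exactly what belongs in storage_options:

import fsspec

# Placeholders throughout; SFTPFileSystem hands these kwargs to
# paramiko.SSHClient.connect, so options like key_filename work too.
fs = fsspec.filesystem("sftp", host="example.com",
                       username="user", password="secret")
print(fs.ls("/tmp"))  # quick connectivity check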

If you have things working via the filesystem class, you can use the alternative route:

import fsspec
import pandas as pd

fs = fsspec.implementations.sftp.SFTPFileSystem(...)
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
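
The same route works for writing, which is what the question needs; a sketch, assuming pandas 1.2+ (which accepts binary file handles in to_csv) and the connected fs object from above:

with fs.open("/tmp/hits20.tsv", "wb") as f:  # path on the remote host
    df.to_csv(f, index=False)  # df: any DataFrame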

pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (as well as other backends such as (S)FTP, SSH, or HDFS). In particular, s3fs is very handy for doing simple file operations in S3, because boto is often quite subtly complex to use.

The storage_options argument allows you to expose s3fs arguments to pandas.

You can specify AWS credentials manually using storage_options, which takes a dict. An example below:

import os

import pandas as pd

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
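
Reading the file back works symmetrically with the same credentials dict (key here is the object key within the bucket, as in the snippet above):

df = pd.read_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)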
