
Passing a Paramiko connection SFTPFile as input to dask.dataframe.read_parquet

I passed a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet, and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I tried to run and the error I get. How can I make this work?

import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# the ssh.connect(...) call is not shown in the original post
sftp_client = ssh.open_sftp()
source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))
Traceback (most recent call last):
  File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
    full_df = dd.read_parquet(source_file,engine='pyarrow')
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
    storage_options=storage_options
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
    raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>

Dask does not support file-like objects directly.

You would have to implement their "file system" interface.

I'm not sure what the minimal set of methods is that you need to implement for read_parquet to work. But you definitely have to implement open. Something like this:

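import dask.bytes.core
import dask.dataframe as dd

class SftpFileSystem(object):
    # sftp_client is the paramiko SFTPClient opened earlier
    def open(self, path, mode='rb', **kwargs):
        return sftp_client.open(path, mode)

# Register the class for the 'sftp' protocol (private registry of older Dask versions)
dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')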

There's actually an implementation of such a file system for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem

See also Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?


Obligatory warning: Do not use AutoAddPolicy. You are losing protection against MITM attacks by doing so. For a correct solution, see Paramiko "Unknown Server".

The situation has changed, and you can now do this directly with Dask. Pasted answer from Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

In the master version of Dask, file-system operations now use fsspec which, along with the previous implementations (s3, gcs, hdfs), now supports some additional file systems; see the mapping to protocol identifiers in fsspec.registry.known_implementations.
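To check which protocols a given fsspec installation knows about, the registry mentioned above can be inspected directly. This is a minimal sketch, assuming fsspec is installed and that each registry entry is a dict with a "class" key holding the dotted path of the implementing class; the choice of protocols to look up is only illustrative.

```python
from fsspec.registry import known_implementations

# Each entry maps a protocol name to metadata about its implementing class.
for protocol in ("sftp", "ssh"):
    entry = known_implementations.get(protocol)
    # Assumption: entries carry a "class" key with the dotted import path.
    print(protocol, "->", entry["class"] if entry else "not registered")
```

Looking up the entry does not import the implementation itself, so this works even when paramiko is not installed.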

In short, a URL like "sftp://user:pw@host:port/path" should now work for you if you install fsspec and Dask from master.
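As a rough sketch of the URL-based approach, the helper below builds an sftp:// URL and passes the credentials through storage_options rather than embedding them in the URL. The function names, host, path and credentials are hypothetical, and it assumes that Dask forwards storage_options to fsspec's SFTPFileSystem, which in turn passes them to paramiko's connect call.

```python
def sftp_read_args(host, path, username, password, port=22):
    """Return (url, storage_options) for an fsspec-backed reader.

    Credentials go into storage_options instead of the URL, so special
    characters in the password need no URL escaping.
    """
    if not path.startswith("/"):
        path = "/" + path
    url = f"sftp://{host}{path}"
    storage_options = {"username": username, "password": password, "port": port}
    return url, storage_options


def read_remote_parquet(host, path, username, password, port=22):
    """Lazily read a remote Parquet file over SFTP into a Dask DataFrame."""
    # Imported here so the helper above stays usable without Dask installed.
    import dask.dataframe as dd

    url, storage_options = sftp_read_args(host, path, username, password, port)
    # Assumption: these keyword arguments reach paramiko's SSHClient.connect
    # via fsspec's SFTPFileSystem.
    return dd.read_parquet(url, engine="pyarrow", storage_options=storage_options)
```

Calling read_remote_parquet("example.com", "/data/file.parquet", "alice", "secret") would then return a lazy Dask DataFrame, provided dask, fsspec, paramiko and pyarrow are installed and the server is reachable.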


 