Passing a Paramiko connection SFTPFile as input to dask.dataframe.read_parquet
I passed a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet, and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I tried to run and the error I get. How can I make this work?
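import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
# Set the host key policy before connecting (the ssh.connect(...) call is not shown in the post)
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
sftp_client = ssh.open_sftp()
# parquet_file holds the remote path and is defined elsewhere in the script
source_file = sftp_client.open(str(parquet_file), 'rb')
# This call raises the TypeError shown below
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))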
Traceback (most recent call last):
  File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
    full_df = dd.read_parquet(source_file,engine='pyarrow')
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
    storage_options=storage_options
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
    raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>
Dask does not support file-like objects directly. You would have to implement its "file system" interface. I'm not sure what the minimal set of methods is that you need to implement to allow read_parquet, but you definitely have to implement open. Something like this:
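import dask.bytes.core

class SftpFileSystem(object):
    # Only open() is sketched here; read_parquet may need further methods
    def open(self, path, mode='rb', **kwargs):
        # sftp_client is the Paramiko SFTP client from the question
        return sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem
df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')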
There's actually an implementation of such a file system for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem
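A minimal sketch of using that fsspec implementation directly, assuming fsspec, paramiko, pandas and pyarrow are installed. The function name read_parquet_over_sftp and its parameters are hypothetical, chosen only for illustration; the extra keyword arguments (for example username and password) are assumed to be forwarded to the underlying SSH connection. Note that this returns a pandas DataFrame, not a Dask one.

```python
def read_parquet_over_sftp(host, remote_path, **ssh_kwargs):
    """Open a remote Parquet file through fsspec's SFTP file system and load it with pandas.

    Hypothetical helper for illustration; not part of fsspec, pandas or Dask.
    """
    # Third-party imports are kept local so that merely importing this
    # module does not require them to be installed.
    import fsspec
    import pandas as pd

    # "sftp" is the protocol name under which fsspec exposes SFTPFileSystem.
    fs = fsspec.filesystem("sftp", host=host, **ssh_kwargs)

    # The file object is closed again once pandas has read the data.
    with fs.open(remote_path, "rb") as remote_file:
        return pd.read_parquet(remote_file, engine="pyarrow")


# Example call (placeholder host, path and credentials):
# df = read_parquet_over_sftp("host", "/remote/path/file", username="user", password="pw")
```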
See also Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?
Obligatory warning: Do not use AutoAddPolicy. You are losing protection against MITM attacks by doing so. For a correct solution, see Paramiko "Unknown Server".
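One possible way to avoid AutoAddPolicy, sketched under the assumption that the server's key is already recorded in your known_hosts file: load the known keys and reject any server that is not among them. The function name open_verified_sftp and its parameters are hypothetical; this is not necessarily the approach described in the linked answer.

```python
def open_verified_sftp(host, username, **connect_kwargs):
    """Connect over SSH, refusing servers whose host key is not already known.

    Hypothetical helper for illustration. Returns the SSH client together
    with the SFTP client so the caller can close both when done.
    """
    import paramiko  # imported locally; requires the paramiko package

    ssh = paramiko.SSHClient()
    # Trust only the keys already recorded in the user's known_hosts file.
    ssh.load_system_host_keys()
    # RejectPolicy raises an SSHException for an unknown host instead of
    # silently accepting its key.
    ssh.set_missing_host_key_policy(paramiko.RejectPolicy())
    ssh.connect(host, username=username, **connect_kwargs)
    return ssh, ssh.open_sftp()


# Example call (placeholder host and credentials):
# ssh, sftp_client = open_verified_sftp("host", "user", password="pw")
```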
The situation has changed, and you can now do this directly with Dask. Pasted answer from Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?
In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), now supports some additional file systems; see the mapping to protocol identifiers in fsspec.registry.known_implementations.
In short, using a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
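A rough sketch of that URL-based call, assuming a Dask version that uses fsspec, plus paramiko and pyarrow. The helper names sftp_url and read_remote_parquet are hypothetical. Instead of embedding the credentials in the URL, this variant passes them through storage_options, on the assumption that fsspec hands those keywords on to the SSH connection; that keeps passwords with special characters out of the URL.

```python
def sftp_url(host, path, port=None):
    """Build an sftp:// URL without embedded credentials.

    The path always ends up starting with "/" in the URL, so it is
    treated as an absolute path.
    """
    netloc = host if port is None else f"{host}:{port}"
    # Ensure exactly one slash between the host part and the remote path.
    return f"sftp://{netloc}/{path.lstrip('/')}"


def read_remote_parquet(host, path, username, password, port=None):
    """Lazily read a remote Parquet file with Dask over SFTP (hypothetical helper)."""
    import dask.dataframe as dd  # needs dask, fsspec, paramiko and pyarrow

    # storage_options is forwarded to the fsspec file system; the keys used
    # here are assumed to be accepted by its SFTP implementation.
    return dd.read_parquet(
        sftp_url(host, path, port),
        engine="pyarrow",
        storage_options={"username": username, "password": password},
    )


# Example call (placeholder host, path and credentials):
# full_df = read_remote_parquet("host", "/remote/path/file", "user", "pw", port=22)
```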