Access HDF files stored on S3 in pandas

I'm storing pandas data frames dumped in HDF format on S3. I'm pretty much stuck, as I can't pass a file pointer, a URL, an s3 URL, or a StringIO object to read_hdf. If I understand it correctly, the file must be present on the filesystem.

Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315

It looks like remote reading is implemented for CSV but not for HDF. Is there any better way to open those HDF files than copying them to the filesystem?

For the record, these HDF files are handled on a web server, which is why I don't want a local copy.

If I need to stick with a local file: is there any way to emulate that file on the filesystem (with a real path) so that it can be destroyed after the reading is done?

I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.

Newer versions of pandas allow reading an HDF5 file directly from S3, as mentioned in the read_hdf documentation. Perhaps you should upgrade pandas if you can. This of course assumes you've set the right access rights to read those files: either with a credentials file or with public ACLs.
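If upgrading is not possible, the temporary-file route raised in the question can be sketched roughly as below. This is only a minimal sketch using the standard library: the helper name `temporary_file_with` is hypothetical, and it assumes the object's bytes have already been downloaded from S3 and fit in memory.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_file_with(data, suffix=".h5"):
    """Write `data` (bytes) to a real file, yield its path, delete it on exit."""
    # mkstemp creates the file and returns an open descriptor plus its path.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Close the handle before yielding so another reader can open the path.
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Remove the file even if the caller's block raised an exception.
        os.remove(path)


# Hypothetical usage, assuming boto3 and pandas are installed and that
# "my-bucket" / "frames/df.h5" are placeholders for a real bucket and key:
#
#   body = boto3.client("s3").get_object(
#       Bucket="my-bucket", Key="frames/df.h5")["Body"].read()
#   with temporary_file_with(body) as path:
#       df = pandas.read_hdf(path)
```

The file exists only for the duration of the `with` block, so nothing is left on the web server once the data frame has been loaded.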

Regarding your last comment, I am not sure why storing several HDF5 files per df would necessarily argue against using HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially improve on this.
