
How to dump a file to a Hadoop HDFS directory using Python pickle?

I am on a VM in a directory that contains my Python (2.7) class. I am trying to pickle an instance of my class to a directory in my HDFS.

I'm trying to do something along the lines of:

import pickle

my_obj = MyClass() # the class instance that I want to pickle

with open('hdfs://domain.example.com/path/to/directory/') as hdfs_loc:
    pickle.dump(my_obj, hdfs_loc)

From what research I've done, I think something like snakebite might be able to help...but does anyone have more concrete suggestions?

If you use PySpark, then you can use the saveAsPickleFile method:

temp_rdd = sc.parallelize([my_obj])  # wrap the single instance in a list so Spark can distribute it
temp_rdd.coalesce(1).saveAsPickleFile("/test/tmp/data/destination.pickle")
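
To load the object back later, the saved RDD can be read with SparkContext.pickleFile; a minimal sketch using the same path as above:

restored_rdd = sc.pickleFile("/test/tmp/data/destination.pickle")  # read the pickled RDD back
my_obj_restored = restored_rdd.first()  # the single instance we stored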

Here is a workaround if you are running in a Jupyter notebook with sufficient permissions:

import pickle

my_obj = MyClass() # the class instance that I want to pickle
local_filename = "pickle.p"
hdfs_loc = "//domain.example.com/path/to/directory/"
# write the pickle to a local file first
with open(local_filename, 'wb') as f:
    pickle.dump(my_obj, f)

# then copy the local file into HDFS with the hdfs CLI (Jupyter shell escape)
!hdfs dfs -copyFromLocal $local_filename $hdfs_loc
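
If you are not inside a notebook, the same two-step approach can be scripted by shelling out to the hdfs CLI with subprocess; a sketch assuming the hdfs binary is on the PATH (the -f flag overwrites an existing destination file):

import pickle
import subprocess

my_obj = MyClass()  # the class instance that I want to pickle
local_filename = "pickle.p"
hdfs_loc = "//domain.example.com/path/to/directory/"

# pickle to a local file, then let the hdfs CLI do the upload
with open(local_filename, 'wb') as f:
    pickle.dump(my_obj, f)

subprocess.check_call(
    ["hdfs", "dfs", "-copyFromLocal", "-f", local_filename, hdfs_loc])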

You can dump a pickled object to HDFS with PyArrow:

import pickle
import pyarrow as pa

my_obj = MyClass() # the class instance that I want to pickle

hdfs = pa.hdfs.connect()  # connects to the default namenode from the local Hadoop configuration
with hdfs.open('hdfs://domain.example.com/path/to/directory/filename.pkl', 'wb') as hdfs_file:
    pickle.dump(my_obj, hdfs_file)
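
Note that pa.hdfs.connect() is the legacy API and is deprecated in newer PyArrow releases; the pyarrow.fs interface does the same job. A minimal sketch, assuming the namenode host and port shown here (adjust them for your cluster):

import pickle
from pyarrow import fs

my_obj = MyClass()  # the class instance that I want to pickle

# host and port are placeholders -- point them at your namenode
hdfs = fs.HadoopFileSystem("domain.example.com", 8020)

# open_output_stream returns a writable file-like object that pickle can use
with hdfs.open_output_stream("/path/to/directory/filename.pkl") as out:
    pickle.dump(my_obj, out)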
