I am trying to serialize a Spark RDD by pickling it, and read the pickled file directly into Python.
a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')
I then copy the test_pkl files to my local machine. How can I read them directly into Python? When I try the normal pickle package, it fails when I attempt to read the first pickle part of 'test_pkl':
pickle.load(open('part-00000','rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "/usr/lib64/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
    raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
I assume that the pickling method Spark uses is different from the Python pickle method (correct me if I am wrong). Is there any way for me to pickle data from Spark and read this pickled object directly into Python from the file?
It is possible using the sparkpickle project. It's as simple as:
import sparkpickle

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
A better method might be to pickle the data in each partition, encode it, and write it to a text file:
import cPickle
import base64

def partition_to_encoded_pickle_object(partition):
    p = [i for i in partition]        # convert the RDD partition to a list
    p = cPickle.dumps(p, protocol=2)  # pickle the list
    return [base64.b64encode(p)]      # base64-encode the pickle, returned in an iterable

my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")
After you download the file(s) to your local directory, you can use the following code segment to read it in:
# you first need to download the files; that step is not shown
# afterwards, you can use
import os
import cPickle
import base64

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if part[0] != "_":  # skip system-generated files such as "_SUCCESS"
        with open(os.path.join(path, part), 'rb') as f:
            data += cPickle.loads(base64.b64decode(f.read()))
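The encode/decode round trip above can be sanity-checked locally without Spark by treating each partition as a plain iterable. A minimal sketch, using Python 3's pickle in place of cPickle; the two-partition split of the input is made up for illustration:

```python
import base64
import pickle

def partition_to_encoded_pickle_object(partition):
    p = list(partition)                    # materialize the partition as a list
    p = pickle.dumps(p, protocol=2)        # pickle the list
    return [base64.b64encode(p).decode("ascii")]  # one base64 text line per partition

# simulate an RDD split into two partitions, as mapPartitions would see them
lines = []
for part in (['1', '2', '3'], ['4', '5']):
    lines.extend(partition_to_encoded_pickle_object(part))

# decode the way the local reader does, concatenating partition lists
data = []
for line in lines:
    data += pickle.loads(base64.b64decode(line))

print(data)  # ['1', '2', '3', '4', '5']
```

Each partition becomes a single base64 line in the text file, so the local reader only has to decode each line and concatenate the resulting lists.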
The problem is that the format isn't a pickle file; it is a SequenceFile of pickled objects. The SequenceFile can be opened within Hadoop and Spark environments, but it isn't meant to be consumed directly in Python: the container uses JVM-based serialization to wrap what is, in this case, a list of pickled strings. Within Spark itself you can read it back with sc.pickleFile('test_pkl').