I have a Python Spark program that creates features from raw data and stores them in a pickle file using the `saveAsPickleFile` method. I could also use the `saveAsTextFile` method.

The other program is written in Java and implements a classifier using ML.

Is it possible to read the serialized pickle file into an RDD in Java?
`saveAsPickleFile` uses the standard `pickle` module. It is possible to read objects serialized with `pickle`, for example using Jython's `pickle`, but it is far from straightforward.

`saveAsTextFile` creates a plain text file, so there is no reason why it couldn't be loaded in Java. The problem is that you still have to parse the content. The PySpark version of `saveAsTextFile` simply calls `unicode`, which doesn't have to return any meaningful representation. If you want something that can be loaded easily, it is a good idea to create the string representation manually.

For key-value data the simplest thing is to use `saveAsSequenceFile` / `sequenceFile`:
```python
rdd = sc.parallelize([(1L, "foo"), (2L, "bar")])
rdd.saveAsSequenceFile("pairs")
```
```scala
sc.sequenceFile[Long, String]("pairs").collect()
// Array[(Long, String)] = Array((2,bar), (1,foo))
```
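If you stick with `saveAsTextFile` instead, the "create the string representation manually" advice can be as simple as an explicit tab-separated format that is trivial to split on the Java side. A minimal sketch without Spark (the function names and the `pairs_txt` path are illustrative; in Spark the formatting step would be `rdd.map(format_pair).saveAsTextFile("pairs_txt")`):

```python
def format_pair(pair):
    """Render a (key, value) pair as a single tab-separated line."""
    key, value = pair
    return "%d\t%s" % (key, value)

def parse_pair(line):
    """Invert format_pair; the same split-on-first-tab logic ports directly to Java."""
    key, value = line.split("\t", 1)
    return (int(key), value)

# Round-trip check: what Python writes, a parser on the other side recovers.
line = format_pair((1, "foo"))
assert line == "1\tfoo"
assert parse_pair(line) == (1, "foo")
```

The point is that the format is defined by you, not by `unicode`, so the Java consumer only needs `line.split("\t", 2)` and a `Long.parseLong`.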
If you have more complex data you can use Parquet files:
```python
from pyspark.mllib.linalg import DenseVector

rdd = sc.parallelize([
    (1L, DenseVector([1, 2])),
    (2L, DenseVector([3, 4]))])
rdd.toDF().write.parquet("pairs_parquet")
```
```scala
sqlContext.read.parquet("pairs_parquet").rdd.collect()
// Array[org.apache.spark.sql.Row] = Array([2,[3.0,4.0]], [1,[1.0,2.0]])
```
Avro or even simple JSON could be a viable solution as well.
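For the JSON route, one common layout is one JSON document per line, which any Java JSON library (Jackson, Gson, etc.) can consume. A minimal Spark-free sketch (the `id`/`features` field names are illustrative; in Spark the mapping step would feed `saveAsTextFile`):

```python
import json

records = [(1, [1.0, 2.0]), (2, [3.0, 4.0])]

# In Spark this mapping would be applied per element, e.g.:
#   rdd.map(lambda kv: json.dumps({"id": kv[0], "features": kv[1]}))
lines = [json.dumps({"id": k, "features": v}) for k, v in records]

# Any JSON-aware consumer can recover the structure.
parsed = [json.loads(line) for line in lines]
assert parsed[0] == {"id": 1, "features": [1.0, 2.0]}
assert parsed[1] == {"id": 2, "features": [3.0, 4.0]}
```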