
How can I read an RDD in my Java program from a file that was created by a Python program

I have a Python Spark program that creates features from raw data and stores them in a pickle file using the saveAsPickleFile method. I can also use the saveAsTextFile method.

The other program is written in Java and implements a classifier using ML.

Is it possible to read the serialized pickle file into an RDD in Java?

  • saveAsPickleFile uses the standard pickle module. It is possible to read objects serialized with pickle, for example using Jython's pickle, but it is far from straightforward
  • saveAsTextFile creates plain text files. There is no reason why they couldn't be loaded in Java; the problem is that you still have to parse the content. The PySpark version of saveAsTextFile simply calls unicode on each element, which doesn't have to return any meaningful representation. If you want something that can be loaded easily, it is a good idea to create the string representation manually (a Java parsing sketch follows this list)
  • for key-value data the simplest thing is to use saveAsSequenceFile / sequenceFile (a Java version of the reading side is sketched after this list):

     rdd = sc.parallelize([(1L, "foo"), (2L, "bar")])
     rdd.saveAsSequenceFile("pairs")

     sc.sequenceFile[Long, String]("pairs").collect()
     // Array[(Long, String)] = Array((2,bar), (1,foo))
  • if you have more complex data you can use Parquet files (a Java reading sketch also follows this list):

     from pyspark.mllib.linalg import DenseVector

     rdd = sc.parallelize([
         (1L, DenseVector([1, 2])),
         (2L, DenseVector([3, 4]))])
     rdd.toDF().write.parquet("pairs_parquet")

     sqlContext.read.parquet("pairs_parquet").rdd.collect()
     // Array[org.apache.spark.sql.Row] = Array([2,[3.0,4.0]], [1,[1.0,2.0]])
  • Avro or even simple JSON could be a viable solution as well.
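
For the text-file route, this is a minimal Java sketch of the reading side. It assumes the Python job built one tab-separated "key\tvalue" line per record before calling saveAsTextFile; the path "pairs_text" and the line format are assumptions for illustration, not part of the original answer.

     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaPairRDD;
     import org.apache.spark.api.java.JavaRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import scala.Tuple2;

     public class ReadTextPairs {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setAppName("ReadTextPairs");
             JavaSparkContext sc = new JavaSparkContext(conf);

             // Assumes each line looks like "1\tfoo", i.e. a representation the
             // Python job built explicitly rather than the default unicode() output.
             JavaRDD<String> lines = sc.textFile("pairs_text");

             JavaPairRDD<Long, String> pairs = lines.mapToPair(line -> {
                 String[] parts = line.split("\t", 2);
                 return new Tuple2<>(Long.parseLong(parts[0]), parts[1]);
             });

             for (Tuple2<Long, String> pair : pairs.collect()) {
                 System.out.println(pair);
             }
             sc.stop();
         }
     }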
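
For the sequence-file case, the Java counterpart of the Scala snippet above could look roughly like this. It assumes the keys and values were written as LongWritable / Text, which is what PySpark normally produces for long / str pairs; the exact Writable classes depend on how the data was serialized.

     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaPairRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import scala.Tuple2;

     public class ReadPairsSequenceFile {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setAppName("ReadPairsSequenceFile");
             JavaSparkContext sc = new JavaSparkContext(conf);

             // Read the raw Hadoop Writables written by the PySpark job.
             JavaPairRDD<LongWritable, Text> raw =
                     sc.sequenceFile("pairs", LongWritable.class, Text.class);

             // Copy them into plain Java types; Hadoop reuses Writable objects,
             // so they should not be collected or cached directly.
             JavaPairRDD<Long, String> pairs = raw.mapToPair(
                     t -> new Tuple2<>(t._1().get(), t._2().toString()));

             for (Tuple2<Long, String> pair : pairs.collect()) {
                 System.out.println(pair);
             }
             sc.stop();
         }
     }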
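
For the Parquet case, a Java sketch against the Spark 1.x SQLContext API (Spark 2.x would use SparkSession instead) might look like the following. It assumes the columns keep the default _1 / _2 names and order produced by toDF().

     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import org.apache.spark.sql.DataFrame;
     import org.apache.spark.sql.Row;
     import org.apache.spark.sql.SQLContext;

     public class ReadPairsParquet {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setAppName("ReadPairsParquet");
             JavaSparkContext sc = new JavaSparkContext(conf);
             SQLContext sqlContext = new SQLContext(sc);

             // Load the Parquet directory written by the PySpark job as a DataFrame
             // and, if an RDD is really needed, drop down to an RDD of Rows.
             DataFrame df = sqlContext.read().parquet("pairs_parquet");
             JavaRDD<Row> rows = df.javaRDD();

             for (Row row : rows.collect()) {
                 long id = row.getLong(0);      // column "_1"
                 Object features = row.get(1);  // column "_2", the vector written from Python
                 System.out.println(id + " -> " + features);
             }
             sc.stop();
         }
     }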
