
How can I read an RDD in my Java program from a file that was created by a Python program

I have a Python Spark program that creates features from raw data and stores them in a pickle file using the saveAsPickleFile method. I can also use the saveAsTextFile method.

The other program is written in Java and implements a classifier using ML.

Is it possible to read the serialized pickle file into an RDD in Java?

  • saveAsPickleFile uses the standard pickle module. It is possible to read objects serialized with pickle, for example using Jython's pickle, but it is far from straightforward
  • saveAsTextFile creates plain text files. There is no reason why they couldn't be loaded in Java; the problem is that you still have to parse the content. The PySpark version of saveAsTextFile simply calls unicode on each element, which doesn't have to return any meaningful representation. If you want something that can be loaded easily, it is a good idea to create the string representation manually (a Java parsing sketch follows this list)
  • for key-value data the simplest thing is to use saveAsSequenceFile / sequenceFile (a Java version of the reading side is sketched after this list):

     rdd = sc.parallelize([(1L, "foo"), (2L, "bar")])
     rdd.saveAsSequenceFile("pairs")

     sc.sequenceFile[Long, String]("pairs").collect()
     // Array[(Long, String)] = Array((2,bar), (1,foo))
  • if you have more complex data you can use Parquet files (a Java reading sketch also follows this list):

     from pyspark.mllib.linalg import DenseVector

     rdd = sc.parallelize([
         (1L, DenseVector([1, 2])),
         (2L, DenseVector([3, 4]))])
     rdd.toDF().write.parquet("pairs_parquet")

     sqlContext.read.parquet("pairs_parquet").rdd.collect()
     // Array[org.apache.spark.sql.Row] = Array([2,[3.0,4.0]], [1,[1.0,2.0]])
  • Avro or even simple JSON could be a viable solution as well.
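
For the text-file route, this is a minimal Java sketch of the reading side. It assumes the Python job built one tab-separated "key\tvalue" line per record before calling saveAsTextFile; the path "pairs_text" and the line format are assumptions for illustration, not part of the original answer.

     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaPairRDD;
     import org.apache.spark.api.java.JavaRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import scala.Tuple2;

     public class ReadTextPairs {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setAppName("ReadTextPairs");
             JavaSparkContext sc = new JavaSparkContext(conf);

             // Assumes each line looks like "1\tfoo", i.e. a representation the
             // Python job built explicitly rather than the default unicode() output.
             JavaRDD<String> lines = sc.textFile("pairs_text");

             JavaPairRDD<Long, String> pairs = lines.mapToPair(line -> {
                 String[] parts = line.split("\t", 2);
                 return new Tuple2<>(Long.parseLong(parts[0]), parts[1]);
             });

             for (Tuple2<Long, String> pair : pairs.collect()) {
                 System.out.println(pair);
             }
             sc.stop();
         }
     }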
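
For the sequence-file case, the Java counterpart of the Scala snippet above could look roughly like this. It assumes the keys and values were written as LongWritable / Text, which is what PySpark normally produces for long / str pairs; the exact Writable classes depend on how the data was serialized.

     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaPairRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import scala.Tuple2;

     public class ReadPairsSequenceFile {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setAppName("ReadPairsSequenceFile");
             JavaSparkContext sc = new JavaSparkContext(conf);

             // Read the raw Hadoop Writables written by the PySpark job.
             JavaPairRDD<LongWritable, Text> raw =
                     sc.sequenceFile("pairs", LongWritable.class, Text.class);

             // Copy them into plain Java types; Hadoop reuses Writable objects,
             // so they should not be collected or cached directly.
             JavaPairRDD<Long, String> pairs = raw.mapToPair(
                     t -> new Tuple2<>(t._1().get(), t._2().toString()));

             for (Tuple2<Long, String> pair : pairs.collect()) {
                 System.out.println(pair);
             }
             sc.stop();
         }
     }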
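
For the Parquet case, a Java sketch against the Spark 1.x SQLContext API (Spark 2.x would use SparkSession instead) might look like the following. It assumes the columns keep the default _1 / _2 names and order produced by toDF().

     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import org.apache.spark.sql.DataFrame;
     import org.apache.spark.sql.Row;
     import org.apache.spark.sql.SQLContext;

     public class ReadPairsParquet {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setAppName("ReadPairsParquet");
             JavaSparkContext sc = new JavaSparkContext(conf);
             SQLContext sqlContext = new SQLContext(sc);

             // Load the Parquet directory written by the PySpark job as a DataFrame
             // and, if an RDD is really needed, drop down to an RDD of Rows.
             DataFrame df = sqlContext.read().parquet("pairs_parquet");
             JavaRDD<Row> rows = df.javaRDD();

             for (Row row : rows.collect()) {
                 long id = row.getLong(0);      // column "_1"
                 Object features = row.get(1);  // column "_2", the vector written from Python
                 System.out.println(id + " -> " + features);
             }
             sc.stop();
         }
     }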
