
Collecting the contents of an RDD[(Array[String], Long)] into a single array inside a new RDD: RDD[Array[(Array[String], Long)]]

I have an RDD[Array[String]] on which I call zipWithIndex:

val dataWithIndex = data.zipWithIndex()

Now I have an RDD[(Array[String], Long)]. I would like to put all the pairs in the RDD into one array and still keep that array inside an RDD. Is there an efficient way to do so? My final data structure should be RDD[Array[(Array[String], Long)]], where the RDD essentially contains only one element.

Right now I do the following, but it is very inefficient because collect() pulls the entire dataset onto the driver:

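import scala.collection.mutable.ListBuffer

val dataWithIndex = data.zipWithIndex()
// collect() materializes every pair on the driver
val dataNoRDD = dataWithIndex.collect()
val dataArr = ListBuffer[Array[(Array[String], Long)]]()
dataArr += dataNoRDD
// Redistribute the single-element buffer as an RDD
val initData = sc.parallelize(dataArr)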

In the end, my conclusion is that this seems to be extremely hard to do with Spark's standard functionality.

If the input comes from a Hadoop filesystem, however, it can be done by extending certain Hadoop classes.

First, implement WritableComparable<> to define the custom type that each RDD element will hold. Next, extend FileInputFormat with a custom input format that supports your custom Writable. For that FileInputFormat to know what to do with the data being read, you also have to write a custom RecordReader by extending RecordReader; in particular, its nextKeyValue() method defines what each RDD element will contain. All three of these are written in Java, but with some simple tricks this can be made to work.
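The three pieces described above could look roughly like the following. This is only a minimal sketch, not the code actually used: the class names RowsWritable, WholeSplitRecordReader and WholeFileRowsInputFormat are hypothetical, it is written in Scala against the Hadoop `mapreduce` API rather than in Java, and it assumes a text input where each line is split on commas to produce the Array[String]. The ordering in compareTo is arbitrary and exists only to satisfy the interface.

```scala
import java.io.{DataInput, DataOutput}
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, WritableComparable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

// Hypothetical Writable that holds every indexed row of one input file.
class RowsWritable(var rows: Array[(Array[String], Long)])
    extends WritableComparable[RowsWritable] {

  // Hadoop needs a no-argument constructor to instantiate Writables.
  def this() = this(Array.empty[(Array[String], Long)])

  override def write(out: DataOutput): Unit = {
    out.writeInt(rows.length)
    rows.foreach { case (fields, idx) =>
      out.writeLong(idx)
      out.writeInt(fields.length)
      fields.foreach(f => out.writeUTF(f))
    }
  }

  override def readFields(in: DataInput): Unit = {
    rows = Array.fill(in.readInt()) {
      val idx = in.readLong()
      val fields = Array.fill(in.readInt())(in.readUTF())
      (fields, idx)
    }
  }

  // Arbitrary ordering by row count (an assumption of this sketch).
  override def compareTo(other: RowsWritable): Int =
    Integer.compare(rows.length, other.rows.length)
}

// Reads a whole split line by line and emits it as ONE key/value pair.
class WholeSplitRecordReader extends RecordReader[NullWritable, RowsWritable] {
  private val lines = new LineRecordReader()
  private var value: RowsWritable = _
  private var done = false

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit =
    lines.initialize(split, context)

  // The first call consumes every line; later calls report that nothing is left.
  override def nextKeyValue(): Boolean = {
    if (done) {
      false
    } else {
      val buf = ArrayBuffer.empty[(Array[String], Long)]
      var index = 0L
      while (lines.nextKeyValue()) {
        // Assumed record layout: comma-separated fields on each line.
        buf += ((lines.getCurrentValue.toString.split(","), index))
        index += 1
      }
      value = new RowsWritable(buf.toArray)
      done = true
      true
    }
  }

  override def getCurrentKey(): NullWritable = NullWritable.get()
  override def getCurrentValue(): RowsWritable = value
  override def getProgress(): Float = if (done) 1.0f else 0.0f
  override def close(): Unit = lines.close()
}

// Input format that never splits a file, so each file yields one record.
class WholeFileRowsInputFormat extends FileInputFormat[NullWritable, RowsWritable] {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[NullWritable, RowsWritable] =
    new WholeSplitRecordReader
}
```

With these classes on the classpath, something like `sc.newAPIHadoopFile[NullWritable, RowsWritable, WholeFileRowsInputFormat](path).map(_._2.rows)` should yield an RDD[Array[(Array[String], Long)]] with one element per input file, so a single input file gives a single-element RDD. Note that the index here restarts at 0 for each file, and the whole file still has to fit in the memory of one executor.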

