
How to create an RDD by selecting specific data from an existing RDD, where the output should be of type RDD[String]?

I have a scenario where I need to capture some data (not all) from an existing RDD and then pass it to another Scala class for the actual operations. Consider this example data (empnum, empname, emplocation, empsal) in a text file:

11,John,Paris,1000
12,Daniel,UK,3000 

As a first step, I create an RDD of type RDD[String] with the code below:

val empRDD = spark
  .sparkContext
  .textFile("empInfo.txt")

My requirement is to create another RDD with empnum, empname, emplocation (again as RDD[String]). I tried the code below, but it gives me RDD[(String, String, String)]:

val empReqRDD = empRDD
  .map(a=> a.split(","))
  .map(x=> (x(0), x(1), x(2)))

I have also tried slice, but it gives me RDD[Array[String]]. The RDD I need should be of type RDD[String], so that I can pass it to the Scala class that does the operations.

The expected output should be:

11,John,Paris
12,Daniel,UK

Can anyone help me achieve this?

I would try this:

val empReqRDD = empRDD
  .map(a=> a.split(","))
  .map(x=> (x(0), x(1), x(2)))

val rddString = empReqRDD.map({case(id,name,city) => "%s,%s,%s".format(id,name,city)}) 

In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].

One way to accomplish your objective is to change the second map to construct a string like so:

empRDD
  .map(a=> a.split(","))
  .map(x => s"${x(0)},${x(1)},${x(2)}")

Alternatively, and a bit more concisely, you could take the first 3 elements of the array and use the mkString method:

empRDD.map(_.split(',').take(3).mkString(","))
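To check the projection without a Spark session, the same per-line logic can be tried on an ordinary Scala collection first. This is only a minimal sketch: the helper name firstThreeFields and the in-memory sample are assumptions made for illustration, not part of the original code.

```scala
// Hypothetical helper: keeps the first three comma-separated fields of a line.
def firstThreeFields(line: String): String =
  line.split(',').take(3).mkString(",")

// In-memory stand-in for the contents of empInfo.txt.
val sample = Seq("11,John,Paris,1000", "12,Daniel,UK,3000")

// Same transformation as the RDD version, applied to a plain Seq.
val projected = sample.map(firstThreeFields)

projected.foreach(println)
```

Because the helper is a plain String => String function, it should be usable unchanged as empRDD.map(firstThreeFields), which keeps the result an RDD[String].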

Probably overkill for this use-case, but you could also use a regex to extract the values:

val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
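One caveat with the regex version: a pattern match inside map throws a scala.MatchError for any line that does not match, such as a blank line or one with fewer than three fields. A possible way around that, sketched here on a plain collection with a made-up sample that includes a blank line, is to use collect with the same partial function so that non-matching lines are skipped.

```scala
val r = "([^,]*),([^,]*),([^,]*).*".r

// Assumed sample input; the empty string stands in for a malformed line.
val lines = Seq("11,John,Paris,1000", "", "12,Daniel,UK,3000")

// collect applies the partial function only where it is defined,
// so lines that do not match the regex are dropped instead of failing.
val projected = lines.collect { case r(id, name, city) => s"$id,$name,$city" }

projected.foreach(println)
```

RDD also has a collect overload that takes a partial function, so empRDD.collect { case r(id, name, city) => s"$id,$name,$city" } should behave the same way and still return an RDD[String]. Whether silently dropping bad lines is acceptable depends on the data.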

