
Scala: Read a text file and create tuples from it

How can I create tuples from the RDD below?

// reading a text file "b.txt" and creating RDD 
val rdd = sc.textFile("/home/training/desktop/b.txt") 

Contents of b.txt:

 Ankita,26,BigData,newbie
 Shikha,30,Management,Expert

If you intend to have an Array[Tuple4], you can do the following:

scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24

scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))

Then you can access each field through the tuple accessors (_1 to _4):

scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
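The sample lines start with a space, which ends up in the first tuple element. A minimal sketch of the same split-and-index step with each field trimmed is shown below. It uses a plain Seq in place of the RDD so it runs without Spark; the object name TupleFromLines and the helper toTuple are hypothetical, and the same function could be passed to rdd.map.

```scala
object TupleFromLines {
  // Hypothetical helper: split one comma-separated line into a 4-tuple,
  // trimming surrounding whitespace from every field.
  // Assumes the line has at least four fields; fewer would throw.
  def toTuple(line: String): (String, String, String, String) = {
    val fields = line.split(",").map(_.trim)
    (fields(0), fields(1), fields(2), fields(3))
  }

  // A plain Seq stands in for the RDD[String] returned by sc.textFile.
  val lines: Seq[String] =
    Seq(" Ankita,26,BigData,newbie", " Shikha,30,Management,Expert")

  val tuples: Seq[(String, String, String, String)] = lines.map(toTuple)

  def main(args: Array[String]): Unit =
    // foreach rather than map, because println is only a side effect
    tuples.foreach(t => println(t._3))
}
```

Using foreach here avoids building the Array[Unit] that map(println) returns in the REPL output above.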

Updated

If your input file has a variable number of fields per line, such as

Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big

you can use pattern matching with match/case:

scala> val arrayTuples = rdd.map(line => line.split(",") match {
     | case Array(a, b, c, d) => (a,b,c,d)
     | case Array(a,b,c) => (a,b,c)
     | }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
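Because tuples of different sizes only share the supertype Product with Serializable, accessors such as _3 are no longer available on the result. The sketch below illustrates this on a plain Seq (standing in for the RDD so it runs without Spark); the object name MixedArityTuples is hypothetical.

```scala
object MixedArityTuples {
  val lines: Seq[String] = Seq("Ankita,26,BigData,newbie", "Anita,26,big")

  // Tuple4 and Tuple3 have no common tuple type, so the element type is Product.
  // A line with any other number of fields would throw a MatchError.
  val rows: Seq[Product] = lines.map { line =>
    line.split(",") match {
      case Array(a, b, c, d) => (a, b, c, d)
      case Array(a, b, c)    => (a, b, c)
    }
  }

  // rows.head._3 does not compile; fields are reachable only by position,
  // and they come back typed as Any.
  val thirdFields: Seq[Any] = rows.map(_.productElement(2))
  val arities: Seq[Int]     = rows.map(_.productArity)

  def main(args: Array[String]): Unit =
    rows.foreach(row => println(row.productIterator.mkString("|")))
}
```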

Updated again

As @eliasah pointed out, the approach above is bad practice: the result is typed as Array[Product with Serializable], so the fields have to be read through the product iterator. Following his suggestion, we should know the maximum number of fields in the input data and use the following logic, which assigns default values to missing fields:

import scala.util.Try

val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect

And as @philantrovert pointed out, if we are not using the REPL we can verify the output as follows:

arrayTuples.foreach(println)

which prints

(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
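As an alternative to wrapping each index in Try, the split array can be padded to a fixed length before building the record. This is a different technique from the answer above, shown only as a sketch: the case class Person and its field names are guesses based on the sample data, every field stays a String, and a plain Seq stands in for the RDD so the code runs without Spark.

```scala
object PaddedRecords {
  // Hypothetical record type; the field names are assumptions, not from the question.
  final case class Person(name: String, age: String, area: String, level: String)

  // Split, trim, then pad missing trailing fields with "Empty" so that
  // indices 0 to 3 always exist. Fields beyond the fourth are ignored.
  def parse(line: String): Person = {
    val f = line.split(",").map(_.trim).padTo(4, "Empty")
    Person(f(0), f(1), f(2), f(3))
  }

  val lines: Seq[String] =
    Seq("Ankita,26,BigData,newbie", "Shikha,30,Management,Expert", "Anita,26,big")

  val people: Seq[Person] = lines.map(parse)

  def main(args: Array[String]): Unit =
    people.foreach(println)
}
```

A case class keeps one static type for every row, so fields can be read by name (for example people.map(_.area)) instead of by tuple position.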
