
Spark RDD mapping questions

I have text data as below:

no1      1|3|4
no2      4|5|6

and I want to transform it into the following using a Spark RDD and Scala:

no1      1
no1      3
no1      4
no2      4
no2      5
no2      6

I am very new to Spark and Scala, and I can't find any example that does this.

I recommend reading the file in as a DataFrame; its API will receive more emphasis than the RDD API in future Spark versions. With a DataFrame, the task you are asking about is fairly straightforward using the split and explode functions:

val df = Seq(("no1", "1|3|4"), ("no2", "4|5|6")).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: string]

df.show
+---+-----+
|  A|    B|
+---+-----+
|no1|1|3|4|
|no2|4|5|6|
+---+-----+


df.withColumn("B", explode(split($"B", "\\|"))).show
+---+---+
|  A|  B|
+---+---+
|no1|  1|
|no1|  3|
|no1|  4|
|no2|  4|
|no2|  5|
|no2|  6|
+---+---+
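
If you ultimately need the result as an RDD of tab-separated strings (matching the layout in the question), you can convert the exploded DataFrame back to an RDD. A minimal sketch, assuming the df above in a spark-shell session:

// Explode, then map each Row back to a "key<TAB>value" string
df.withColumn("B", explode(split($"B", "\\|")))
  .rdd
  .map(row => row.getString(0) + "\t" + row.getString(1))
  .collect
  .foreach(println)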

Suppose that you have your input in a file called inputFile.txt.

  1. Read the file:

     >>> inputRDD = sc.textFile("Documents/SparkPractice/inputFile.txt")

  2. The file will be read as:

     >>> inputRDD.collect()

['no1 1|3|4', 'no2 4|5|6']

  3. Now, first split each line, i.e. 'no1 1|3|4' and 'no2 4|5|6', on the space:

     >>> rdd1 = inputRDD.map(lambda x: x.split(' '))
     >>> rdd1.collect()

[['no1', '1|3|4'], ['no2', '4|5|6']]

  4. Now, we need to split '1|3|4' and '4|5|6'. Each element of rdd1 has two items (e.g. 'no1' and '1|3|4', or 'no2' and '4|5|6'). Iterate over each element of rdd1 with a lambda and, using a list comprehension, concatenate x[0]='no1', a space, and each value of x[1].split('|')=['1', '3', '4']. Similarly, for the second element, concatenate x[0]='no2', a space, and each value of x[1].split('|')=['4', '5', '6'].

     >>> rdd2 = rdd1.map(lambda x: [x[0]+' '+y for y in x[1].split('|')])
     >>> rdd2.collect()

[['no1 1', 'no1 3', 'no1 4'], ['no2 4', 'no2 5', 'no2 6']]

  5. Finally, flatten rdd2. flatMap will collapse the nested lists into a single list:

     >>> rdd3 = rdd2.flatMap(lambda x: x)
     >>> rdd3.collect()

['no1 1', 'no1 3', 'no1 4', 'no2 4', 'no2 5', 'no2 6']

  6. You can finally combine all these steps into a single line (a Scala equivalent is sketched after this list):

     >>> rdd1 = inputRDD.map(lambda x: x.split(' ')).flatMap(lambda x: [x[0]+' '+y for y in x[1].split('|')])

  7. Save this to your output file by collapsing all partitions into a single partition:

     >>> rdd1.coalesce(1).saveAsTextFile("Documents/SparkPractice/outputFile")
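
Since the original question asks for Scala, the same pipeline translates almost line for line. A minimal sketch, assuming the same inputFile.txt and a spark-shell SparkContext sc:

// Read each line, split on the space, then flatMap over the pipe-separated values
val inputRDD = sc.textFile("Documents/SparkPractice/inputFile.txt")
val result = inputRDD.map(_.split(" ")).flatMap(arr => arr(1).split("\\|").map(v => arr(0) + " " + v))
result.coalesce(1).saveAsTextFile("Documents/SparkPractice/outputFile")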

Hope that my answer helps you!

We can read the text file and simply use RDD transformations for your solution:

val rdd = spark.sparkContext.textFile("file_path").map(x => x.split("\t")).map(x => (x.head, x.last))
val trdd = rdd.map { case (k, v) => v.split("\\|").map((k, _)) }
trdd.collect.foreach(_.foreach { case (k, v) => println(k + "\t" + v) })


The output looks like:
no1 1
no1 3
no1 4
no2 4
no2 5
no2 6
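
If you would rather write the result to a file than print it, you could flatten the nested arrays first. A sketch, assuming the trdd above; "output_path" is a placeholder:

// Flatten each Array[(String, String)] into individual pairs, then save
trdd.flatMap(x => x)
    .map { case (k, v) => k + "\t" + v }
    .coalesce(1)
    .saveAsTextFile("output_path")  // "output_path" is a placeholder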
