I have text data as below:
no1 1|3|4
no2 4|5|6
and I want to transform it as below using a Spark RDD and the Scala language:
no1 1
no1 3
no1 4
no2 4
no2 5
no2 6
I am very new to Spark and Scala, and I can't find any example that does this.
I recommend reading the file in as a DataFrame, whose API will receive more emphasis than the RDD API in future Spark versions. With a DataFrame, the task you are asking about is fairly straightforward using the split and explode functions:
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._  // for toDF and the $"..." column syntax

val df = Seq(("no1", "1|3|4"), ("no2", "4|5|6")).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: string]
df.show
+---+-----+
| A| B|
+---+-----+
|no1|1|3|4|
|no2|4|5|6|
+---+-----+
df.withColumn("B", explode(split($"B", "\\|"))).show
+---+---+
| A| B|
+---+---+
|no1| 1|
|no1| 3|
|no1| 4|
|no2| 4|
|no2| 5|
|no2| 6|
+---+---+
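If you would rather build the DataFrame from the text file itself instead of a hard-coded Seq, here is a minimal sketch; the space delimiter and the "inputFile.txt" path are assumptions borrowed from the question and the answer below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder.appName("SplitExplode").getOrCreate()
import spark.implicits._

val df = spark.read.textFile("inputFile.txt")  // Dataset[String], one record per line
  .map { line =>
    val Array(key, values) = line.split(" ", 2)  // "no1 1|3|4" -> ("no1", "1|3|4")
    (key, values)
  }
  .toDF("A", "B")

df.withColumn("B", explode(split($"B", "\\|"))).show()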
Suppose that you have your input in a file called inputFile.txt.
Read the File
>>> inputRDD = sc.textFile("Documents/SparkPractice/inputFile.txt")
The file will be read as:
>>> inputRDD.collect()
['no1 1|3|4', 'no2 4|5|6']
Now, first split each line, i.e. 'no1 1|3|4' and 'no2 4|5|6', on the space:
>>> rdd1 = inputRDD.map(lambda x: x.split(' '))
>>> rdd1.collect()
[['no1', '1|3|4'], ['no2', '4|5|6']]
Now we need to split '1|3|4' and '4|5|6'. Each element of rdd1 has two items (like 'no1' and '1|3|4', or 'no2' and '4|5|6'). Iterate over each element of rdd1 with a lambda and, within each element, use a list comprehension to concatenate x[0] ('no1'), a space, and each value of x[1].split('|') (['1', '3', '4']). The second element is handled the same way, concatenating x[0] ('no2'), a space, and each value of x[1].split('|') (['4', '5', '6']).
>>> rdd2 = rdd1.map(lambda x: [x[0] + ' ' + y for y in x[1].split('|')])
>>> rdd2.collect()
[['no1 1', 'no1 3', 'no1 4'], ['no2 4', 'no2 5', 'no2 6']]
Finally, flatten rdd2. flatMap will collapse all the lists and put their items into a single list:
>>> rdd3 = rdd2.flatMap(lambda x: x)
>>> rdd3.collect()
['no1 1', 'no1 3', 'no1 4', 'no2 4', 'no2 5', 'no2 6']
Finally, you can combine all these steps as:
rdd1 = inputRDD.map(lambda x: x.split(' ')).flatMap(lambda x: [x[0]+' '+y for y in x[1].split('|')])
Save this in your output file by collapsing all partitions to a single partition:
rdd1.coalesce(1).saveAsTextFile("Documents/SparkPractice/outputFile")
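Since the question asks for Scala, here is a minimal sketch of the same pipeline translated to Scala (the paths are simply carried over from the Python example above):

val inputRDD = sc.textFile("Documents/SparkPractice/inputFile.txt")
val result = inputRDD
  .map(_.split(" "))  // 'no1 1|3|4' -> Array("no1", "1|3|4")
  .flatMap(parts => parts(1).split("\\|").map(n => parts(0) + " " + n))
result.coalesce(1).saveAsTextFile("Documents/SparkPractice/outputFile")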
Hope that my answer helps you!
We can read the text file and simply use RDD transformations for your solution:
val rdd = spark.sparkContext.textFile("file_path")
  .map(x => x.split("\t"))     // split the key from the values on the tab
  .map(x => (x.head, x.last))  // ("no1", "1|3|4")
val trdd = rdd.map { case (k, v) => v.split("\\|").map((k, _)) }  // pair the key with each value
trdd.collect.foreach(x => x.foreach(p => println(p._1 + "\t" + p._2)))
The output looks like:
no1 1
no1 3
no1 4
no2 4
no2 5
no2 6
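If you would rather write the result to a file than print it on the driver, here is a minimal sketch using flatMap so the pairs stay in a single flat RDD; "file_path" and "output_path" are placeholders, and the tab delimiter is carried over from the code above:

val flat = spark.sparkContext.textFile("file_path")
  .map(x => x.split("\t"))
  .flatMap { case Array(k, v) => v.split("\\|").map(n => k + "\t" + n) }
flat.saveAsTextFile("output_path")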