I have text data as below:
no1 1|3|4
no2 4|5|6
and I want to transform it as below using a Spark RDD and the Scala language:
no1 1
no1 3
no1 4
no2 4
no2 5
no2 6
I am very new to Spark and Scala, and I can't find any example that does this.
I recommend reading the file in as a DataFrame, whose API will receive more emphasis than the RDD API in future Spark versions. With a DataFrame, the task you are asking about is fairly straightforward using the split and explode functions:
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._  // for toDF and the $"..." column syntax

val df = Seq(("no1", "1|3|4"), ("no2", "4|5|6")).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: string]
df.show
+---+-----+
| A| B|
+---+-----+
|no1|1|3|4|
|no2|4|5|6|
+---+-----+
df.withColumn("B", explode(split($"B", "\\|"))).show
+---+---+
| A| B|
+---+---+
|no1| 1|
|no1| 3|
|no1| 4|
|no2| 4|
|no2| 5|
|no2| 6|
+---+---+
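If you would rather build the DataFrame from the text file itself instead of a hard-coded Seq, here is a minimal sketch; the space delimiter and the "inputFile.txt" path are assumptions borrowed from the question and the answer below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder.appName("SplitExplode").getOrCreate()
import spark.implicits._

val df = spark.read.textFile("inputFile.txt")  // Dataset[String], one record per line
  .map { line =>
    val Array(key, values) = line.split(" ", 2)  // "no1 1|3|4" -> ("no1", "1|3|4")
    (key, values)
  }
  .toDF("A", "B")

df.withColumn("B", explode(split($"B", "\\|"))).show()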
Suppose that you have your input in a file called inputFile.txt.
Read the File
>>> inputRDD = sc.textFile("Documents/SparkPractice/inputFile.txt")
The file will be read as:
>>> inputRDD.collect()
['no1 1|3|4', 'no2 4|5|6']
Now, first split each line, i.e. 'no1 1|3|4' and 'no2 4|5|6', on the space:
>>> rdd1 = inputRDD.map(lambda x: x.split(' '))
>>> rdd1.collect()
[['no1', '1|3|4'], ['no2', '4|5|6']]
Now we need to split '1|3|4' and '4|5|6'. Each element of rdd1 has two items (like 'no1' and '1|3|4', or 'no2' and '4|5|6'). Iterate over each element of rdd1 with a lambda and, within each element, use a list comprehension to concatenate x[0] ('no1'), a space, and each value of x[1].split('|') (['1', '3', '4']). The second element is handled the same way, concatenating x[0] ('no2'), a space, and each value of x[1].split('|') (['4', '5', '6']).
>>> rdd2 = rdd1.map(lambda x: [x[0] + ' ' + y for y in x[1].split('|')])
>>> rdd2.collect()
[['no1 1', 'no1 3', 'no1 4'], ['no2 4', 'no2 5', 'no2 6']]
Finally, flatten rdd2. flatMap will collapse all the lists and put their items into a single list:
>>> rdd3 = rdd2.flatMap(lambda x: x)
>>> rdd3.collect()
['no1 1', 'no1 3', 'no1 4', 'no2 4', 'no2 5', 'no2 6']
Finally, you can combine all these steps as:
rdd1 = inputRDD.map(lambda x: x.split(' ')).flatMap(lambda x: [x[0]+' '+y for y in x[1].split('|')])
Save this in your output file by collapsing all partitions to a single partition:
rdd1.coalesce(1).saveAsTextFile("Documents/SparkPractice/outputFile")
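Since the question asks for Scala, here is a minimal sketch of the same pipeline translated to Scala (the paths are simply carried over from the Python example above):

val inputRDD = sc.textFile("Documents/SparkPractice/inputFile.txt")
val result = inputRDD
  .map(_.split(" "))  // 'no1 1|3|4' -> Array("no1", "1|3|4")
  .flatMap(parts => parts(1).split("\\|").map(n => parts(0) + " " + n))
result.coalesce(1).saveAsTextFile("Documents/SparkPractice/outputFile")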
Hope that my answer helps you!
We can read the text file and simply use RDD transformations for your solution:
val rdd = spark.sparkContext.textFile("file_path")
  .map(x => x.split("\t"))     // split the key from the values on the tab
  .map(x => (x.head, x.last))  // ("no1", "1|3|4")
val trdd = rdd.map { case (k, v) => v.split("\\|").map((k, _)) }  // pair the key with each value
trdd.collect.foreach(x => x.foreach(p => println(p._1 + "\t" + p._2)))
The output looks like:
no1 1
no1 3
no1 4
no2 4
no2 5
no2 6
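If you would rather write the result to a file than print it on the driver, here is a minimal sketch using flatMap so the pairs stay in a single flat RDD; "file_path" and "output_path" are placeholders, and the tab delimiter is carried over from the code above:

val flat = spark.sparkContext.textFile("file_path")
  .map(x => x.split("\t"))
  .flatMap { case Array(k, v) => v.split("\\|").map(n => k + "\t" + n) }
flat.saveAsTextFile("output_path")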