
Split RDD data into multiple rows in spark-scala

I have a fixed-width text file (sample) with data

2107abc2018abn2019gfh

where all the rows' data is combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD:

2107abc

2018abn

2019gfh

where 2107 is one column and abc is one more column

Will the logic be applicable to a huge data file, like 1 GB or more?

I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line into chunks of length 7, and then split each chunk again at length 4. You will get your columns separated. Below is the code for the same.

//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))

//split each line into chunks of length 7, then split each chunk at length 4 into its two columns
val res = rdd.flatMap(_.grouped(7).map(x => x.grouped(4).toSeq)).map(x => (x(0), x(1)))

//print the rdd
res.foreach(println)

//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
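Note that sc.parallelize is used here only to build a test RDD. A minimal sketch of the same logic reading from an actual text file (the path below is hypothetical): sc.textFile splits the file into partitions and processes lines in parallel, so the approach also works for large files of 1 GB or more, provided each individual line fits in an executor's memory.

//read from a file instead of parallelize (hypothetical path)
val fileRdd = sc.textFile("input/fixed_width.txt")
val fileRes = fileRdd.flatMap(_.grouped(7).map(x => x.grouped(4).toSeq)).map(x => (x(0), x(1)))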

If you want, you can also convert your RDD to a DataFrame for further processing.

//convert to DF (assumes a SparkSession named spark, as in spark-shell)
import spark.implicits._
val df = res.toDF("col1", "col2")

//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
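As a sketch of such further processing (the cast and filter below are illustrative additions, not part of the original answer), the string column can be cast to an integer so numeric operations work:

//cast col1 to int so that numeric comparisons work
val typed = df.withColumn("col1", df("col1").cast("int"))
typed.filter(typed("col1") > 2018).show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2019| gfh|
//+----+----+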
