
Split RDD data into multiple rows in spark-scala

I have a fixed-width text file (sample) with data

2107abc2018abn2019gfh

where all the rows' data is combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD:

2107abc

2018abn

2019gfh

where 2107 is one column and abc is one more column

Will the logic be applicable to a huge data file, like 1 GB or more?

I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line into chunks of length 7, and then split each chunk again at length 4. You will get your columns separated. Below is the code for the same.

//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))

//split each line into chunks of length 7, then split each chunk at length 4 into its two columns
val res = rdd.flatMap(_.grouped(7).map(x => x.grouped(4).toSeq)).map(x => (x(0), x(1)))

//print the rdd
res.foreach(println)

//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
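Note that sc.parallelize is used here only to build a test RDD. A minimal sketch of the same logic reading from an actual text file (the path below is hypothetical): sc.textFile splits the file into partitions and processes lines in parallel, so the approach also works for large files of 1 GB or more, provided each individual line fits in an executor's memory.

//read from a file instead of parallelize (hypothetical path)
val fileRdd = sc.textFile("input/fixed_width.txt")
val fileRes = fileRdd.flatMap(_.grouped(7).map(x => x.grouped(4).toSeq)).map(x => (x(0), x(1)))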

If you want, you can also convert your RDD to a DataFrame for further processing.

//convert to DF (assumes a SparkSession named spark, as in spark-shell)
import spark.implicits._
val df = res.toDF("col1", "col2")

//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
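As a sketch of such further processing (the cast and filter below are illustrative additions, not part of the original answer), the string column can be cast to an integer so numeric operations work:

//cast col1 to int so that numeric comparisons work
val typed = df.withColumn("col1", df("col1").cast("int"))
typed.filter(typed("col1") > 2018).show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2019| gfh|
//+----+----+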
