I have a fixed width text file(sample) with data
2107abc2018abn2019gfh
where all the rows data are combined as single row I need to read the textfile and split data according fixed row length=7 and generate multiple rows and store it in RDD.
2107abc
2018abn
2019gfh
where 2107
is one column and abc
is one more column
will the logic will be applicable for huge data file like 1 GB or more?
I'm amusing that you have RDD[String]
and you want to extract both columns from your data. First you can split the line at length 7 and then again at 4. You will get your columns separated. Below is the code for same.
//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))
//Now first split at length 7 then again split at length 4 and create dataframe
val res = rdd.flatMap(_.grouped(7).map(x=>x.grouped(4).toSeq)).map(x=> (x(0),x(1)))
//print the rdd
res.foreach(println)
//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
If you want you can also convert your RDD to dataframe for further processing.
//convert to DF
val df = res.toDF("col1","col2")
//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.