RDD data into multiple rows in spark-scala
I have a fixed-width text file (sample) with data
2107abc2018abn2019gfh
where all the rows' data are combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD:
2107abc
2018abn
2019gfh
where 2107 is one column and abc is another column.
Will this logic be applicable for a huge data file, like 1 GB or more?
I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line at length 7 and then again at length 4; that separates your columns. Below is the code for the same.
//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))
//Now first split at length 7 then again split at length 4 and create dataframe
val res = rdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))
//print the rdd
res.foreach(println)
//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
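The per-line splitting that the snippet relies on is plain Scala and can be sanity-checked outside Spark. A minimal sketch (`splitRow` is a hypothetical helper name, not part of the answer's code):

```scala
// Split one concatenated line into (4-char, 3-char) column pairs.
// grouped(7) yields fixed-width records; grouped(4) splits each record
// into its two columns (the second group is simply the remaining 3 chars).
def splitRow(line: String): Seq[(String, String)] =
  line.grouped(7).map { rec =>
    val cols = rec.grouped(4).toSeq // "2107abc" -> Seq("2107", "abc")
    (cols(0), cols(1))
  }.toSeq

splitRow("2107abc2018abn2019gfh")
// Seq((2107,abc), (2018,abn), (2019,gfh))
```

Because `grouped` returns a lazy `Iterator`, each line is processed record by record rather than materialised all at once.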
If you want, you can also convert your RDD to a DataFrame for further processing (in a standalone application this requires `import spark.implicits._`; in spark-shell the implicits are already in scope).
//convert to DF
val df = res.toDF("col1","col2")
//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
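One caveat in case the input does not divide evenly: `grouped` does not pad, so a trailing record shorter than 7 characters would be split into pieces of the wrong width (or only one piece), and the `x(1)` lookup could throw. A defensive variant of the per-line split (plain Scala; `splitRowSafe` is a hypothetical helper name) simply drops incomplete records:

```scala
// Same split as before, but skip any trailing record that is not
// exactly 7 characters wide instead of failing on it.
def splitRowSafe(line: String): Seq[(String, String)] =
  line.grouped(7).collect {
    case rec if rec.length == 7 => (rec.take(4), rec.drop(4))
  }.toSeq
```

Whether to drop, pad, or fail on malformed records depends on the data contract for the file; this sketch only shows the drop option.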