RDD data into multiple rows in spark-scala

I have a fixed-width text file (sample) with data

2107abc2018abn2019gfh

where all the rows' data are combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD:

2107abc

2018abn

2019gfh

where 2107 is one column and abc is another column.

Will the logic be applicable for a huge data file, like 1 GB or more?

I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line into chunks of length 7, and then split each chunk again at length 4. This separates your columns. Below is the code for the same.

//create a sample RDD from the given string
val rdd = sc.parallelize(Seq("2107abc2018abn2019gfh"))

//first split into chunks of length 7, then split each chunk at length 4
val res = rdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))

//print the RDD
res.foreach(println)

//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
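
The snippet above uses parallelize on a hard-coded sample. To read the actual fixed-width file from disk, a minimal sketch (the path is a placeholder, not from the original post) could look like this:

//read the fixed-width file; replace the placeholder path with yours
val fileRdd = sc.textFile("/path/to/fixedwidth.txt")

//apply the same splitting logic to every line of the file
val fileRes = fileRdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))

Regarding the 1 GB question: grouped is evaluated lazily per record and textFile distributes the input across partitions, so this logic scales to large files. The caveat is that textFile splits records on newlines, so if the entire file is literally one line with no line breaks, it arrives as a single record in one task, which can cause memory pressure.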

If you want, you can also convert your RDD to a DataFrame for further processing.

//convert to DataFrame (outside the spark-shell this needs: import spark.implicits._)
val df = res.toDF("col1","col2")

//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
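
As a hypothetical example of such further processing (using the col1/col2 names chosen above), the first column can be cast to an integer and used in a filter:

//illustrative only: cast col1 to integer and keep rows where it is greater than 2017
val filtered = df.withColumn("col1", df("col1").cast("int")).filter("col1 > 2017")
filtered.show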
