RDD data into multiple rows in spark-scala

I have a fixed-width text file (sample) with data

2107abc2018abn2019gfh

where all the rows' data are combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD:

2107abc

2018abn

2019gfh

where 2107 is one column and abc is another column.

Will the logic be applicable for a huge data file, like 1 GB or more?

I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line into chunks of length 7, and then split each chunk again at length 4. This separates your columns. Below is the code for the same.

//create a sample RDD from the given string
val rdd = sc.parallelize(Seq("2107abc2018abn2019gfh"))

//first split into chunks of length 7, then split each chunk at length 4
val res = rdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))

//print the RDD
res.foreach(println)

//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
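
The snippet above uses parallelize on a hard-coded sample. To read the actual fixed-width file from disk, a minimal sketch (the path is a placeholder, not from the original post) could look like this:

//read the fixed-width file; replace the placeholder path with yours
val fileRdd = sc.textFile("/path/to/fixedwidth.txt")

//apply the same splitting logic to every line of the file
val fileRes = fileRdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))

Regarding the 1 GB question: grouped is evaluated lazily per record and textFile distributes the input across partitions, so this logic scales to large files. The caveat is that textFile splits records on newlines, so if the entire file is literally one line with no line breaks, it arrives as a single record in one task, which can cause memory pressure.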

If you want, you can also convert your RDD to a DataFrame for further processing.

//convert to DataFrame (outside the spark-shell this needs: import spark.implicits._)
val df = res.toDF("col1","col2")

//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
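
As a hypothetical example of such further processing (using the col1/col2 names chosen above), the first column can be cast to an integer and used in a filter:

//illustrative only: cast col1 to integer and keep rows where it is greater than 2017
val filtered = df.withColumn("col1", df("col1").cast("int")).filter("col1 > 2017")
filtered.show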
