How to Split the row by nth delimiter in Spark Scala
I have the following data stored in a CSV file:
1|Roy|NA|2|Marry|4.6|3|Richard|NA|4|Joy|NA|5|Joe|NA|6|Jos|9|
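I want to read the file into a Spark DataFrame, but before storing it in the DataFrame I want to split the line at every third | and store each piece as a separate row.
Expected output: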
1|Roy|NA|
2|Marry|4.6|
3|Richard|NA|
4|Joy|NA|
5|Joe|NA|
6|Jos|9|
Can you help me get the output above?
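To make the requirement concrete, here is a minimal plain-Scala sketch (no Spark involved) of the transformation on a single line. The object and method names are hypothetical and only illustrate the intended result. It assumes the fields themselves never contain a `|`.

```scala
object SplitEveryThirdSketch {
  // Hypothetical helper: splits `line` on '|' and re-joins every `n` fields,
  // appending the trailing delimiter as in the expected output.
  def everyNth(line: String, n: Int = 3): List[String] =
    line.split('|')                    // trailing empty field is dropped by split
      .grouped(n)                      // chunks of n fields; the last chunk may be shorter
      .map(_.mkString("", "|", "|"))   // "a", "b", "c" becomes "a|b|c|"
      .toList
}
```

Calling `SplitEveryThirdSketch.everyNth(line)` on the sample line should give one string per group of three fields.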
First, read your CSV file:
val df = spark.read.option("delimiter", "|").csv(file)
This gives you the following DataFrame (the output shows three rows, which suggests the sample line appears three times in the file that was used):
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4  |_c5|_c6|_c7    |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
The trailing delimiter in the CSV file creates an extra last column, so we drop it:
val dataframe = df.drop(df.schema.last.name)
dataframe.show(false)
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4  |_c5|_c6|_c7    |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
Next, create an array with the list of column names you want in the final DataFrame:
val names : Array[String] = Array("colOne", "colTwo", "colThree")
Finally, you need a function that reads the columns three at a time:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

def splitCSV(dataFrame: DataFrame, columnNames: Array[String], sparkSession: SparkSession): DataFrame = {
  import sparkSession.implicits._
  val columns = dataFrame.columns
  // Start from an empty three-column DataFrame and append one group per iteration.
  var finalDF: DataFrame = Seq.empty[(String, String, String)].toDF(columnNames: _*)
  // Inclusive `to`, so the last group of three columns is not skipped.
  for (order <- 0 to columns.length - 3 by 3) {
    finalDF = finalDF.union(dataFrame.select(
      col(columns(order)).as(columnNames(0)),
      col(columns(order + 1)).as(columnNames(1)),
      col(columns(order + 2)).as(columnNames(2))))
  }
  finalDF
}
After applying this function to the DataFrame:
val finalDF = splitCSV(dataframe, names, spark)
finalDF.show(false)
+------+-------+--------+
|colOne|colTwo |colThree|
+------+-------+--------+
|1     |Roy    |NA      |
|1     |Roy    |NA      |
|1     |Roy    |NA      |
|2     |Marry  |4.6     |
|2     |Marry  |4.6     |
|2     |Marry  |4.6     |
|3     |Richard|NA      |
|3     |Richard|NA      |
|3     |Richard|NA      |
|4     |Joy    |NA      |
|4     |Joy    |NA      |
|4     |Joy    |NA      |
|5     |Joe    |NA      |
|5     |Joe    |NA      |
|5     |Joe    |NA      |
|6     |Jos    |9       |
|6     |Jos    |9       |
|6     |Jos    |9       |
+------+-------+--------+
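As a possible alternative to calling `union` once per group, the same regrouping can be sketched in a single `select`: pack each group of three columns into a struct, collect the structs into an array, and `explode` that array. This is only a sketch, not the method from the answer above. `ExplodeSketch` and `splitByExplode` are hypothetical names, and it assumes every column has the same type (true for a CSV read without a schema, where all columns are strings) and that the DataFrame has at least `names.length` columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, struct}

object ExplodeSketch {
  // Hypothetical helper, not part of the answer above.
  // Assumes all columns share one type and df has at least names.length columns.
  def splitByExplode(df: DataFrame, names: Seq[String]): DataFrame = {
    val n = names.length
    // One struct per complete group of n consecutive columns;
    // an incomplete trailing group is ignored.
    val groups = df.columns.toList
      .grouped(n)
      .filter(_.length == n)
      .map(cols => struct(cols.zip(names).map { case (c, name) => col(c).as(name) }: _*))
      .toList
    // Pack the structs into an array, emit one row per element,
    // then flatten the struct back into top-level columns.
    df.select(explode(array(groups: _*)).as("g")).select("g.*")
  }
}
```

It could be called as `ExplodeSketch.splitByExplode(dataframe, names.toSeq).show(false)`. Note that the rows then come out grouped per input line rather than per column group, so the row order can differ from the `union` version.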
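In most cases you can do this with a regular expression. There is no direct regex for "split on the nth occurrence of a match", so we work around it by using a match to pick out the pattern and then inserting a custom delimiter that we can split on.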
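import org.apache.spark.sql.functions.{explode, length, regexp_replace, split}
import spark.implicits._ // enables the 'value column syntax

// Assumption: ds holds one input line per row in a column named "value", e.g. spark.read.textFile(file)
ds
  .withColumn("value",
    regexp_replace('value, "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|", "$1|$2|$3||")) // 1
  .withColumn("value", explode(split('value, "\\|\\|"))) // 2
  .where(length('value) > 0) // 3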
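Explanation
1. Replace every group of three | with its own components, then terminate it with ||
2. Split on each || and use explode to move each piece to a separate row
3. split picks up the empty group at the end, so we filter it out
Output for the given input: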
+------------+
|value       |
+------------+
|1|Roy|NA    |
|2|Marry|4.6 |
|3|Richard|NA|
|4|Joy|NA    |
|5|Joe|NA    |
|6|Jos|9     |
+------------+
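To see what the three steps do without starting Spark, the same pattern and replacement can be traced on a plain string. This is only an illustration under the assumption that Java's `String.replaceAll` and `String.split` treat this particular pattern the same way Spark's `regexp_replace` and `split` do. `RegexSplitSketch` and `splitEveryThird` are hypothetical names.

```scala
object RegexSplitSketch {
  // Same pattern as in the Spark snippet: three fields, each followed by a '|'.
  val pattern = "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|"

  def splitEveryThird(line: String): List[String] = {
    // Step 1: rewrite each group of three fields so that it ends in "||".
    val marked = line.replaceAll(pattern, "$1|$2|$3||")
    // Step 2: split on "||"; limit -1 keeps the trailing empty string.
    val pieces = marked.split("\\|\\|", -1).toList
    // Step 3: drop the empty piece left after the final "||".
    pieces.filter(_.nonEmpty)
  }
}
```

For the sample line, step 1 produces a string of the form `1|Roy|NA||2|Marry|4.6||...`, which is why splitting on `||` yields one piece per group plus an empty piece at the end.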