Reading a multiline CSV file in Spark
I am trying to read a multiline CSV file in Spark. My schema is: id, name, and mark. My input and actual output are given below; I am not getting the expected output. Can someone please tell me what I am missing in my code?
Code:
val myMarkDF = spark
  .read
  .format("csv")
  .option("path", "mypath\\marks.csv")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("delimiter", ",")
  .load()
Input:
1,A,
97,,
1,A,98
1,A,
99,,
2,B,100
2,B,95
Actual output:
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
| 1| A|null|
| 97|null|null|
| 1| A| 98|
| 1| A|null|
| 99|null|null|
| 2| B| 100|
| 2| B| 95|
+---+----+----+
Expected output:
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
| 1| A| 97|
| 1| A| 98|
| 1| A| 99|
| 2| B| 100|
| 2| B| 95|
+---+----+----+
Thanks!
EDIT: a better solution that handles more types of broken records (broken at the 2nd or 3rd column). The key part is computing a cumulative sum of non-null entries, which groups together the rows that belong to the same record.
val df = spark.read.csv("file.csv")
df.show
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
| 1| A|null|
| 97|null|null|
| 1| A| 98|
| 1|null|null| <-- note that I intentionally changed these two rows
| A| 99|null| <-- to demonstrate how to handle two types of broken records
| 2| B| 100|
| 2| B| 95|
+---+----+----+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df2 = df.withColumn(
  "id", monotonically_increasing_id()
).withColumn(
  // number of non-null cells in this row
  "notnulls",
  $"_c0".isNotNull.cast("int") + $"_c1".isNotNull.cast("int") + $"_c2".isNotNull.cast("int")
).withColumn(
  // running total of non-null cells; every 3 cells make up one record
  "notnulls",
  ceil(sum($"notnulls").over(Window.orderBy("id")) / 3)
).groupBy("notnulls").agg(
  // collect the pieces of each record and drop the nulls
  filter(
    flatten(collect_list(array("_c0", "_c1", "_c2"))),
    x => x.isNotNull
  ).alias("array")
).select(
  $"array"(0).alias("c0"),
  $"array"(1).alias("c1"),
  $"array"(2).alias("c2")
)
df2.show
+---+---+---+
| c0| c1| c2|
+---+---+---+
| 1| A| 97|
| 1| A| 98|
| 1| A| 99|
| 2| B|100|
| 2| B| 95|
+---+---+---+
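To see why the cumulative-sum trick works, here is the same grouping logic in plain Scala, without Spark (a sketch only; the `regroup` helper and the `Option`-based row encoding are mine, not part of the answer above):

```scala
object CumsumGrouping {
  // Reassemble broken rows: a record is complete once 3 non-null cells
  // have accumulated, mirroring ceil(sum(notnulls) / 3) in the Spark version.
  def regroup(rows: Seq[Seq[Option[String]]]): Seq[Seq[String]] = {
    val ids = rows
      .scanLeft(0)((acc, r) => acc + r.count(_.isDefined)).tail // running non-null count
      .map(c => math.ceil(c / 3.0).toInt)                       // record id
    rows.zip(ids)
      .groupBy(_._2).toSeq.sortBy(_._1)                         // rows sharing an id form one record
      .map { case (_, grp) => grp.flatMap(_._1.flatten) }       // merge pieces, drop the nulls
  }

  def main(args: Array[String]): Unit = {
    // Same rows as df.show above, as Option values
    val rows = Seq(
      Seq(Some("1"), Some("A"), None),
      Seq(Some("97"), None, None),
      Seq(Some("1"), Some("A"), Some("98")),
      Seq(Some("1"), None, None),
      Seq(Some("A"), Some("99"), None),
      Seq(Some("2"), Some("B"), Some("100")),
      Seq(Some("2"), Some("B"), Some("95"))
    )
    regroup(rows).foreach(r => println(r.mkString(",")))
  }
}
```

The running counts for the rows above are 2, 3, 6, 7, 9, 12, 15, so `ceil(count / 3)` yields record ids 1, 1, 2, 3, 3, 4, 5, which is exactly the grouping the `Window`-based `sum` produces.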
Old answer, which doesn't work as well:
Not the best way to parse a CSV, but at least an MVP for your use case:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = sc.wholeTextFiles("marks.csv").map(
  // re-join broken lines: ",,\n" ends a record, ",\n" continues one
  row => row._2.replace(",,\n", "\n").replace(",\n", ",").split("\n")
).toDF(
  "value"
).select(
  explode($"value")
).select(
  split($"col", ",").as("col")
).select(
  $"col"(0), $"col"(1), $"col"(2)
)
df.show
+------+------+------+
|col[0]|col[1]|col[2]|
+------+------+------+
| 1| A| 97|
| 1| A| 98|
| 1| A| 99|
| 2| B| 100|
| 2| B| 95|
+------+------+------+
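To see why this old approach is fragile, the string patching can be checked in plain Scala (a sketch; the `patch` helper name is mine). It re-joins records broken after the 3rd column but not those broken after the 1st:

```scala
object ReplaceFragility {
  // The old answer's fix-ups: ",,\n" marks the end of a broken record,
  // ",\n" marks a continuation onto the next line
  def patch(s: String): String = s.replace(",,\n", "\n").replace(",\n", ",")

  def main(args: Array[String]): Unit = {
    // A row broken after the 3rd column is re-joined correctly...
    println(patch("1,A,\n97,,\n") == "1,A,97\n")  // true
    // ...but a row broken after the 1st column stays split in two
    println(patch("1,,\nA,99,\n") == "1\nA,99,")  // true: still two lines
  }
}
```

This is the failure mode the cumulative-sum solution above was written to handle.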