
Reading a multiline CSV file in Spark

I am trying to read a multiline CSV file in Spark. My schema is: Id, name and mark. My input and actual output are given below. I am not getting the expected output. Can someone please help me figure out what I am missing in my code?

Code:

val myMarkDF = spark
  .read
  .format("csv")
  .option("path", "mypath\\marks.csv")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("delimiter", ",")
  .load
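
For reference: the multiLine option only joins line breaks that occur inside quoted fields; it does not reassemble unquoted records that spill across physical lines, so every physical line of the input below still becomes its own row. A minimal sketch of what the option actually covers, assuming a hypothetical file quoted.csv:

// quoted.csv -- hypothetical file where the line break sits inside a quoted field:
// 1,A,"ninety
// seven"
val quoted = spark
  .read
  .option("multiLine", "true")
  .csv("mypath\\quoted.csv")
// parses as a single row whose third column contains the embedded newline;
// the unquoted marks.csv above is instead read line by line
quoted.show(truncate = false)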

Input:

1,A,
97,,
1,A,98
1,A,
99,,
2,B,100
2,B,95

Actual output:

+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
|  1|   A|null|
| 97|null|null|
|  1|   A|  98|
|  1|   A|null|
| 99|null|null|
|  2|   B| 100|
|  2|   B|  95|
+---+----+----+

Expected output:

+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
|  1|   A|  97|
|  1|   A|  98|
|  1|   A|  99|
|  2|   B| 100|
|  2|   B|  95|
+---+----+----+

Thanks!

EDIT: a better solution which handles more types of broken records (broken at the 2nd or 3rd column). The important part is the calculation of a cumulative sum of non-null entries, which groups together the rows that are supposed to be in the same record.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.csv("file.csv")
df.show
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
|  1|   A|null|
| 97|null|null|
|  1|   A|  98|
|  1|null|null|   <-- note that I intentionally changed these two rows
|  A|  99|null|   <-- to demonstrate how to handle two types of broken records
|  2|   B| 100|
|  2|   B|  95|
+---+----+----+
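
To make the cumulative-sum trick concrete, a hand-worked pass over these seven rows: the per-row non-null counts are 2, 1, 3, 1, 2, 3, 3; their running sum is 2, 3, 6, 7, 9, 12, 15; and ceil(sum / 3) gives 1, 1, 2, 3, 3, 4, 5. Rows that share a group id (rows 1-2, row 3, rows 4-5, row 6, row 7) are exactly the fragments of one logical record, which is what the code below computes.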
val df2 = df.withColumn(
    // preserve the original file order so the window below can walk it top-down
    "id", monotonically_increasing_id()
).withColumn(
    // count how many of the three columns are populated on each physical line
    "notnulls",
    $"_c0".isNotNull.cast("int") + $"_c1".isNotNull.cast("int") + $"_c2".isNotNull.cast("int")
).withColumn(
    // running total of those counts: every complete record contributes exactly 3,
    // so ceil(cumsum / 3) assigns one group id per logical record
    // (a window ordered without partitioning pulls everything into a single
    // partition, which is fine for files this small)
    "notnulls",
    ceil(sum($"notnulls").over(Window.orderBy("id")) / 3)
).groupBy("notnulls").agg(
    // glue each group's values back together and drop the padding nulls
    // (the Column-lambda form of filter requires Spark 3.0+)
    filter(
        flatten(collect_list(array("_c0", "_c1", "_c2"))),
        x => x.isNotNull
    ).alias("array")
).select(
    $"array"(0).alias("c0"),
    $"array"(1).alias("c1"),
    $"array"(2).alias("c2")
)

df2.show
+---+---+---+
| c0| c1| c2|
+---+---+---+
|  1|  A| 97|
|  1|  A| 98|
|  1|  A| 99|
|  2|  B|100|
|  2|  B| 95|
+---+---+---+
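
The reassembled columns come back as strings; if typed columns are needed downstream, a small follow-up sketch (the names Id, name and mark are taken from the question's schema):

val typed = df2.select(
    $"c0".cast("int").alias("Id"),
    $"c1".alias("name"),
    $"c2".cast("int").alias("mark")
)
typed.printSchema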

Old answer, which doesn't work too well:

Not the best way to parse a CSV, but at least an MVP for your use case:

// wholeTextFiles loads each file as a single (path, content) pair,
// so this only suits files that fit in memory
val df = sc.wholeTextFiles("marks.csv").map(
    // "97,," is the spilled tail of a record: drop its two padding commas;
    // a line ending in "," (e.g. "1,A,") broke mid-record: splice it onto the next line
    row => row._2.replace(",,\n", "\n").replace(",\n", ",").split("\n")
).toDF(
    "value"
).select(
    explode($"value")          // one row per repaired record
).select(
    split($"col", ",").as("col")
).select(
    $"col"(0), $"col"(1), $"col"(2)
)

df.show
+------+------+------+
|col[0]|col[1]|col[2]|
+------+------+------+
|     1|     A|    97|
|     1|     A|    98|
|     1|     A|    99|
|     2|     B|   100|
|     2|     B|    95|
+------+------+------+
