
Reading a multiline CSV file in Spark

I am trying to read a multiline CSV file in Spark. My schema is: Id, name and mark. My input and actual output are given below. I am not getting the expected output. Can someone please help me figure out what I am missing in my code?

Code:

val myMarkDF = spark
  .read
  .format("csv")
  .option("path", "mypath\\marks.csv")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("delimiter", ",")
  .load
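
For reference: the multiLine option only joins line breaks that occur inside quoted fields; it does not reassemble unquoted records that spill across physical lines, so every physical line of the input below still becomes its own row. A minimal sketch of what the option actually covers, assuming a hypothetical file quoted.csv:

// quoted.csv -- hypothetical file where the line break sits inside a quoted field:
// 1,A,"ninety
// seven"
val quoted = spark
  .read
  .option("multiLine", "true")
  .csv("mypath\\quoted.csv")
// parses as a single row whose third column contains the embedded newline;
// the unquoted marks.csv above is instead read line by line
quoted.show(truncate = false)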

Input:

1,A,
97,,
1,A,98
1,A,
99,,
2,B,100
2,B,95

Actual output:

+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
|  1|   A|null|
| 97|null|null|
|  1|   A|  98|
|  1|   A|null|
| 99|null|null|
|  2|   B| 100|
|  2|   B|  95|
+---+----+----+

Expected output:

+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
|  1|   A|  97|
|  1|   A|  98|
|  1|   A|  99|
|  2|   B| 100|
|  2|   B|  95|
+---+----+----+

Thanks!

EDIT: a better solution which handles more types of broken records (broken at the 2nd or 3rd column). The important part is the calculation of a cumulative sum of non-null entries, which groups together the rows that are supposed to be in the same record.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.csv("file.csv")
df.show
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
|  1|   A|null|
| 97|null|null|
|  1|   A|  98|
|  1|null|null|   <-- note that I intentionally changed these two rows
|  A|  99|null|   <-- to demonstrate how to handle two types of broken records
|  2|   B| 100|
|  2|   B|  95|
+---+----+----+
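
To make the cumulative-sum trick concrete, a hand-worked pass over these seven rows: the per-row non-null counts are 2, 1, 3, 1, 2, 3, 3; their running sum is 2, 3, 6, 7, 9, 12, 15; and ceil(sum / 3) gives 1, 1, 2, 3, 3, 4, 5. Rows that share a group id (rows 1-2, row 3, rows 4-5, row 6, row 7) are exactly the fragments of one logical record, which is what the code below computes.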
val df2 = df.withColumn(
    // preserve the original file order so the window below can walk it top-down
    "id", monotonically_increasing_id()
).withColumn(
    // count how many of the three columns are populated on each physical line
    "notnulls",
    $"_c0".isNotNull.cast("int") + $"_c1".isNotNull.cast("int") + $"_c2".isNotNull.cast("int")
).withColumn(
    // running total of those counts: every complete record contributes exactly 3,
    // so ceil(cumsum / 3) assigns one group id per logical record
    // (a window ordered without partitioning pulls everything into a single
    // partition, which is fine for files this small)
    "notnulls",
    ceil(sum($"notnulls").over(Window.orderBy("id")) / 3)
).groupBy("notnulls").agg(
    // glue each group's values back together and drop the padding nulls
    // (the Column-lambda form of filter requires Spark 3.0+)
    filter(
        flatten(collect_list(array("_c0", "_c1", "_c2"))),
        x => x.isNotNull
    ).alias("array")
).select(
    $"array"(0).alias("c0"),
    $"array"(1).alias("c1"),
    $"array"(2).alias("c2")
)

df2.show
+---+---+---+
| c0| c1| c2|
+---+---+---+
|  1|  A| 97|
|  1|  A| 98|
|  1|  A| 99|
|  2|  B|100|
|  2|  B| 95|
+---+---+---+
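
The reassembled columns come back as strings; if typed columns are needed downstream, a small follow-up sketch (the names Id, name and mark are taken from the question's schema):

val typed = df2.select(
    $"c0".cast("int").alias("Id"),
    $"c1".alias("name"),
    $"c2".cast("int").alias("mark")
)
typed.printSchema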

Old answer, which doesn't work too well:

Not the best way to parse a CSV, but at least an MVP for your use case:

// wholeTextFiles loads each file as a single (path, content) pair,
// so this only suits files that fit in memory
val df = sc.wholeTextFiles("marks.csv").map(
    // "97,," is the spilled tail of a record: drop its two padding commas;
    // a line ending in "," (e.g. "1,A,") broke mid-record: splice it onto the next line
    row => row._2.replace(",,\n", "\n").replace(",\n", ",").split("\n")
).toDF(
    "value"
).select(
    explode($"value")          // one row per repaired record
).select(
    split($"col", ",").as("col")
).select(
    $"col"(0), $"col"(1), $"col"(2)
)

df.show
+------+------+------+
|col[0]|col[1]|col[2]|
+------+------+------+
|     1|     A|    97|
|     1|     A|    98|
|     1|     A|    99|
|     2|     B|   100|
|     2|     B|    95|
+------+------+------+
