Spark: adding a column to the DataFrame when reading a CSV

Question

I have a CSV whose data looks like this:

0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"

I want to turn it into a DataFrame whose last column is named "value". I have written this code in Scala:

val rawdf = spark.read.format("csv")
                 .option("header", "true")
                 .option("delimiter", ";")
                 .load(CSVPATH)

But rawdf.show(numRows = 4) gives me this result, with the ninth column missing:

+---+---+---+---+---+---+---+---+
|0,0|1,0|2,0|3,0|4,0|6,0|8,0|9,1|
+---+---+---+---+---+---+---+---+
|4,0|2,1|2,0|1,0|1,0|0,1|3,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|4,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|5,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|6,0|1,0|
+---+---+---+---+---+---+---+---+

How can I add the last column in Spark? Should I write its name into the CSV file instead?

Answer 1

Here is a way to do this without changing the CSV file: you can set the schema in your code.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One field per header name, plus a ninth field for the unnamed last column
val schema = StructType(
    Array(
        StructField("0,0", StringType),
        StructField("1,0", StringType),
        StructField("2,0", StringType),
        StructField("3,0", StringType),
        StructField("4,0", StringType),
        StructField("6,0", StringType),
        StructField("8,0", StringType),
        StructField("9,1", StringType),
        StructField("X", StringType)
    )
)

val rawdf = 
    spark.read.format("csv")
        .option("header", "true")
        .option("delimiter", ";")
        .schema(schema)
        .load("tmp.csv")
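
Since the question asks for the last column to be called "value", one variation is to build the same kind of schema from a list of names rather than writing each StructField by hand. This is only a sketch: the names are copied from the sample header, "value" comes from the question, and the object name, app name, and local master are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadCsvWithValueColumn {
  // Header names from the sample file, plus the name wanted for the ninth column.
  val columnNames: Seq[String] =
    Seq("0,0", "1,0", "2,0", "3,0", "4,0", "6,0", "8,0", "9,1", "value")

  // Every column is read as a nullable string, as in the hand-written schema.
  val schema: StructType =
    StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-extra-column") // hypothetical app name
      .master("local[*]")          // assumption: running locally
      .getOrCreate()

    val rawdf = spark.read.format("csv")
      .option("header", "true")    // the first line is still skipped as a header
      .option("delimiter", ";")
      .schema(schema)              // nine fields, so the ninth data column is kept
      .load("tmp.csv")

    rawdf.show(numRows = 4)
    spark.stop()
  }
}
```

If the column names are not known in advance, the list could instead be derived from the first line of the file before appending "value".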

Answer 2

Spark tries to map the data columns to however many header columns are available, because you set:

.option("header", "true")

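You can solve this in one of two ways:

Set header = false
Add a header column for the last data column, or just add a semicolon (;) at the end of the header line.

For example: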

0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1;
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"

or

0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1;col_end
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"
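
If you would rather not edit the file by hand, the second option can be scripted. The following is a minimal sketch in plain Scala, assuming the file is small enough to read into memory; appendHeaderColumn, the object name, and the output file name tmp_patched.csv are hypothetical names chosen for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object PatchCsvHeader {
  /** Appends `delimiter + name` to the first line unless it already ends with it. */
  def appendHeaderColumn(lines: Seq[String], name: String, delimiter: String = ";"): Seq[String] =
    lines match {
      case header +: rest if !header.endsWith(delimiter + name) =>
        (header + delimiter + name) +: rest
      case other => other // empty input, or header already patched
    }

  def main(args: Array[String]): Unit = {
    val in  = Paths.get("tmp.csv")         // input path, as in Answer 1
    val out = Paths.get("tmp_patched.csv") // hypothetical output path

    // Reads the whole file into memory, so this only suits small files.
    val lines = Files.readAllLines(in, StandardCharsets.UTF_8).asScala.toList
    Files.write(out, appendHeaderColumn(lines, "value").asJava, StandardCharsets.UTF_8)
  }
}
```

The patched file can then be read with the original code from the question, since the header line now has nine names.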

Answer 3

If you don't know the length of the data rows, you can read the file as an RDD, do some parsing, and then create a schema to build a DataFrame, as follows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// read the data as an RDD and split each line; limit -1 keeps trailing empty fields
val rddData = spark.sparkContext.textFile(CSVPATH)
    .map(_.split(";", -1))

// get the maximum row length from the data and create the schema
val maxlength = rddData.map(_.length).max()
val schema = StructType((1 to maxlength).map(x => StructField(s"col_${x}", StringType, true)))

// pad every row to maxlength with null where there is no data, then use the schema to form the DataFrame
val rawdf = spark.createDataFrame(
    rddData.map(x => Row.fromSeq((0 until maxlength).map(index => if (index < x.length) x(index) else null))),
    schema)

rawdf.show(false)

which should give you

+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|0,0  |1,0  |2,0  |3,0  |4,0  |6,0  |8,0  |9,1  |null |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |3,0  |1,0  |"BC" |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |4,0  |1,0  |"BC" |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |5,0  |1,0  |"BC" |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |6,0  |1,0  |"BC" |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
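
The two details that make this work are the -1 limit passed to split and the padding of short rows. Here is a small sketch of both in plain Scala, without Spark; splitFields and padRow are hypothetical helper names, not part of any library.

```scala
object SplitAndPad {
  // With limit -1, String.split keeps trailing empty fields instead of dropping them.
  def splitFields(line: String): Array[String] = line.split(";", -1)

  // Pads a row with null up to `width`; fields beyond `width` are ignored.
  def padRow(fields: Array[String], width: Int): Seq[String] =
    (0 until width).map(i => if (i < fields.length) fields(i) else null)

  def main(args: Array[String]): Unit = {
    val header = "0,0;9,1;"
    println(header.split(";").length)   // 2: the trailing empty field is dropped
    println(splitFields(header).length) // 3: the trailing empty field is kept

    // A short row is padded so that every row matches the schema width.
    println(padRow(splitFields("4,0;2,1"), 3))
  }
}
```

Without the -1 limit, a header line ending in a semicolon would be counted one field short, and maxlength could come out too small if every line ended that way.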

I hope this answer helps.

3 answers

Answer 1
Score: 3 (accepted), 2018-08-22 08:32:49

Answer 2
Score: 0, 2018-08-22 08:09:59

Answer 3
Score: 0, 2018-08-22 08:45:01
