Spark add column to dataframe when reading csv
I have a CSV whose data looks like this:
0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"
I want to convert it into a DataFrame whose last column is named "value". I wrote this code in Scala:
val rawdf = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .load(CSVPATH)
But with rawdf.show(numRows = 4) I get this result:
+---+---+---+---+---+---+---+---+
|0,0|1,0|2,0|3,0|4,0|6,0|8,0|9,1|
+---+---+---+---+---+---+---+---+
|4,0|2,1|2,0|1,0|1,0|0,1|3,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|4,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|5,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|6,0|1,0|
+---+---+---+---+---+---+---+---+
How can I add the last column in Spark? Should I write it into the CSV file?
Here is a way to do this without changing the CSV file: you can set the schema in code:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One field per header column, plus a ninth field for the trailing
// column that the header line does not name ("X" here; any name works).
val schema = StructType(
  Array(
    StructField("0,0", StringType),
    StructField("1,0", StringType),
    StructField("2,0", StringType),
    StructField("3,0", StringType),
    StructField("4,0", StringType),
    StructField("6,0", StringType),
    StructField("8,0", StringType),
    StructField("9,1", StringType),
    StructField("X", StringType)
  )
)
val rawdf =
  spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", ";")
    .schema(schema)
    .load("tmp.csv")
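If typing out every field by hand is inconvenient, the same schema could be derived from the header line. The following is only a sketch under assumptions: `HeaderSchema` and `withExtraColumn` are hypothetical names introduced here, not part of Spark, and the header line is assumed to be available as a string (for example, read separately from the first line of the file).

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object HeaderSchema {
  // Hypothetical helper: builds an all-string schema from the header names
  // plus one extra, caller-chosen name for the trailing column that the
  // header line omits.
  def withExtraColumn(headerLine: String,
                      extraName: String,
                      delimiter: String = ";"): StructType = {
    // Quote the delimiter so it is treated literally, and use limit -1
    // so that trailing empty names are not dropped.
    val names = headerLine.split(Pattern.quote(delimiter), -1).toSeq :+ extraName
    StructType(names.map(name => StructField(name, StringType, nullable = true)))
  }
}
```

With the sample file, HeaderSchema.withExtraColumn("0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1", "value") would produce nine string fields, the last one named "value" as the question asks, and the result can be passed to .schema(...) in the same way as the hand-written schema.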
Spark tries to map the data columns based on the number of columns available in the header line, because you set:
.option("header", "true")
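You can solve this in either of the following two ways, both of which change the header line of the CSV. For example, end the header with an extra delimiter: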
0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1;
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"
Or give the extra column a name in the header:
0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1;col_end
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"
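Once the header accounts for the ninth column, the file loads with all nine columns, and the last one can be renamed to "value" as the question asks. The sketch below renames the last column by position, so it does not depend on what that column ended up being called. `LastColumn` and `rename` are hypothetical names introduced here, and the DataFrame is assumed to have at least one column.

```scala
import org.apache.spark.sql.DataFrame

object LastColumn {
  // Hypothetical helper: renames the last column of the DataFrame,
  // whatever its current name is, to newName.
  def rename(df: DataFrame, newName: String): DataFrame =
    df.withColumnRenamed(df.columns.last, newName)
}
```

For example, LastColumn.rename(rawdf, "value") applied to the DataFrame loaded from either modified file should leave the first eight column names unchanged and name the ninth "value".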
If you don't know the length of the data rows, you can read the file as an RDD, do some parsing, and then create a schema to form a DataFrame, as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.util.Try
// read the data as an RDD and split each line (limit -1 keeps trailing empty fields)
val rddData = spark.sparkContext.textFile(CSVPATH)
  .map(_.split(";", -1))
// get the maximum row length from the data and create the schema from it
val maxlength = rddData.map(_.length).max()
val schema = StructType((1 to maxlength).map(x => StructField(s"col_${x}", StringType, true)))
// pad every row to maxlength with the string "null" where there is no data, then form the DataFrame
val rawdf = spark.createDataFrame(
  rddData.map(x => Row.fromSeq((0 until maxlength).map(index => Try(x(index)).getOrElse("null")))),
  schema)
rawdf.show(false)
which should give you
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|0,0 |1,0 |2,0 |3,0 |4,0 |6,0 |8,0 |9,1 |null |
|4,0 |2,1 |2,0 |1,0 |1,0 |0,1 |3,0 |1,0 |"BC" |
|4,0 |2,1 |2,0 |1,0 |1,0 |0,1 |4,0 |1,0 |"BC" |
|4,0 |2,1 |2,0 |1,0 |1,0 |0,1 |5,0 |1,0 |"BC" |
|4,0 |2,1 |2,0 |1,0 |1,0 |0,1 |6,0 |1,0 |"BC" |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
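The padding step can also be written without Try and checked on plain collections before involving Spark. This is a minimal sketch under assumptions: `PadRows` and `padRows` are hypothetical names, not part of Spark or of the answer above, and it pads with a real null rather than the string "null", which is a deliberate difference from the code above.

```scala
import java.util.regex.Pattern

object PadRows {
  // Hypothetical helper: splits each line on the delimiter and pads the
  // shorter rows with null so that every row has the same number of fields.
  def padRows(lines: Seq[String], delimiter: String = ";"): Seq[Seq[String]] = {
    // limit -1 keeps trailing empty fields, as in the RDD version
    val split = lines.map(_.split(Pattern.quote(delimiter), -1).toSeq)
    val width = if (split.isEmpty) 0 else split.map(_.length).max
    split.map(_.padTo(width, null: String))
  }
}
```

In the RDD version, the same idea would mean calling padTo(maxlength, null: String) on each split row and passing the result to Row.fromSeq in place of the Try-based expression.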
I hope the answer is helpful