How to load a CSV file into a Spark DataFrame with an Array[Int] column

Question

Each line of my CSV file is structured like this:

u001, 2013-11, 0, 1, 2, ... , 99

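where u001 and 2013-11 are the UID and the date, and the numbers from 0 to 99 are the data values. I want to load this CSV file into a Spark DataFrame with this structure: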

+-------+-------------+-----------------+
|    uid|         date|       dataVector|
+-------+-------------+-----------------+
|   u001|      2013-11|  [0,1,...,98,99]|
|   u002|      2013-11| [1,2,...,99,100]|
+-------+-------------+-----------------+

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
 |    |-- element: integer (containsNull = true)

where dataVector is an Array[Int], and its length is the same for every UID and date. I have tried several ways to solve this, including:

Using a schema

 import org.apache.spark.sql.types._

 val attributes = Array("uid", "date", "dataVector")
 val schema = StructType(
   StructField(attributes(0), StringType, true) ::
   StructField(attributes(1), StringType, true) ::
   StructField(attributes(2), ArrayType(IntegerType), true) :: Nil)

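But this approach did not work well. Since my later datasets will have more than 100 data columns, manually creating a schema that lists every dataVector column would also be impractical.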

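Loading the CSV file directly without a schema and using the approach from "concatenate multiple columns into a single column" to join the data columns together, but the resulting schema looks like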

  root
   |-- uid: string (nullable = true)
   |-- date: string (nullable = true)
   |-- dataVector: struct (nullable = true)
   |    |-- _c3: string (containsNull = true)
   |    |-- _c4: string (containsNull = true)
   .
   .
   .
   |    |-- _c101: string (containsNull = true)

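This is still different from what I need, and I could not find a way to convert this struct into the structure I need. So my question is: how do I load the CSV file into the structure I need?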

Answer 1

Load it as-is, without a schema:

val df = spark.read.csv(path)

and select:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Combine data into array
val dataVector: Column = array(
  df.columns.drop(2).map(col): _*  // Skip first 2 columns
).cast("array<int>")  // Cast to the required type
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector

df.select(cols: _*).toDF("uid", "date", "dataVector")
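
For anyone who wants to try this end to end, here is a minimal sketch of the same idea run against a small temporary file. The file contents, the three-value rows (instead of 100), the local SparkSession and the names `raw` and `result` are assumptions made for illustration and are not part of the question. Because the sample rows have a space after each comma, the sketch also turns on the CSV reader options `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace`; without trimming, the cast to int may produce nulls on some Spark versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{array, col}

// Local session for this sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-to-array")
  .getOrCreate()

// Hypothetical sample file: three data values per row instead of 100.
val path = Files.createTempFile("sample", ".csv")
Files.write(path, "u001, 2013-11, 0, 1, 2\nu002, 2013-11, 1, 2, 3\n".getBytes("UTF-8"))

// Read without a schema: every column arrives as a string named _c0, _c1, ...
// The two options strip the spaces around each field before the cast below.
val raw = spark.read
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace", true)
  .csv(path.toString)

// Pack every column after the first two into one array and cast it to array<int>.
val dataVector: Column = array(raw.columns.drop(2).map(col): _*).cast("array<int>")

val result = raw.select(
  col(raw.columns(0)).as("uid"),
  col(raw.columns(1)).as("date"),
  dataVector.as("dataVector")
)

result.printSchema()
result.show(truncate = false)
```

Since the array is built from `raw.columns.drop(2)`, the same code should work unchanged when the number of data columns grows, which is the case the question was worried about.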

Answer 1: 2, accepted 2017-12-15 02:32:36