spark dataframe 正在從 csv 文件加載所有空值

Question

我有一個包含以下數據的文件

####$ cat products.csv 
1,tv,sony,hd,699
2,tv,sony,uhd,799
3,tv,samsung,hd,599
4,tv,samsung,uhd,799
5,phone,iphone,x,999
6,phone,iphone,11,999
7,phone,samsung,10,899
8,phone,samsung,10note,999
9,phone,pixel,4,799
10,phone,pixel,3,699

我試圖將其加載到 spark dataframe 中，它沒有給我任何錯誤，但它正在加載所有空值。

scala> val productSchema = StructType((Array(StructField("productId",IntegerType,true),StructField("productType",IntegerType,true),StructField("company",IntegerType,true),StructField("model",IntegerType,true),StructField("price",IntegerType,true))))
productSchema: org.apache.spark.sql.types.StructType = StructType(StructField(productId,IntegerType,true), StructField(productType,IntegerType,true), StructField(company,IntegerType,true), StructField(model,IntegerType,true), StructField(price,IntegerType,true))

scala> val df = spark.read.format("csv").option("header", "false").schema(productSchema).load("/path/products_js/products.csv")
df: org.apache.spark.sql.DataFrame = [productId: int, productType: int ... 3 more fields]

scala> df.show
+---------+-----------+-------+-----+-----+
|productId|productType|company|model|price|
+---------+-----------+-------+-----+-----+
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
+---------+-----------+-------+-----+-----+

現在我嘗試了一種不同的方式來加載數據並且它有效

scala> val temp = spark.read.csv("/path/products_js/products.csv")
temp: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 3 more fields]

scala> temp.show
+---+-----+-------+------+---+
|_c0|  _c1|    _c2|   _c3|_c4|
+---+-----+-------+------+---+
|  1|   tv|   sony|    hd|699|
|  2|   tv|   sony|   uhd|799|
|  3|   tv|samsung|    hd|599|
|  4|   tv|samsung|   uhd|799|
|  5|phone| iphone|     x|999|
|  6|phone| iphone|    11|999|
|  7|phone|samsung|    10|899|
|  8|phone|samsung|10note|999|
|  9|phone|  pixel|     4|799|
| 10|phone|  pixel|     3|699|
+---+-----+-------+------+---+

在第二種方法中，它加載了數據，但我無法將該方案添加到 dataframe。 兩種加載數據的方法有什么區別，為什么第一種方法加載 null？ 誰能幫我

Answer 1

您首先將列的字符串類型定義為錯誤的整數類型。 這是有效的，

import org.apache.spark.sql.types.{StructType, IntegerType, StringType}

val productSchema = new StructType()
                        .add("productId", "int")
                        .add("productType", "string")
                        .add("company", "string")
                        .add("model", "string")
                        .add("price", "int")

val df = spark.read.format("csv")
            .option("header", "false")
            .schema(productSchema)
            .load("test.csv")

df.show()

結果是

+---------+-----------+-------+------+-----+
|productId|productType|company| model|price|
+---------+-----------+-------+------+-----+
|        1|         tv|   sony|    hd|  699|
|        2|         tv|   sony|   uhd|  799|
|        3|         tv|samsung|    hd|  599|
|        4|         tv|samsung|   uhd|  799|
|        5|      phone| iphone|     x|  999|
|        6|      phone| iphone|    11|  999|
|        7|      phone|samsung|    10|  899|
|        8|      phone|samsung|10note|  999|
|        9|      phone|  pixel|     4|  799|
|       10|      phone|  pixel|     3|  699|
+---------+-----------+-------+------+-----+

spark dataframe 正在從 csv 文件加載所有空值

問題描述

1 個解決方案

解決方案1
3 已采納 2019-10-27 06:36:58

spark dataframe 正在從 csv 文件加載所有空值

問題描述

1 個解決方案

解決方案1 3 已采納 2019-10-27 06:36:58

解決方案1
3 已采納 2019-10-27 06:36:58