
Reading data from csv returns null values

I am trying to read data from a CSV file using Scala and Spark, but the column values are null.

I read the data from the CSV file and also provided a schema so the data can be queried easily.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

private val myData = sparkSession.read.schema(createDataSchema).csv("data/myData.csv")

def createDataSchema: StructType = {
    val schema = StructType(
      Array(
        StructField("data_index", StringType, nullable = false),
        StructField("property_a", IntegerType, nullable = false),
        StructField("property_b", IntegerType, nullable = false)
        // some other columns
      )
    )

    schema
}

Querying the data:

val myProperty = myData.select($"property_b")
myProperty.collect()

I expect the data to be returned as a list of the selected values,

but instead it comes back as a list containing only null values. Why?

When I print the schema, nullable is set to true instead of false.

I am using Scala 2.12.9 and Spark 2.4.3.

Although the schema is provided with nullable = false, Spark overwrites it as nullable = true while loading data from a CSV file, so that null pointer exceptions can be avoided during the load.

Let us take an example: assume the CSV file has two rows, and the second row has an empty or null column value.

CSV:
a,1,2
b,,2

If nullable = false were kept, a null pointer exception would be thrown while loading the data as soon as an action is called on the data frame: there is an empty/null value to be loaded and no default value for it, so a null pointer exception is thrown. To avoid this, Spark overwrites the schema as nullable = true.
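Where the exception actually surfaces can be shown with a minimal sketch (assuming the two-row CSV above and the schema from the question; Row.getInt throws a NullPointerException when the cell it reads is null):

// Spark relaxes the provided schema to nullable = true during the read.
val df = spark.read.schema(schema).csv("data/myData.csv")

// Accessing the null cell as a primitive is where the NPE shows up:
val rows = df.collect()
rows(0).getInt(1) // fine: row "a" has property_a = 1
rows(1).getInt(1) // throws NullPointerException: the cell is null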

However, this can be handled by replacing all nulls with a default value and then re-applying the schema.

import org.apache.spark.sql.functions.{col, when}

val df = spark.read.schema(schema).csv("data/myData.csv")
// Replace nulls in property_a with a default value (0)
val dfWithDefault = df.withColumn("property_a", when(col("property_a").isNull, 0).otherwise(df.col("property_a")))
// Re-apply the schema so that nullable = false is restored
val dfNullableFalse = spark.sqlContext.createDataFrame(dfWithDefault.rdd, schema)
dfNullableFalse.show(10)

df.printSchema()
root
|-- data_index: string (nullable = true)
|-- property_a: integer (nullable = true)
|-- property_b: integer (nullable = true)

dfNullableFalse.printSchema()
root
|-- data_index: string (nullable = false)
|-- property_a: integer (nullable = false)
|-- property_b: integer (nullable = false)
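As a side note, the same null replacement can be written more compactly with the standard DataFrame.na.fill API (a sketch under the same assumptions as above; the fill values are illustrative defaults):

// Fill nulls in the listed columns with defaults in one call,
// then re-apply the strict schema as before.
val dfFilled = df.na.fill(Map("property_a" -> 0, "property_b" -> 0))
val dfStrict = spark.sqlContext.createDataFrame(dfFilled.rdd, schema)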
