
Reading data from csv returns null values

I am trying to read data from a CSV file using Scala and Spark, but the column values are null.

I read the data from the CSV file and also provided a schema so the data can be queried easily.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

private val myData = sparkSession.read.schema(createDataSchema).csv("data/myData.csv")

def createDataSchema: StructType = {
    val schema = StructType(
      Array(
        StructField("data_index", StringType, nullable = false),
        StructField("property_a", IntegerType, nullable = false),
        StructField("property_b", IntegerType, nullable = false)
        // some other columns
      )
    )

    schema
}

Querying the data:

val myProperty = myData.select($"property_b")
myProperty.collect()

I expect the data to be returned as a list of the selected values,

but instead it comes back as a list containing only null values. Why?

When I print the schema, nullable is set to true instead of false.

I am using Scala 2.12.9 and Spark 2.4.3.

Although the schema is provided with nullable = false, Spark overwrites it as nullable = true while loading data from a CSV file, so that null pointer exceptions can be avoided during the load.

Let us take an example: assume the CSV file has two rows, and the second row has an empty or null column value.

CSV:
a,1,2
b,,2

If nullable = false were kept, a null pointer exception would be thrown while loading the data as soon as an action is called on the data frame: there is an empty/null value to be loaded and no default value for it, so a null pointer exception is thrown. To avoid this, Spark overwrites the schema as nullable = true.
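Where the exception actually surfaces can be shown with a minimal sketch (assuming the two-row CSV above and the schema from the question; Row.getInt throws a NullPointerException when the cell it reads is null):

// Spark relaxes the provided schema to nullable = true during the read.
val df = spark.read.schema(schema).csv("data/myData.csv")

// Accessing the null cell as a primitive is where the NPE shows up:
val rows = df.collect()
rows(0).getInt(1) // fine: row "a" has property_a = 1
rows(1).getInt(1) // throws NullPointerException: the cell is null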

However, this can be handled by replacing all nulls with a default value and then re-applying the schema.

import org.apache.spark.sql.functions.{col, when}

val df = spark.read.schema(schema).csv("data/myData.csv")
// Replace nulls in property_a with a default value (0)
val dfWithDefault = df.withColumn("property_a", when(col("property_a").isNull, 0).otherwise(df.col("property_a")))
// Re-apply the schema so that nullable = false is restored
val dfNullableFalse = spark.sqlContext.createDataFrame(dfWithDefault.rdd, schema)
dfNullableFalse.show(10)

df.printSchema()
root
|-- data_index: string (nullable = true)
|-- property_a: integer (nullable = true)
|-- property_b: integer (nullable = true)

dfNullableFalse.printSchema()
root
|-- data_index: string (nullable = false)
|-- property_a: integer (nullable = false)
|-- property_b: integer (nullable = false)
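As a side note, the same null replacement can be written more compactly with the standard DataFrame.na.fill API (a sketch under the same assumptions as above; the fill values are illustrative defaults):

// Fill nulls in the listed columns with defaults in one call,
// then re-apply the strict schema as before.
val dfFilled = df.na.fill(Map("property_a" -> 0, "property_b" -> 0))
val dfStrict = spark.sqlContext.createDataFrame(dfFilled.rdd, schema)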
