
pyspark dataframe schema, not able to set nullable false for csv files

I am trying to load a csv file using pyspark. I am supplying my own schema with columns marked nullable false, but when I print the schema it still shows them as true. I checked the file data; there are no null entries in the columns that are nullable false.

Code

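from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Reuse the active session (already available as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

udemy_comments_file = '/Users/harbeerkadian/Documents/workspace/learn-spark/source_data/udemy/comments_spark.csv'
# Only "id" is declared non-nullable (third argument False)
schema = StructType([
    StructField("id", StringType(), False),
    StructField("course_id", StringType(), True),
    StructField("rate", DoubleType(), True),
    StructField("date", TimestampType(), True),
    StructField("display_name", StringType(), True),
    StructField("comment", StringType(), True),
    StructField("new_id", StringType(), True),
])
comments_df = spark.read.format('csv').option('header', 'true').schema(schema).load(udemy_comments_file)
comments_df.printSchema()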
# Count the rows where id IS null
print("null record count for id", comments_df.filter(comments_df.id.isNull()).count())

Output

root
 |-- id: string (nullable = true)
 |-- course_id: string (nullable = true)
 |-- rate: double (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- display_name: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- new_id: string (nullable = true)

null record count for id 0

Ideally the nullable property of the id column should be false, as there are zero null records.

Can you try breaking the statement up as below, assigning the reader to a new variable before applying the schema and loading the data:

csv_reader = spark.read.format('csv').option('header', 'true')
comments_df = csv_reader.schema(schema).load(udemy_comments_file)
comments_df.printSchema()
