
How to handle NullType in Spark Dataframe using Python?

I'm trying to load data from MapR DB into a Spark DataFrame and then export the DataFrame to CSV files. But I'm getting this error:

"com.mapr.db.spark.exceptions.SchemaMappingException: Failed to parse a value for data type NullType (current token: STRING)"

I tried a couple of ways of casting the columns to StringType. This is one of them:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.loadFromMapRDB(db_table).select(
    F.col('c_002.v_22').cast(T.StringType()).alias('aaa'),
    F.col('c_002.v_23').cast(T.StringType()).alias('bbb')
)

df.printSchema()

Output of printSchema():

root
 |-- aaa: string (nullable = true)
 |-- bbb: string (nullable = true)

Values in columns 'aaa' and 'bbb' can be null. Then I try to export the DataFrame to CSV files:

df = df.repartition(10)
df.write.csv(csvFile, compression='gzip', mode='overwrite', sep=',', header='true', quoteAll='true')

I was getting a similar issue with a MapR-DB JSON table and was able to resolve it by defining the table schema when loading into a DataFrame (the connector appears to infer NullType for fields that are null in the documents it samples, which then clashes with string values it encounters later):

from pyspark.sql.types import StructType, StructField, StringType

tableSchema = StructType([
    StructField("c_002.v_22", StringType(), True),  # True marks the field as nullable: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=structfield#pyspark.sql.types.StructField
    StructField("c_002.v_23", StringType(), True),
])

df = spark.loadFromMapRDB(db_table, tableSchema).select(
    F.col('c_002.v_22').alias('aaa'),
    F.col('c_002.v_23').alias('bbb')
)
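
If you don't want to enumerate every field up front, and the load itself succeeds (the error in the question is thrown by the connector at read time, which is why supplying a schema fixes it there), a more generic pattern is to walk the inferred schema and cast any NullType columns to string before writing. A minimal sketch, assuming a flat schema and the same F/T import aliases as above:

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Cast every top-level column that Spark inferred as NullType to StringType
# so the CSV writer can serialize it. Nested struct fields would need to be
# handled separately.
for field in df.schema.fields:
    if isinstance(field.dataType, T.NullType):
        df = df.withColumn(field.name, F.col(field.name).cast(T.StringType()))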

Another thing you could try is simply filling the null values with something: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna

df = df.na.fill('null')
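
Note that filling with the string 'null' writes that literal text into the CSV; na.fill also accepts a dict if you'd rather fill specific columns with different values. A small sketch using the 'aaa' and 'bbb' aliases from the question:

# Fill nulls per column; empty strings may read better in the CSV output
df = df.na.fill({'aaa': '', 'bbb': ''})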
