
How to handle NullType in Spark Dataframe using Python?

I'm trying to load data from MapR DB into a Spark DataFrame and then export the DataFrame to CSV files. But I'm getting this error:

"com.mapr.db.spark.exceptions.SchemaMappingException: Failed to parse a value for data type NullType (current token: STRING)"

I tried a couple of ways of casting the columns to StringType. This is one of them:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.loadFromMapRDB(db_table).select(
    F.col('c_002.v_22').cast(T.StringType()).alias('aaa'),
    F.col('c_002.v_23').cast(T.StringType()).alias('bbb')
)

df.printSchema()

Output of printSchema():

root
 |-- aaa: string (nullable = true)
 |-- bbb: string (nullable = true)

Values in columns 'aaa' and 'bbb' can be null. Then I try to export the DataFrame to CSV files:

df = df.repartition(10)
df.write.csv(csvFile, compression='gzip', mode='overwrite', sep=',', header='true', quoteAll='true')

I was getting a similar issue with a MapR-DB JSON table and was able to resolve it by defining the table schema when loading into a DataFrame (the connector appears to infer NullType for fields that are null in the documents it samples, which then clashes with string values it encounters later):

from pyspark.sql.types import StructType, StructField, StringType

tableSchema = StructType([
    StructField("c_002.v_22", StringType(), True),  # True marks the field as nullable: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=structfield#pyspark.sql.types.StructField
    StructField("c_002.v_23", StringType(), True),
])

df = spark.loadFromMapRDB(db_table, tableSchema).select(
    F.col('c_002.v_22').alias('aaa'),
    F.col('c_002.v_23').alias('bbb')
)
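
If you don't want to enumerate every field up front, and the load itself succeeds (the error in the question is thrown by the connector at read time, which is why supplying a schema fixes it there), a more generic pattern is to walk the inferred schema and cast any NullType columns to string before writing. A minimal sketch, assuming a flat schema and the same F/T import aliases as above:

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Cast every top-level column that Spark inferred as NullType to StringType
# so the CSV writer can serialize it. Nested struct fields would need to be
# handled separately.
for field in df.schema.fields:
    if isinstance(field.dataType, T.NullType):
        df = df.withColumn(field.name, F.col(field.name).cast(T.StringType()))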

Another thing you could try is simply filling the null values with something: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna

df = df.na.fill('null')
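
Note that filling with the string 'null' writes that literal text into the CSV; na.fill also accepts a dict if you'd rather fill specific columns with different values. A small sketch using the 'aaa' and 'bbb' aliases from the question:

# Fill nulls per column; empty strings may read better in the CSV output
df = df.na.fill({'aaa': '', 'bbb': ''})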
