I'm trying to read in flight data from the Department of Transportation. It is stored in a CSV, and keep getting java.lang.NumberFormatException: null
I have tried setting the nanValue
to the empty string, as it's default value is NaN
, but this hasn't worked.
My current code is:
spark = SparkSession.builder \
.master('local') \
.appName('Flight Delay') \
.getOrCreate()
schema = StructType([
StructField('Year', IntegerType(), nullable=True),
StructField('Month', IntegerType(), nullable=True),
StructField('Day', IntegerType(), nullable=True),
StructField('Dow', IntegerType(), nullable=True),
StructField('CarrierId', StringType(), nullable=True),
StructField('Carrier', StringType(), nullable=True),
StructField('TailNum', StringType(), nullable=True),
StructField('Origin', StringType(), nullable=True),
StructField('Dest', StringType(), nullable=True),
StructField('CRSDepTime', IntegerType(), nullable=True),
StructField('DepTime', IntegerType(), nullable=True),
StructField('DepDelay', DoubleType(), nullable=True),
StructField('TaxiOut', DoubleType(), nullable=True),
StructField('TaxiIn', DoubleType(), nullable=True),
StructField('CRSArrTime', IntegerType(), nullable=True),
StructField('ArrTime', IntegerType(), nullable=True),
StructField('ArrDelay', DoubleType(), nullable=True),
StructField('Cancelled', DoubleType(), nullable=True),
StructField('CancellationCode', StringType(), nullable=True),
StructField('Diverted', DoubleType(), nullable=True),
StructField('CRSElapsedTime', DoubleType(), nullable=True),
StructField('ActualElapsedTime', DoubleType(), nullable=True),
StructField('AirTime', DoubleType(), nullable=True),
StructField('Distance', DoubleType(), nullable=True),
StructField('CarrierDelay', DoubleType(), nullable=True),
StructField('WeatherDelay', DoubleType(), nullable=True),
StructField('NASDelay', DoubleType(), nullable=True),
StructField('SecurityDelay', DoubleType(), nullable=True),
StructField('LateAircraftDelay', DoubleType(), nullable=True)
])
flts = spark.read \
.format('com.databricks.spark.csv') \
.csv('/home/william/Projects/flight-delay/data/201601.csv',
schema=schema, nanValue='', header='true')
Here is the CSV I'm working with: http://pastebin.com/waahrgqB
The last row there is where it breaks and raises the java.lang.NumberFormatException: null
It seems that some numeric columns are empty strings, while others are just empty. Can someone please help me with this?
Thanks to KiranM's suggestion, I was able to find a solution. I let Spark infer the schema(everything is set as a String), and then manually set the columns I want to be numeric.
Here is the code:
from pyspark.sql import (SQLContext,
SparkSession)
from pyspark.sql.types import (StructType,
StructField,
DoubleType,
IntegerType,
StringType)
spark = SparkSession.builder \
.master('local') \
.appName('Flight Delay') \
.getOrCreate()
flts = spark.read \
.format('com.databricks.spark.csv') \
.csv('/home/william/Projects/flight-delay/data/merged/2016.csv',
inferSchema='true', nanValue="", header='true', mode='PERMISSIVE')
flts = flts \
.withColumn('Year', flts['Year'].cast('int')) \
.withColumn('Month', flts['Month'].cast('int')) \
.withColumn('Day', flts['Day'].cast('int')) \
.withColumn('Dow', flts['Dow'].cast('int')) \
.withColumn('CRSDepTime', flts['CRSDepTime'].cast('int')) \
.withColumn('DepTime', flts['DepTime'].cast('int')) \
.withColumn('DepDelay', flts['DepDelay'].cast('int')) \
.withColumn('TaxiOut', flts['TaxiOut'].cast('int')) \
.withColumn('TaxiIn', flts['TaxiIn'].cast('int')) \
.withColumn('CRSArrTime', flts['CRSArrTime'].cast('int')) \
.withColumn('ArrTime', flts['ArrTime'].cast('int')) \
.withColumn('ArrDelay', flts['ArrDelay'].cast('int')) \
.withColumn('Cancelled', flts['Cancelled'].cast('int')) \
.withColumn('Diverted', flts['Diverted'].cast('int')) \
.withColumn('CRSElapsedTime', flts['CRSElapsedTime'].cast('int')) \
.withColumn('ActualElapsedTime', flts['ActualElapsedTime'].cast('int')) \
.withColumn('AirTime', flts['AirTime'].cast('int')) \
.withColumn('Distance', flts['Distance'].cast('int')) \
.withColumn('CarrierDelay', flts['CarrierDelay'].cast('int')) \
.withColumn('WeatherDelay', flts['WeatherDelay'].cast('int')) \
.withColumn('NASDelay', flts['NASDelay'].cast('int')) \
.withColumn('SecurityDelay', flts['SecurityDelay'].cast('int')) \
.withColumn('LateAircraftDelay ', flts['LateAircraftDelay '].cast('int'))
Maybe I could put that into a loop, but I'm going to run with this for now.
The issue is with the numeric type column having an empty string (with "" instead of blank data).
Then one option is to read the data as StringType column, then convert that column type to your relevant type (ex: int). So that it wouldn't impact other column data.
StructField('CRSDepTime', StringType(), nullable=True),
flts.withColumn('CRSDepTime', flts['CRSDepTime'].cast("int")) \
.printSchema()
This should solve your problem.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.