Apache Spark: reading from CSV files and creating RDDs

I have two files: airports.csv and flights.csv. Airports have columns: IATA_CODE AIRPORT CITY STATE COUNTRY LATITUDE LONGITUDE. Flights have columns: YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY.

I read the files in:

val airports = sc.textFile("./archive/airports_b.csv")
val flights = sc.textFile("./archive/flights_b.csv")

I then created the RDDs, following instructions from various websites:

val airportRDD: RDD[(VertexId, (String))] = airports.map { line => 
  val row = line split ','
  (row(1).toLong, (row(2))) //1 IATA code, 2 - Airport name
}

val flightsRDD: RDD[Edge[String]] = flights.map { line =>
  val row = line split ','
  Edge(row(7).toLong, row(8).toLong, row(17)) // 7 - Origin airport, 8 - Destination airport, 17 - Distance
}

val graph = Graph(airportRDD, flightsRDD)

My next step is just to take the first three samples:

println("Airports: " + airportRDD.take(3))
println("Flights: "+ flightsRDD.take(3))

But I am getting the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 1 times, most recent failure: Lost task 0.0 in stage 25.0 (TID 58) (host.docker.internal executor driver): java.lang.NumberFormatException: For input string: "ABE"

Could someone advise what's wrong with the code?

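Indexing in Scala is zero-based: the first column is row(0), not row(1), and so on. It is also easier to load the CSV file with spark.read.option("header", true).csv(hdfs_path) than to parse it manually. If the file has no header row, you don't need option("header", true).

Note that the exception itself comes from calling .toLong on a non-numeric value: "ABE" is an IATA airport code. GraphX vertex IDs must be Long, so the airport codes have to be mapped to numeric IDs first rather than converted directly.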
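As a rough illustration of the points above, here is a minimal sketch that loads both files with the DataFrame CSV reader and builds the graph. It rests on several assumptions not confirmed by the question: both files have a header row using the column names listed in the question, a SparkSession can be created locally, and airport codes are short ASCII strings (at most 8 characters). The helper iataToVertexId is hypothetical, not part of Spark or GraphX; it is just one possible way to turn a code such as "ABE" into a Long vertex ID.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object AirportGraphSketch {

  // Hypothetical helper (not a Spark API): packs the characters of a short
  // ASCII code into a Long, one byte per character, so distinct codes of up
  // to 8 characters get distinct IDs.
  def iataToVertexId(code: String): VertexId =
    code.foldLeft(0L)((acc, ch) => acc * 256 + ch.toLong)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("airport-graph-sketch")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    // "header" (singular) tells the CSV reader to use the first line as column names.
    // Without inferSchema, every column is read as a String.
    val airportsDF = spark.read.option("header", "true").csv("./archive/airports_b.csv")
    val flightsDF  = spark.read.option("header", "true").csv("./archive/flights_b.csv")

    // Vertices: (numeric ID derived from the IATA code, airport name).
    val airportRDD: RDD[(VertexId, String)] = airportsDF
      .select("IATA_CODE", "AIRPORT")
      .na.drop() // skip rows with missing values
      .rdd
      .map(r => (iataToVertexId(r.getString(0)), r.getString(1)))

    // Edges: origin -> destination, with the distance kept as a String
    // to match the Edge[String] type used in the question.
    val flightsRDD: RDD[Edge[String]] = flightsDF
      .select("ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "DISTANCE")
      .na.drop()
      .rdd
      .map(r => Edge(iataToVertexId(r.getString(0)), iataToVertexId(r.getString(1)), r.getString(2)))

    val graph = Graph(airportRDD, flightsRDD)

    // take(3) returns an Array; mkString prints its elements
    // instead of the array's object reference.
    println("Airports: " + graph.vertices.take(3).mkString(", "))
    println("Flights: " + graph.edges.take(3).mkString(", "))

    spark.stop()
  }
}
```

Selecting columns by name avoids the off-by-one indexing problem entirely. If you need to get back from a vertex ID to the airport code, you could keep the code in the vertex attribute, for example as a (code, name) tuple.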

