
Spark 2.1.1 - unable to read JSON file from AWS S3

My Spark (v2.1.1) ETL application (everything runs on my laptop) reads from an AWS S3 bucket (I'm testing with only one JSON file), does some transformations, and loads the result back to another AWS S3 bucket. The problem is that it goes into an infinite loop reading just one single JSON file, which is 175 lines with one JSON record per line!

Here are the logs:

17/06/18 08:09:05 INFO S3AFileSystem: Opening 's3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json' for reading
17/06/18 08:09:05 INFO S3AFileSystem: Getting path status for s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json (481dba58a88d43ca67c2e4cc4933d7e6.json)
17/06/18 08:09:05 INFO S3AFileSystem: Reopening 481dba58a88d43ca67c2e4cc4933d7e6.json to seek to new offset 918
17/06/18 08:09:05 INFO S3AFileSystem: Actually opening file 481dba58a88d43ca67c2e4cc4933d7e6.json at pos 918
17/06/18 08:09:05 INFO Executor: Finished task 915.0 in stage 0.0 (TID 915). 1286 bytes result sent to driver
17/06/18 08:09:05 INFO TaskSetManager: Starting task 920.0 in stage 0.0 (TID 920, localhost, executor driver, partition 920, PROCESS_LOCAL, 6100 bytes)
17/06/18 08:09:05 INFO Executor: Running task 920.0 in stage 0.0 (TID 920)
17/06/18 08:09:05 INFO TaskSetManager: Finished task 915.0 in stage 0.0 (TID 915) in 125 ms on localhost (executor driver) (917/235798)
17/06/18 08:09:05 INFO HadoopRDD: Input split: s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json:920+1

This pattern keeps repeating, endlessly, and I eventually have to Ctrl-Z the job!

One observation: when I open the Spark UI, it says there are something like 235,798 tasks! I don't know why so many tasks need to run!

I tested this with the same JSON file placed on my local machine, and the job completes in less than a minute! So the question is: what is happening? I set my AWS credentials before running, so it is not an authentication issue. Any ideas? Thanks!

Update: The Spark UI says 235,798 tasks, which exactly matches the number of characters in the JSON file! But why is it reading the file character by character, and how do I get around it?
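One way to see where that task count comes from is to ask the filesystem what block size it reports for the object. A minimal diagnostic sketch, assuming a spark-shell session with the same S3A configuration (the object name is taken from the log above):

import org.apache.hadoop.fs.Path

// Inspect the block size the S3A client reports for the file. A reported
// block size of 0 makes FileInputFormat fall back to one-byte splits,
// i.e. one task per byte -- which would match the 235,798 tasks observed.
val p = new Path("s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json")
val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(fs.getFileStatus(p).getBlockSize)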

BTW, here's how I read the JSON file from AWS S3:

val df: DataFrame = spark.read.json("s3a://minitest/")
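For context, here is roughly what the whole pipeline looks like (the app name, credential wiring, and output bucket below are placeholders, not my real values; only the spark.read.json line is my actual code):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Everything runs locally on the laptop; credentials come from the standard
// AWS environment variables. (App name and output bucket are placeholders.)
val spark = SparkSession.builder()
  .appName("s3-json-etl")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// One JSON record per line, so the default line-delimited JSON reader applies.
val df: DataFrame = spark.read.json("s3a://minitest/")

// ... transformations ...

df.write.json("s3a://minitest-out/") // placeholder output bucket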

This is a bug (HADOOP-11584) in the S3A client that ships with Hadoop 2.6: getFileStatus() reports a block size of 0, so the input format computes one split per byte, which is why your task count matches the file's size in bytes (note the one-byte input split 481dba58a88d43ca67c2e4cc4933d7e6.json:920+1 in your log). You need to either upgrade to the Hadoop 2.7 binaries (best) or use s3n:// until you do.
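The interim workaround is just a scheme change on the read path. A minimal sketch, assuming the same session as above (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the s3n connector's own credential keys; the environment-variable wiring is an assumption):

// Read via s3n:// until the Hadoop 2.7 binaries are in place.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = spark.read.json("s3n://minitest/")

Once you move to the Hadoop 2.7 binaries (hadoop-aws 2.7.x plus its matching aws-java-sdk), the original s3a:// path should work unchanged.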
