
Spark 2.1.1 - unable to read JSON file from AWS S3

My Spark (v2.1.1) ETL application (everything runs on my laptop) reads from an AWS S3 bucket (I'm testing with only one JSON file), does some transformations, and loads the result back to another AWS S3 bucket. The problem is that it goes into an infinite loop reading just one single JSON file, which is 175 lines with one JSON record per line!

Here are the logs:

17/06/18 08:09:05 INFO S3AFileSystem: Opening 's3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json' for reading
17/06/18 08:09:05 INFO S3AFileSystem: Getting path status for s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json (481dba58a88d43ca67c2e4cc4933d7e6.json)
17/06/18 08:09:05 INFO S3AFileSystem: Reopening 481dba58a88d43ca67c2e4cc4933d7e6.json to seek to new offset 918
17/06/18 08:09:05 INFO S3AFileSystem: Actually opening file 481dba58a88d43ca67c2e4cc4933d7e6.json at pos 918
17/06/18 08:09:05 INFO Executor: Finished task 915.0 in stage 0.0 (TID 915). 1286 bytes result sent to driver
17/06/18 08:09:05 INFO TaskSetManager: Starting task 920.0 in stage 0.0 (TID 920, localhost, executor driver, partition 920, PROCESS_LOCAL, 6100 bytes)
17/06/18 08:09:05 INFO Executor: Running task 920.0 in stage 0.0 (TID 920)
17/06/18 08:09:05 INFO TaskSetManager: Finished task 915.0 in stage 0.0 (TID 915) in 125 ms on localhost (executor driver) (917/235798)
17/06/18 08:09:05 INFO HadoopRDD: Input split: s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json:920+1

This pattern keeps repeating, endlessly, and I eventually have to Ctrl-Z the job!

One observation: when I open the Spark UI, it says there are something like 235,798 tasks! I don't know why so many tasks need to run!

I tested this with the same JSON file placed on my local machine, and the job completes in less than a minute! So the question is: what is happening? I set my AWS credentials before running, so it is not an authentication issue. Any ideas? Thanks!

Update: The Spark UI says 235,798 tasks, which exactly matches the number of characters in the JSON file! But why is it reading the file character by character, and how do I get around it?
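One way to see where that task count comes from is to ask the filesystem what block size it reports for the object. A minimal diagnostic sketch, assuming a spark-shell session with the same S3A configuration (the object name is taken from the log above):

import org.apache.hadoop.fs.Path

// Inspect the block size the S3A client reports for the file. A reported
// block size of 0 makes FileInputFormat fall back to one-byte splits,
// i.e. one task per byte -- which would match the 235,798 tasks observed.
val p = new Path("s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json")
val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(fs.getFileStatus(p).getBlockSize)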

BTW, here's how I read the JSON file from AWS S3:

val df: DataFrame = spark.read.json("s3a://minitest/")
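For context, here is roughly what the whole pipeline looks like (the app name, credential wiring, and output bucket below are placeholders, not my real values; only the spark.read.json line is my actual code):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Everything runs locally on the laptop; credentials come from the standard
// AWS environment variables. (App name and output bucket are placeholders.)
val spark = SparkSession.builder()
  .appName("s3-json-etl")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// One JSON record per line, so the default line-delimited JSON reader applies.
val df: DataFrame = spark.read.json("s3a://minitest/")

// ... transformations ...

df.write.json("s3a://minitest-out/") // placeholder output bucket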

This is a bug (HADOOP-11584) in the S3A client that ships with Hadoop 2.6: getFileStatus() reports a block size of 0, so the input format computes one split per byte, which is why your task count matches the file's size in bytes (note the one-byte input split 481dba58a88d43ca67c2e4cc4933d7e6.json:920+1 in your log). You need to either upgrade to the Hadoop 2.7 binaries (best) or use s3n:// until you do.
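The interim workaround is just a scheme change on the read path. A minimal sketch, assuming the same session as above (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the s3n connector's own credential keys; the environment-variable wiring is an assumption):

// Read via s3n:// until the Hadoop 2.7 binaries are in place.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = spark.read.json("s3n://minitest/")

Once you move to the Hadoop 2.7 binaries (hadoop-aws 2.7.x plus its matching aws-java-sdk), the original s3a:// path should work unchanged.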
