
Spark 2.1.1 - unable to read JSON file from AWS S3

My Spark (v2.1.1) ETL application (everything runs on my laptop) reads from an AWS S3 bucket (testing with only one JSON file), does some transformations, and loads the result back into another AWS S3 bucket. The problem is that it goes into an infinite loop reading just one single JSON file, which is 175 lines long with one JSON record per line!

Here are the logs:

17/06/18 08:09:05 INFO S3AFileSystem: Opening 's3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json' for reading
17/06/18 08:09:05 INFO S3AFileSystem: Getting path status for s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json (481dba58a88d43ca67c2e4cc4933d7e6.json)
17/06/18 08:09:05 INFO S3AFileSystem: Reopening 481dba58a88d43ca67c2e4cc4933d7e6.json to seek to new offset 918
17/06/18 08:09:05 INFO S3AFileSystem: Actually opening file 481dba58a88d43ca67c2e4cc4933d7e6.json at pos 918
17/06/18 08:09:05 INFO Executor: Finished task 915.0 in stage 0.0 (TID 915). 1286 bytes result sent to driver
17/06/18 08:09:05 INFO TaskSetManager: Starting task 920.0 in stage 0.0 (TID 920, localhost, executor driver, partition 920, PROCESS_LOCAL, 6100 bytes)
17/06/18 08:09:05 INFO Executor: Running task 920.0 in stage 0.0 (TID 920)
17/06/18 08:09:05 INFO TaskSetManager: Finished task 915.0 in stage 0.0 (TID 915) in 125 ms on localhost (executor driver) (917/235798)
17/06/18 08:09:05 INFO HadoopRDD: Input split: s3a://minitest/481dba58a88d43ca67c2e4cc4933d7e6.json:920+1

This pattern keeps repeating, endlessly, and I eventually have to Ctrl-Z it!

One observation: when I open the Spark GUI, it says there are 235,798 tasks! I don't know why so many tasks need to be run.

I tested this with the JSON file deployed on my local machine, and it completes in less than a minute! So the question is: what is happening? I set my AWS credentials before running, so it is not an authentication issue. Any ideas? Thanks!

Update: the Spark GUI says 235,798 tasks, which matches the number of characters in the JSON file! But why is it reading character by character? And what is the way around it?

BTW, here's how I read the JSON file from AWS S3:

val df: DataFrame = spark.read.json("s3a://minitest/")

This is a bug (HADOOP-11584) in the S3A client shipped with Hadoop 2.6: getFileStatus() reports a block size of 0, so the input format splits the file into one split per byte, which is exactly why you see one task per character (235,798 of them). You need to either upgrade to the Hadoop 2.7 binaries (best) or use s3n:// until you do.
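A minimal sketch of the s3n:// workaround (not from the original post: the app name is made up, the bucket name minitest is reused from the question, and the credential keys are the standard Hadoop s3n properties; it assumes the s3n filesystem classes are on the classpath, as in stock Spark-with-Hadoop-2.6 builds):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("s3-json-etl")
  .master("local[*]")   // everything runs on the laptop, as in the question
  .getOrCreate()

// s3n uses its own credential properties, distinct from the s3a ones.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// s3n reports a real block size, so the 175-line file is read in a
// handful of tasks instead of one task per byte.
val df: DataFrame = spark.read.json("s3n://minitest/")
df.show(5)

Alternatively, if you move to Spark binaries built against Hadoop 2.7 (or pull in org.apache.hadoop:hadoop-aws:2.7.x), the original s3a:// path works as-is, since the block-size fix landed in Hadoop 2.7.0.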
