AWS EMR notebook Spark kernel infinitely loads small JSON file

I am trying to load a JSON file in an EMR notebook with a Spark kernel. I am using a very large, proven EMR cluster that I have worked with before, so cluster size/computation power is not the issue. The simple code below is enough to reproduce my issue:

val df = spark.read.json("s3a://src/main/resources/zipcodes.json")
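For context, a minimal sketch of how the load can be sanity-checked once the read returns, assuming the predefined spark session that the EMR notebook kernel provides; the printSchema, count, and show calls are standard Spark Dataset operations added here for illustration, not part of the original snippet:

// Sketch: same read as above, followed by actions that force the job to run.
val df = spark.read.json("s3a://src/main/resources/zipcodes.json")
df.printSchema()    // prints the schema Spark inferred while reading the JSON
println(df.count()) // forces a full scan of the file; a stuck job would hang here
df.show(5)          // displays the first 5 rows once the load succeeds

If the job is healthy, a file this small should complete these actions in seconds.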

Here is the JSON file I am trying to load. It is extremely small: https://raw.githubusercontent.com/spark-examples/spark-scala-examples/71d2db89ffb24db6f01eb1fa12286bfbb37c44c4/src/main/resources/zipcodes.json

I let it run for 1 hour. In the bottom left corner, it says Spark | Busy, and the circle in the top right is full, indicating that the kernel is working. However, the Spark Job Progress shows a Task Progress bar that never advances. Any advice?

The problem was not the JSON file. In an attempt to fix this issue, I simply cloned my problematic EMR cluster with the exact same steps/configuration, attached my EMR notebook to the clone, and re-ran the exact same code on the exact same file. It worked nearly instantaneously. The problem was with the original cluster, although I do not know what the exact problem was.
