spark.read.json() taking extremely long to load data
What I've Tried
I have JSON data coming from an API, and I saved all of it into a single directory. Now I am trying to load this data into a Spark DataFrame so I can do ETL on it. The API returned the data fragmented (one file per division), and some divisions had little data while others had a lot. The directory I'm trying to load data from looks as follows:
* json_file_1 - 109MB
* json_file_2 - 2.2MB
* json_file_3 - 67MB
* json_file_4 - 105MB
* json_file_5 - **2GB**
* json_file_6 - 15MB
* json_file_7 - 265MB
* json_file_8 - 35MB
* json_file_9 - **500KB**
* json_file_10 - 383MB
I'm using Azure Synapse with an Apache Spark pool, and the data directory I'm loading from resides in an ADLS Gen2 data lake. I'm using the following code to load all the data files in that directory. For other projects this code works fine and fast.
blob_path_raw = 'abfss://blob_container@my_data_lake.dfs.core.windows.net'
df = spark.read.json(path=f"{blob_path_raw}/path_to_directory_described_above")
My Question
The code above is taking extremely long to run (at the time of writing, already more than 3 hours), and I suspect it got stuck somewhere, since loading roughly 4 GB of data is something a Spark pool should be able to do easily. I suspect something is going wrong in Spark because of the heterogeneous sizes of the data files, but I am still rather new to Spark, as we only just migrated to Azure Synapse. What is going wrong here, and how do I debug it?
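For what it's worth, the only debugging idea I have so far is to time a read of each file separately to see which one stalls. A rough sketch, reusing blob_path_raw from above; the file names in the list are placeholders:

import time

for name in ["json_file_1", "json_file_2", "json_file_5"]:  # placeholder file names
    start = time.time()
    df_part = spark.read.json(f"{blob_path_raw}/path_to_directory_described_above/{name}")
    df_part.count()  # force a full scan; read.json itself only runs schema inference
    print(name, round(time.time() - start, 1), "seconds")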
I found the problem. Nine of the ten files were in JSON Lines format, so every line was a separate, valid JSON object. Example below:
{"name": "Gilbert", "wins": [["straight", "7♣"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
The big file, however (which I obtained in a different way than the other nine files), was in regular JSON format, where every JSON object is an element of a single array and spans multiple lines:
[
    {
        "name": "Gilbert",
        "wins": [
            [
                "straight",
                "7♣"
            ]
        ]
    },
    {
        "name": "Alexa",
        "wins": [
            [
                "two pair",
                "4♠"
            ],
            [
                "two pair",
                "9♠"
            ]
        ]
    }
]
As per the Spark documentation (https://spark.apache.org/docs/latest/sql-data-sources-json.html), one needs to set the multiLine option to true to read multiline JSON. When this option is not set, Spark expects each line of the JSON file to contain a separate, self-contained, valid JSON object. Therefore, mixing multiline JSON and JSON Lines files in a single read is a bad idea/impossible.
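For reference, a minimal sketch of how the big multiline file could have been read on its own (the path and file name are just the ones from my listing above):

df_big = spark.read.option("multiLine", True).json(f"{blob_path_raw}/path_to_directory_described_above/json_file_5")

But since the option applies to the whole read, it cannot be mixed with the nine JSON Lines files in one spark.read.json call over the directory.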
TLDR Solution: Converting the big file from multiline JSON to JSON Lines fixed the problem, and the DataFrame now loads fine. I used the following code to do this:
import json

# Load the multiline JSON file into a Python list of objects.
with open("multilines_json_file.json", "r") as f:
    python_obj = json.load(f)

def dump_jsonl(data, output_path, append=False):
    """Write a list of objects to a JSON Lines file, one object per line."""
    mode = 'a+' if append else 'w'
    with open(output_path, mode, encoding='utf-8') as f:
        for line in data:
            json_record = json.dumps(line, ensure_ascii=False)
            f.write(json_record + '\n')
    print('Wrote {} records to {}'.format(len(data), output_path))

# Overwrite the original file with its JSON Lines equivalent
# (appending instead would leave the old multiline content in place and corrupt the file).
dump_jsonl(python_obj, 'multilines_json_file.json', append=False)
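Alternatively, a sketch (not what I actually ran) of doing the same conversion with Spark itself, since df.write.json writes line-delimited JSON by default; the output path is a placeholder:

df_big = spark.read.option("multiLine", True).json(f"{blob_path_raw}/path_to_directory_described_above/json_file_5")
df_big.write.json(f"{blob_path_raw}/path_to_jsonl_output")  # placeholder output path

This would avoid loading the 2 GB file into memory on a single machine the way json.load does.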