spark.read.json() taking extremely long to load data
What I've Tried
I have JSON data coming from an API, and I saved all of it into a single directory. Now I am trying to load this data into a Spark DataFrame so I can do ETL on it. The API returned the data fragmented (one file per division), and some divisions had little data while others had a lot. The directory I'm trying to load data from looks as follows:
* json_file_1 - 109MB
* json_file_2 - 2.2MB
* json_file_3 - 67MB
* json_file_4 - 105MB
* json_file_5 - **2GB**
* json_file_6 - 15MB
* json_file_7 - 265MB
* json_file_8 - 35MB
* json_file_9 - **500KB**
* json_file_10 - 383MB
I'm using Azure Synapse with an Apache Spark pool, and the data directory I'm loading from resides in an ADLS Gen2 data lake. I'm using the following code to load all the data files in that directory. For other projects this code works fine and fast.
blob_path_raw = 'abfss://blob_container@my_data_lake.dfs.core.windows.net'
df = spark.read.json(path=f"{blob_path_raw}/path_to_directory_described_above")
My Question
The code above is taking extremely long to run (at the time of writing, already more than 3 hours), and I suspect it got stuck somewhere, since loading roughly 4 GB of data is something a Spark pool should be able to do easily. I suspect something is going wrong in Spark because of the heterogeneous sizes of the data files, but I am still rather new to Spark, as we only just migrated to Azure Synapse. What is going wrong here, and how do I debug it?
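For what it's worth, the only debugging idea I have so far is to time a read of each file separately to see which one stalls. A rough sketch, reusing blob_path_raw from above; the file names in the list are placeholders:

import time

for name in ["json_file_1", "json_file_2", "json_file_5"]:  # placeholder file names
    start = time.time()
    df_part = spark.read.json(f"{blob_path_raw}/path_to_directory_described_above/{name}")
    df_part.count()  # force a full scan; read.json itself only runs schema inference
    print(name, round(time.time() - start, 1), "seconds")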
I found the problem. Nine of the ten files were in JSON Lines format, so every line was a separate, valid JSON object. Example below:
{"name": "Gilbert", "wins": [["straight", "7♣"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
The big file, however (which I obtained in a different way than the other nine files), was in regular JSON format, where every JSON object is an element of a single array and spans multiple lines:
[
    {
        "name": "Gilbert",
        "wins": [
            [
                "straight",
                "7♣"
            ]
        ]
    },
    {
        "name": "Alexa",
        "wins": [
            [
                "two pair",
                "4♠"
            ],
            [
                "two pair",
                "9♠"
            ]
        ]
    }
]
As per the Spark documentation (https://spark.apache.org/docs/latest/sql-data-sources-json.html), one needs to set the multiLine option to true to read multiline JSON. When this option is not set, Spark expects each line of the JSON file to contain a separate, self-contained, valid JSON object. Therefore, mixing multiline JSON and JSON Lines files in a single read is a bad idea/impossible.
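For reference, a minimal sketch of how the big multiline file could have been read on its own (the path and file name are just the ones from my listing above):

df_big = spark.read.option("multiLine", True).json(f"{blob_path_raw}/path_to_directory_described_above/json_file_5")

But since the option applies to the whole read, it cannot be mixed with the nine JSON Lines files in one spark.read.json call over the directory.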
TLDR Solution: Converting the big file from multiline JSON to JSON Lines fixed the problem, and the DataFrame now loads fine. I used the following code to do this:
import json

# Load the multiline JSON file into a Python list of objects.
with open("multilines_json_file.json", "r") as f:
    python_obj = json.load(f)

def dump_jsonl(data, output_path, append=False):
    """Write a list of objects to a JSON Lines file, one object per line."""
    mode = 'a+' if append else 'w'
    with open(output_path, mode, encoding='utf-8') as f:
        for line in data:
            json_record = json.dumps(line, ensure_ascii=False)
            f.write(json_record + '\n')
    print('Wrote {} records to {}'.format(len(data), output_path))

# Overwrite the original file with its JSON Lines equivalent
# (appending instead would leave the old multiline content in place and corrupt the file).
dump_jsonl(python_obj, 'multilines_json_file.json', append=False)
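Alternatively, a sketch (not what I actually ran) of doing the same conversion with Spark itself, since df.write.json writes line-delimited JSON by default; the output path is a placeholder:

df_big = spark.read.option("multiLine", True).json(f"{blob_path_raw}/path_to_directory_described_above/json_file_5")
df_big.write.json(f"{blob_path_raw}/path_to_jsonl_output")  # placeholder output path

This would avoid loading the 2 GB file into memory on a single machine the way json.load does.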