spark.read.json() taking extremely long to load data

What I've Tried

I have JSON data which comes from an API. I saved all the data into a single directory. Now I am trying to load this data into a Spark dataframe so I can run ETL on it. The API returned the data fragmented (per division), and some divisions had very little data while others had a lot. The directory I'm trying to load from looks as follows:

* json_file_1 - 109MB
* json_file_2 - 2.2MB
* json_file_3 - 67MB
* json_file_4 - 105MB
* json_file_5 - **2GB**
* json_file_6 - 15MB
* json_file_7 - 265MB
* json_file_8 - 35MB
* json_file_9 - **500KB**
* json_file_10 - 383MB

I'm using Azure Synapse with an Apache Spark pool, and the directory I'm loading from resides in an ADLS Gen2 data lake. I'm using the following code to load all the data files in the directory. For other projects this code works fine and fast.

blob_path_raw = 'abfss://blob_container@my_data_lake.dfs.core.windows.net'
df = spark.read.json(path=f"{blob_path_raw}/path_to_directory_described_above")

My Question

The code above is taking extremely long to run (more than 3 hours at the time of writing), and I suspect it is stuck somewhere, since loading roughly 4 GB of data is something a Spark pool should handle easily. I suspect something is going wrong in Spark because of the heterogeneous file sizes, but I am still rather new to Spark as we only just migrated to Azure Synapse. What is going wrong here, and how do I debug it?
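
One way to narrow a problem like this down (not part of the original post) is to time a read of each file separately; the file whose read stalls or takes disproportionately long is the likely culprit. This is only a sketch, and the file names below are assumed to match the listing above:

import time

# Read each file on its own and time it, to find the one that stalls.
for i in range(1, 11):
    path = f"{blob_path_raw}/path_to_directory_described_above/json_file_{i}"
    start = time.time()
    row_count = spark.read.json(path).count()
    print(path, row_count, f"{time.time() - start:.1f}s")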

I found the problem. 9 out of the 10 files were in JSON Lines format, so every line was a valid JSON object. Example below:

{"name": "Gilbert", "wins": [["straight", "7♣"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

The big file (which I obtained through a different route than the other 9 files), however, was a regular JSON file, where the whole file is one array, every JSON object is an array element, and every JSON object spans multiple lines:

[
  {
    "name": "Gilbert",
    "wins": [
      [
        "straight",
        "7♣"
      ]
    ]
  },
  {
    "name": "Alexa",
    "wins": [
      [
        "two pair",
        "4♠"
      ],
      [
        "two pair",
        "9♠"
      ]
    ]
  }
]

As per the Spark documentation (https://spark.apache.org/docs/latest/sql-data-sources-json.html), you need to set the multiLine option to true to read multiline JSON. Without this option, Spark expects each line in your JSON file to contain a separate, self-contained, valid JSON object. Therefore, mixing multiline JSON and JSON Lines files in a single read is a bad idea / impossible.
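
For illustration, a minimal sketch of the two read modes in PySpark (the file names and paths below are just taken from the listing above and are assumptions):

# JSON Lines file: default mode, one self-contained JSON object per line.
df_jsonl = spark.read.json(f"{blob_path_raw}/path_to_directory_described_above/json_file_1")

# Multiline JSON file (one array spanning many lines): needs multiLine=True.
df_multi = spark.read.json(
    f"{blob_path_raw}/path_to_directory_described_above/json_file_5",
    multiLine=True,
)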

TL;DR solution: Changing the big file from multiline JSON to JSON Lines fixed the problem, and the dataframe now loads fine. I used the following code to do this:

import json

# Load the whole multiline JSON file (one big array) into a Python list.
with open("multilines_json_file.json", "r") as f:
    python_obj = json.load(f)


def dump_jsonl(data, output_path, append=False):
    """
    Write a list of objects to a JSON Lines file (one JSON object per line).
    """
    mode = 'a+' if append else 'w'
    with open(output_path, mode, encoding='utf-8') as f:
        for line in data:
            json_record = json.dumps(line, ensure_ascii=False)
            f.write(json_record + '\n')
    print('Wrote {} records to {}'.format(len(data), output_path))


# Overwrite the file with the JSON Lines version (append=True would leave the
# original multiline JSON at the top of the file and keep it unreadable).
dump_jsonl(python_obj, 'multilines_json_file.json')
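
Note that json.load pulls the whole ~2 GB file into memory. As an alternative (a sketch, not from the original post; the output path is an assumption), the conversion could also be done in Spark itself, since df.write.json emits one JSON object per line:

# Read only the big multiline file, then rewrite it as JSON Lines in the lake.
df_big = spark.read.json(
    f"{blob_path_raw}/path_to_directory_described_above/json_file_5",
    multiLine=True,
)
df_big.write.mode("overwrite").json(f"{blob_path_raw}/path_to_converted_directory")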
