
StackOverflowError in AWS Glue while reading a million JSON files in Spark

The error happens when we try to load over a million JSON files from S3 in Spark as part of an AWS Glue job:

spark_df = spark.read.load(s3_json_path, format='json')

700,000 files work fine, but over one million files result in a StackOverflowError. I am using Python 3, Glue 2.0, worker type G.1X, and 10 workers (switching to G.2X or adding more workers to increase capacity does not help). The stack trace is:

ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/glue_job.py", line 107, in <module>
    spark_df = spark.read.load(s3_json_path, format='json')
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.StackOverflowError

As a real StackOverflowError, the question fits this site perfectly. Spark should be able to load a million files, so where is the problem? Which setting do I have to tweak to make this work?
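For reference, this is roughly how the job is configured (a sketch using the boto3 Glue API; the job name, IAM role and script location are placeholders, not the real values):

import boto3

glue = boto3.client('glue')
glue.create_job(
    Name='load-million-json-files',                       # placeholder job name
    Role='arn:aws:iam::123456789012:role/MyGlueJobRole',  # placeholder IAM role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/glue_job.py',  # placeholder script path
        'PythonVersion': '3',
    },
    GlueVersion='2.0',
    WorkerType='G.1X',   # G.2X was also tried, without success
    NumberOfWorkers=10,
)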

I have not found a solution, but I did find a workaround. Apparently Spark/PySpark has a problem reading this many files from S3 at once. A number of pages explain why large numbers of small files from S3 should not be pulled into Spark directly; big data has a well-known small-files problem. As a workaround, we can read the JSON files from S3 in parallel using Boto3 and a custom loading function:

import json
import boto3
from pyspark.sql.types import StructType, StructField, StringType

# read all JSON object keys from S3 using Boto3 and pagination
# (bucket and prefix are the S3 bucket name and the key prefix of the JSON files)
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
file_list = []
for page in pages:
    for obj in page['Contents']:
        file_list.append(obj['Key'])

# it is important to create the S3 client *inside* the function,
# because boto3 clients are not serialisable and cannot be shipped to the executors
def load_json(key):
    s3_client = boto3.client('s3')
    json_data = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
    return json.loads(json_data)

# read the JSON files in parallel with the custom loading function
data = spark.sparkContext.parallelize(file_list).map(load_json)
dataSchema = StructType([StructField("name", StringType(), True)])  # ... add the remaining fields
spark_df = spark.createDataFrame(data, schema=dataSchema)
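As a follow-up note on the design: the degree of parallelism can be controlled explicitly via numSlices, and a simple action forces the lazy pipeline to run so that read errors surface immediately. A minimal sketch (the value 1000 is an arbitrary example, not a tuned setting):

# numSlices is a standard argument of SparkContext.parallelize and controls how
# many partitions (and hence concurrent S3 reads) the key list is split into.
data = spark.sparkContext.parallelize(file_list, numSlices=1000).map(load_json)
spark_df = spark.createDataFrame(data, schema=dataSchema)

# force evaluation to confirm that all objects can be read and parsed
spark_df.printSchema()
print(spark_df.count())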
