
AWS Glue Python Shell Job Fails with MemoryError

I have an AWS Glue Python Shell job that fails after running for about a minute while processing a 2 GB text file. The job makes minor edits to the file, such as finding and removing some lines, removing the last character on a line, and adding carriage returns based on conditions. The same job runs just fine for files smaller than 1 GB.

  • Job "Maximum capacity setting" is 1.作业“最大容量设置”为 1。
  • "Max concurrency" is 2880. “最大并发”为 2880。
  • "Job timeout (minutes)" is 900. “作业超时(分钟)”为 900。

Detailed failure message:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-9g022ft7/pysh-tf-bb-to-parquet.py", line 134, in <module>
MemoryError

Actual Python code that I am trying to run:

import boto3
import json
import os
import sys
from sys import getsizeof
import datetime
from datetime import datetime
import psutil
import io 
import pandas as pd 
import pyarrow as pa #not supported by glue
import pyarrow.parquet as pq #not supported by glue
import s3fs #not supported by glue

#Object parameters (input and output).
s3region = 'reducted' 
s3bucket_nm = 'reducted' 

#s3 inbound object parameters.
s3object_inbound_key_only = 'reducted' 
s3object_inbound_folder_only = 'reducted' 
s3object_inbound_key = s3object_inbound_folder_only + '/' + s3object_inbound_key_only 

#s3 object base folder parameter.
s3object_base_folder = s3object_inbound_key_only[:-9].replace('.', '_')

#s3 raw object parameters.
s3object_raw_key_only = s3object_inbound_key_only
s3object_raw_folder_only = 'reducted' + s3object_base_folder
s3object_raw_key = s3object_raw_folder_only + '/' + s3object_inbound_key_only

#s3 PSV object parameters.
s3object_psv_key_only = s3object_inbound_key_only + '.psv'
s3object_psv_folder_only = 'reducted' + s3object_base_folder + '_psv'
s3object_psv_key = s3object_psv_folder_only + '/' + s3object_psv_key_only
s3object_psv_crawler = s3object_base_folder + '_psv'

glue_role = 'reducted'

processed_immut_db = 'reducted'

#Instantiate s3 client.
s3client = boto3.client(
    's3',
    region_name = s3region
)

#Instantiate s3 resource.
s3resource = boto3.resource(
    's3',
    region_name = s3region
)

#Store raw object metadata as a dictionary variable.
s3object_raw_dict = {
    'Bucket': s3bucket_nm,
    'Key': s3object_inbound_key
}

#Create raw file object.
s3object_i = s3client.get_object(
    Bucket = s3bucket_nm,
    Key = s3object_raw_folder_only + '/' + s3object_raw_key_only
)

#Initialize the list to hold the raw file data string.
l_data = []

#Load s_data string into a list and transform.
for line in (''.join((s3object_i['Body'].read()).decode('utf-8'))).splitlines():
    #Once the line with the beginning of the field list tag is reached, re-initialize the list.
    if line.startswith('START-OF-FIELDS'):
        l_data = []
    #Load (append) the input file into the list.
    l_data.append(line + '\n')
    #Once the line with the end of the field list tag is reached, remove the field metadata tags.
    if line.startswith('END-OF-FIELDS'):
    #Remove the blank lines.
        l_data=[line for line in l_data if '\n' != line]
        #Remove lines with #.
        l_data=[line for line in l_data if '#' not in line]
        #Remove the tags signifying the start and end of the field list.
        l_data.remove('START-OF-FIELDS\n')
        l_data.remove('END-OF-FIELDS\n')
        #Remove the new line characters (\n) from each field name (assuming the last character in each element).
        l_data=list(map(lambda i: i[:-1], l_data))
        #Insert "missing" field names in the beginning of the header.
        l_data.insert(0, 'BB_FILE_DT')
        l_data.insert(1, 'BB_ID')
        l_data.insert(2, 'RETURN_CD')
        l_data.insert(3, 'NO_OF_FIELDS')
        #Add | delimiter to each field.
        l_data=[each + "|" for each in l_data]
        #Concatenate all header elements into a single element.
        l_data = [''.join(l_data[:])]
    #Once the line with the end of data dataset tag is reached, remove the dataset metadata tags.
    if line.startswith('END-OF-FILE'):
        #Remove TIMESTARTED metadata.
        l_data=[line for line in l_data if 'TIMESTARTED' not in line]
        #Remove lines with #.
        l_data=[line for line in l_data if '#' not in line]
        #Remove the tags signifying the start and end of the dataset.
        l_data.remove('START-OF-DATA\n')
        l_data.remove('END-OF-DATA\n')
        #Remove DATARECORDS metadata.
        l_data=[line for line in l_data if 'DATARECORDS' not in line]
        #Remove TIMEFINISHED metadata.
        l_data=[line for line in l_data if 'TIMEFINISHED' not in line]
        #Remove END-OF-FILE metadata.
        l_data=[line for line in l_data if 'END-OF-FILE' not in line]

#Store the file header into a variable.
l_data_header=l_data[0][:-1] + '\n'

#Add the column with the name of the inbound file to all elements of the file body.
l_data_body=[s3object_inbound_key_only[-8:] + '|' + line[:-2] + '\n' for line in l_data[2:]]

#Combine the file header and file body into a single list.
l_data_body.insert(0, l_data_header)

#Load the transformed list into a string variable.
s3object_o_data = ''.join(l_data_body)

#Write the transformed list from a string variable to a new s3 object.
s3resource.Object(s3bucket_nm, s3object_psv_folder_only + '/' + s3object_psv_key_only).put(Body=s3object_o_data)

I have determined that the "MemoryError" is caused by the line of code below. s3object_i_data_decoded contains the 2 GB file I mentioned earlier. The total memory occupied by the Python process prior to executing this line of code is 2.025 GB. Memory usage seems to jump dramatically after this line of code runs:

#Load the transformed list into a string variable.
s3object_o_data = ''.join(l_data_body)

After measuring the process memory size during the run of the code, I found that whenever a list variable is loaded into another variable, the amount of memory used almost quadruples. So a 2 GB list variable, when assigned to another variable, causes the process memory size to grow to 6+ GB. :/
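For reference, a minimal sketch of how the process memory can be sampled at points like these, using the psutil module the script already imports (the checkpoint labels are just illustrative):

import psutil

def log_rss(label):
    # Resident set size of the current process, in GB.
    rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
    print('{}: {:.3f} GB resident'.format(label, rss_gb))

log_rss('before join')                      # ~2.025 GB with the list in memory
s3object_o_data = ''.join(l_data_body)      # the join builds a second full copy of the data as one string
log_rss('after join')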

I am also assuming Glue Python Shell jobs have difficulty handling files exceeding the 2 GB size range... can anyone confirm this?

  1. Has anyone else experienced this error when processing files larger than 2 GB?
  2. Are there any tweaks that can be made to the job to avoid this "MemoryError"?
  3. Are 2 GB data sets just too large for a Glue Python Shell job, and should Glue Spark perhaps be considered instead?

I could theoretically partition the job into smaller batches via the code itself, but wanted to see if there is lower-hanging fruit.

I'd really like to tweak the existing job and avoid using Glue Spark for this if it's not necessary.

Thanks in advance to everyone for sharing their ideas. :)

If you could show the code snippet, that would be great. 1 DPU provides you with 4 vCPUs and 16 GB of memory, which is more than enough to process your data.

The best you can do is read the file as a StreamingBody and then perform your operations in chunks. You can refer to it here.

Basically, it is best if you utilize the streaming capabilities of S3.

Otherwise, more insight can be shared if you show how you are reading and writing the file, as a 2 GB file is no big deal here.
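For illustration, a minimal sketch of reading the object as a stream (reusing the s3client, s3bucket_nm and s3object_inbound_key variables from the question; process_chunk is a placeholder for your own logic):

s3object_i = s3client.get_object(Bucket=s3bucket_nm, Key=s3object_inbound_key)

# The 'Body' is a botocore StreamingBody; iterate it in ~1 MB chunks
# instead of calling .read() on the whole 2 GB object.
for chunk in s3object_i['Body'].iter_chunks(chunk_size=1024 * 1024):
    process_chunk(chunk)   # placeholder: apply your per-chunk edits here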

I have multiple suggestions and, if you wish, you can implement them: 1. Instead of reading the whole file into memory, load it line by line as you process it (a fuller sketch follows the snippet below):

for line in s3object_i['Body'].iter_lines():
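A minimal sketch of what that could look like for this file format (assuming the s3object_i and s3object_inbound_key_only variables from the question; the section handling is simplified and would need to be checked against the real tag layout). The filtered rows can then be handed to the chunked upload in point 3 instead of being joined into one giant string:

field_names = []   # lines between START-OF-FIELDS and END-OF-FIELDS
data_lines = []    # lines between START-OF-DATA and END-OF-DATA
section = None

# Stream the object line by line; the raw 2 GB blob is never held in memory at once.
for raw_line in s3object_i['Body'].iter_lines():
    line = raw_line.decode('utf-8')
    if line.startswith('START-OF-FIELDS'):
        section = 'fields'
    elif line.startswith('END-OF-FIELDS'):
        section = None
    elif line.startswith('START-OF-DATA'):
        section = 'data'
    elif line.startswith('END-OF-DATA'):
        section = None
    elif line == '' or line.startswith('#'):
        continue                      # drop blank and comment lines as before
    elif line.startswith(('TIMESTARTED', 'TIMEFINISHED', 'DATARECORDS', 'END-OF-FILE')):
        continue                      # drop the remaining metadata rows as before
    elif section == 'fields':
        field_names.append(line)
    elif section == 'data':
        data_lines.append(s3object_inbound_key_only[-8:] + '|' + line + '\n')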
  2. You are using list comprehensions again and again just to filter the data; instead, you can combine them into a single compound statement, since the repeated passes increase the time complexity of your code. It can be optimized like:
    if line.startswith('END-OF-FIELDS'):
        l_data.insert(0, 'BB_FILE_DT')
        l_data.insert(1, 'BB_ID')
        l_data.insert(2, 'RETURN_CD')
        l_data.insert(3, 'NO_OF_FIELDS')
        l_data=[line + "|" for line in l_data if ('' != line) and ('#' not in line)]
        l_data.remove('START-OF-FIELDS')
        l_data.remove('END-OF-FIELDS')
        l_data = [''.join(l_data[:])]

#and
    if line.startswith('END-OF-FILE'):
        l_data.remove('START-OF-DATA')
        l_data.remove('END-OF-DATA')
        l_data=[line for line in l_data if ('TIMESTARTED' not in line) and ('#' not in line) and ('DATARECORDS' not in line) and ('TIMEFINISHED' not in line) and ('END-OF-FILE' not in line)]
  3. For saving the file back to S3, you can leverage multipart upload, or you can create a generator object instead of a list and then stream the results to S3. For example:
from boto3.s3.transfer import TransferConfig

def uploadFileS3(local_path, bucket, key):
    # Upload in ~25 MB parts; boto3 performs a managed multipart upload under the hood
    # once the file exceeds multipart_threshold.
    config = TransferConfig(multipart_threshold=25 * 1024 * 1024,
                            multipart_chunksize=25 * 1024 * 1024,
                            max_concurrency=10,
                            use_threads=True)
    # upload_file expects a path on local disk, so the transformed output would first
    # be written to a temporary file before calling this.
    s3client.upload_file(local_path, bucket, key, Config=config)
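A hypothetical call, assuming the transformed output has first been written to a local temp file (the /tmp path below is illustrative only):

with open('/tmp/transformed.psv', 'w') as f:
    f.writelines(l_data_body)
uploadFileS3('/tmp/transformed.psv', s3bucket_nm, s3object_psv_key)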


------------------------------------------------------------
#or a bit tricky to implement but worth it
------------------------------------------------------------
def file_stream(min_part_size=5 * 1024 * 1024):
    # S3 multipart uploads require every part except the last to be at least 5 MB,
    # so batch the lines into chunks of at least that size instead of yielding
    # one line per part.
    buffer = []
    buffered_bytes = 0
    for line in l_data:
        encoded = line.encode('utf-8')
        buffer.append(encoded)
        buffered_bytes += len(encoded)
        if buffered_bytes >= min_part_size:
            yield b''.join(buffer)
            buffer = []
            buffered_bytes = 0
    if buffer:
        yield b''.join(buffer)

# we have to keep track of all of our parts
part_info_dict = {'Parts': []}
# start the multipart upload process (s3client is the client from the question;
# bucket_name and temp_key are the destination bucket and key, e.g. s3bucket_nm and s3object_psv_key)
multi_part_upload = s3client.create_multipart_upload(Bucket=bucket_name, Key=temp_key)

# Part numbers are required to start at 1
for part_index, chunk in enumerate(file_stream(), start=1):
    # store the return value from upload_part for later
    part = s3client.upload_part(
        Bucket=bucket_name,
        Key=temp_key,
        # PartNumbers need to be in order and unique
        PartNumber=part_index,
        # This 'UploadId' is part of the dict returned by create_multipart_upload
        UploadId=multi_part_upload['UploadId'],
        # The chunk of the file we're streaming.
        Body=chunk,
    )

    # PartNumber and ETag are needed to complete the upload
    part_info_dict['Parts'].append({
        'PartNumber': part_index,
        # Taken from the response of upload_part that we stored above
        'ETag': part['ETag']
    })

# This is what AWS needs to finish the multipart upload process
completed_ctx = {
    'Bucket': bucket_name,
    'Key': temp_key,
    'UploadId': multi_part_upload['UploadId'],
    'MultipartUpload': part_info_dict
}

# Complete the upload. This triggers Amazon S3 to assemble the parts into the final object.
s3client.complete_multipart_upload(**completed_ctx)

If you can implement these changes, then you can process even a 5 GB file in the Glue Python Shell. The key is to better optimize the code.

Hope you get the point.

Thanks.
