AWS Glue Python Shell Job Fails with MemoryError

I have an AWS Glue Python Shell job that fails after running for about a minute while processing a 2 GB text file. The job makes minor edits to the file, such as finding and removing certain lines, deleting the last character on a line, and adding carriage returns based on conditions. The same job runs just fine for file sizes below 1 GB.

  • The job's "Maximum capacity setting" is 1.
  • "Max concurrency" is 2880.
  • "Job timeout (minutes)" is 900.

The detailed failure message:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-9g022ft7/pysh-tf-bb-to-parquet.py", line 134, in <module>
MemoryError

The actual Python code I am trying to run:

import boto3
import json
import os
import sys
from sys import getsizeof
import datetime
from datetime import datetime
import os
import psutil
import io 
import pandas as pd 
import pyarrow as pa #not supported by glue
import pyarrow.parquet as pq #not supported by glue
import s3fs #not supported by glue

#Object parameters (input and output).
s3region = 'reducted' 
s3bucket_nm = 'reducted' 

#s3 inbound object parameters.
s3object_inbound_key_only = 'reducted' 
s3object_inbound_folder_only = 'reducted' 
s3object_inbound_key = s3object_inbound_folder_only + '/' + s3object_inbound_key_only 

#s3 object base folder parameter.
s3object_base_folder = s3object_inbound_key_only[:-9].replace('.', '_')

#s3 raw object parameters.
s3object_raw_key_only = s3object_inbound_key_only
s3object_raw_folder_only = 'reducted' + s3object_base_folder
s3object_raw_key = s3object_raw_folder_only + '/' + s3object_inbound_key_only

#s3 PSV object parameters.
s3object_psv_key_only = s3object_inbound_key_only + '.psv'
s3object_psv_folder_only = 'reducted' + s3object_base_folder + '_psv'
s3object_psv_key = s3object_psv_folder_only + '/' + s3object_psv_key_only
s3object_psv_crawler = s3object_base_folder + '_psv'

glue_role = 'reducted'

processed_immut_db = 'reducted'

#Instantiate s3 client.
s3client = boto3.client(
    's3',
    region_name = s3region
)

#Instantiate s3 resource.
s3resource = boto3.resource(
    's3',
    region_name = s3region
)

#Store raw object metadata as a dictionary variable.
s3object_raw_dict = {
    'Bucket': s3bucket_nm,
    'Key': s3object_inbound_key
}

#Create raw file object.
s3object_i = s3client.get_object(
    Bucket = s3bucket_nm,
    Key = s3object_raw_folder_only + '/' + s3object_raw_key_only
)

#Initialize the list to hold the raw file data string.
l_data = []

#Load s_data string into a list and transform.
for line in (''.join((s3object_i['Body'].read()).decode('utf-8'))).splitlines():
    #Once the line with the beginning of the field list tag is reached, re-initialize the list.
    if line.startswith('START-OF-FIELDS'):
        l_data = []
    #Load (append) the input file into the list.
    l_data.append(line + '\n')
    #Once the line with the end of the field list tag is reached, remove the field metadata tags.
    if line.startswith('END-OF-FIELDS'):
    #Remove the blank lines.
        l_data=[line for line in l_data if '\n' != line]
        #Remove lines with #.
        l_data=[line for line in l_data if '#' not in line]
        #Remove the tags signifying the start and end of the field list.
        l_data.remove('START-OF-FIELDS\n')
        l_data.remove('END-OF-FIELDS\n')
        #Remove the new line characters (\n) from each field name (assuming the last character in each element).
        l_data=list(map(lambda i: i[:-1], l_data))
        #Insert "missing" field names in the beginning of the header.
        l_data.insert(0, 'BB_FILE_DT')
        l_data.insert(1, 'BB_ID')
        l_data.insert(2, 'RETURN_CD')
        l_data.insert(3, 'NO_OF_FIELDS')
        #Add | delimiter to each field.
        l_data=[each + "|" for each in l_data]
        #Concatenate all header elements into a single element.
        l_data = [''.join(l_data[:])]
    #Once the line with the end of data dataset tag is reached, remove the dataset metadata tags.
    if line.startswith('END-OF-FILE'):
        #Remove TIMESTARTED metadata.
        l_data=[line for line in l_data if 'TIMESTARTED' not in line]
        #Remove lines with #.
        l_data=[line for line in l_data if '#' not in line]
        #Remove the tags signifying the start and end of the dataset.
        l_data.remove('START-OF-DATA\n')
        l_data.remove('END-OF-DATA\n')
        #Remove DATARECORDS metadata.
        l_data=[line for line in l_data if 'DATARECORDS' not in line]
        #Remove TIMEFINISHED metadata.
        l_data=[line for line in l_data if 'TIMEFINISHED' not in line]
        #Remove END-OF-FILE metadata.
        l_data=[line for line in l_data if 'END-OF-FILE' not in line]

#Store the file header into a variable.
l_data_header=l_data[0][:-1] + '\n'

#Add the column with the name of the inbound file to all elements of the file body.
l_data_body=[s3object_inbound_key_only[-8:] + '|' + line[:-2] + '\n' for line in l_data[2:]]

#Combine the file header and file body into a single list.
l_data_body.insert(0, l_data_header)

#Load the transformed list into a string variable.
s3object_o_data = ''.join(l_data_body)

#Write the transformed list from a string variable to a new s3 object.
s3resource.Object(s3bucket_nm, s3object_psv_folder_only + '/' + s3object_psv_key_only).put(Body=s3object_o_data)

I have determined that the "MemoryError" is caused by the line of code below. s3object_i_data_decoded contains the 2 GB file I mentioned earlier. Before this line of code executes, the total memory taken up by the Python process is 2.025 GB. Once this line runs, memory usage appears to spike dramatically:

#Load the transformed list into a string variable.
s3object_o_data = ''.join(l_data_body)

Having measured the process memory size while the code runs, I found that whenever a list variable is loaded into another variable, memory usage roughly quadruples. So the 2 GB list variable causes the memory footprint to grow to 6+ GB when it is assigned to another variable. :/
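
For reference, here is a minimal sketch of how that per-step memory growth can be measured with psutil (which the script above already imports); the print statements are illustrative only:

import os
import psutil

def rss_gb():
    #Resident set size of the current process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

print('before join: {:.3f} GB'.format(rss_gb()))
s3object_o_data = ''.join(l_data_body)   #the line that triggers the spike
print('after join:  {:.3f} GB'.format(rss_gb()))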

I also suspect that Glue Python Shell jobs struggle with files beyond the 2 GB range... can anyone confirm this?

  1. Has anyone else run into this error when processing files larger than 2 GB?
  2. Are there any adjustments that could be made to the job to avoid this "MemoryError"?
  3. Is a 2 GB dataset simply too large for a Glue Python Shell job, and should Glue Spark perhaps be considered instead?

In theory, I could split the work into smaller batches in the code itself, but wanted to see first whether there is any lower-hanging fruit.

I would really prefer to tune the existing job and avoid using Glue Spark for this if it is not necessary.

Thanks in advance to everyone for sharing their thoughts :)

It would be great if you could show a code snippet. 1 DPU gives you 4 vCores and 16 GB of memory, which is more than enough to process your data.

The best thing you can do is to read the file as a StreamingBody and then perform your operations on it in chunks. You can refer here.

Basically, it is best to take advantage of S3's streaming capability.

If you share how you are reading and writing the file, more insight can be given, as a 2GB file is not a big deal here.
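
For example, something along these lines (a rough sketch only; the bucket and key names are placeholders and the filter conditions are merely indicative):

import boto3

s3client = boto3.client('s3')
s3object_i = s3client.get_object(Bucket='my-bucket', Key='inbound/my-file.out')

l_data = []
for raw_line in s3object_i['Body'].iter_lines():   #streams the body in chunks instead of read()-ing it all at once
    line = raw_line.decode('utf-8')                #iter_lines yields bytes without the trailing newline
    if line == '' or '#' in line:                  #drop blank and comment lines on the fly
        continue
    l_data.append(line)

This keeps only the lines you actually want in memory, instead of the raw file plus a decoded copy.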

I have several suggestions; you can implement them if you wish:

  1. While processing the file, do not read the entire file into memory; instead, load it line by line:

for line in s3object_i['Body'].iter_lines():   #iter_lines streams the body and yields each line as bytes
  2. You filter the data with one list comprehension after another; instead, you can combine the conditions into a single compound statement, since the repeated passes increase the time complexity of the code. Optimized, it would look something like:
    if line.startswith('END-OF-FIELDS'):
        l_data.insert(0, 'BB_FILE_DT')
        l_data.insert(1, 'BB_ID')
        l_data.insert(2, 'RETURN_CD')
        l_data.insert(3, 'NO_OF_FIELDS')
        #Drop the tags first, then filter and append the | delimiter in a single pass.
        l_data.remove('START-OF-FIELDS')
        l_data.remove('END-OF-FIELDS')
        l_data=[line + "|" for line in l_data if ('' != line) and ('#' not in line)]
        l_data = [''.join(l_data[:])]

#and
    if line.startswith('END-OF-FILE'):
        l_data.remove('START-OF-DATA')
        l_data.remove('END-OF-DATA')
        l_data=[line for line in l_data if ('TIMESTARTED' not in line) and ('#' not in line) and ('DATARECORDS' not in line) and ('TIMEFINISHED' not in line) and ('END-OF-FILE' not in line)]
  3. To save the file back to s3, you can either leverage multipart upload, or create a generator object instead of a list and then yield the results to s3. For example:
from boto3.s3.transfer import TransferConfig

def uploadFileS3():
    #For uploading to s3 in 25 MB multipart chunks.
    #file, S3_BUCKET, key and s3_client are placeholders; ProgressPercentage is the
    #optional progress-callback class from the boto3 docs and can be left out.
    config = TransferConfig(multipart_threshold=1024*1024*25, max_concurrency=10,
                        multipart_chunksize=1024*1024*25, use_threads=True)

    s3_client.upload_file(file, S3_BUCKET, key,
    Config = config,
    Callback=ProgressPercentage(''.join(l_data))
    )
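
Since this job already has the transformed data in memory rather than in a local file, the same TransferConfig can also be used with upload_fileobj and an in-memory buffer (a sketch that reuses the variable names from the question; it still holds the joined string in memory, so it mainly helps when combined with the line-by-line reading above):

import io
from boto3.s3.transfer import TransferConfig

config = TransferConfig(multipart_threshold=1024*1024*25, multipart_chunksize=1024*1024*25,
                        max_concurrency=10, use_threads=True)

#Stream the in-memory string to s3 as a multipart upload.
s3client.upload_fileobj(
    io.BytesIO(s3object_o_data.encode('utf-8')),
    s3bucket_nm,
    s3object_psv_folder_only + '/' + s3object_psv_key_only,
    Config=config
)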


#------------------------------------------------------------
#Or, a bit trickier to implement, but worth it:
#------------------------------------------------------------
def file_stream():
    for line in l_data:
        yield line

# we have to keep track of all of our parts
part_info_dict = {'Parts': []}
# start the multipart_upload process (s3 is a boto3 S3 client; bucket_name and temp_key are placeholders)
multi_part_upload = s3.create_multipart_upload(Bucket=bucket_name, Key=temp_key)

# Part Indexes are required to start at 1
for part_index, line in enumerate(file_stream(), start=1):
    # store the return value from s3.upload_part for later
    part = s3.upload_part(
        Bucket=bucket_name,
        Key=temp_key,
        # PartNumber's need to be in order and unique
        PartNumber=part_index,
        # This 'UploadId' is part of the dict returned in multi_part_upload
        UploadId=multi_part_upload['UploadId'],
        # The chunk of the file we're streaming.
        Body=line,
    )

    # PartNumber and ETag are needed
    part_info_dict['Parts'].append({
        'PartNumber': part_index,
        # You can get this from the return of the uploaded part that we stored earlier
        'ETag': part['ETag']
    })

# This is what AWS needs to finish the multipart upload process
completed_ctx = {
    'Bucket': bucket_name,
    'Key': temp_key,
    'UploadId': multi_part_upload['UploadId'],
    'MultipartUpload': part_info_dict
}

# Complete the upload. This triggers Amazon S3 to rebuild the file for you.
# No need to manually stitch all of the parts together ourselves!
s3.complete_multipart_upload(**completed_ctx)
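
One caveat with uploading one part per line: S3 multipart uploads require every part except the last to be at least 5 MB, so in practice the generated lines need to be buffered into larger chunks before each upload_part call. A minimal sketch of that variation (bucket and key names are placeholders, and l_data stands for the lines produced by the suggestions above):

import boto3

s3 = boto3.client('s3')
bucket_name = 'my-bucket'               #placeholder
temp_key = 'outbound/my-file.psv'       #placeholder

MIN_PART_SIZE = 5 * 1024 * 1024         #every part except the last must be at least 5 MB

multi_part_upload = s3.create_multipart_upload(Bucket=bucket_name, Key=temp_key)
part_info_dict = {'Parts': []}

buffer, buffer_size, part_index = [], 0, 1
for line in l_data:                     #l_data is built line by line, as in the suggestions above
    buffer.append(line + '\n')
    buffer_size += len(line) + 1
    if buffer_size >= MIN_PART_SIZE:
        part = s3.upload_part(
            Bucket=bucket_name, Key=temp_key, PartNumber=part_index,
            UploadId=multi_part_upload['UploadId'],
            Body=''.join(buffer).encode('utf-8'),
        )
        part_info_dict['Parts'].append({'PartNumber': part_index, 'ETag': part['ETag']})
        part_index += 1
        buffer, buffer_size = [], 0

if buffer:                              #upload whatever is left as the final, possibly smaller, part
    part = s3.upload_part(
        Bucket=bucket_name, Key=temp_key, PartNumber=part_index,
        UploadId=multi_part_upload['UploadId'],
        Body=''.join(buffer).encode('utf-8'),
    )
    part_info_dict['Parts'].append({'PartNumber': part_index, 'ETag': part['ETag']})

s3.complete_multipart_upload(
    Bucket=bucket_name, Key=temp_key,
    UploadId=multi_part_upload['UploadId'],
    MultipartUpload=part_info_dict,
)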

If you can implement these changes, you can process even 5GB files in the Glue Python shell. The key is to optimize the code better.

Hope you get the idea.

Thanks.
