
Reading a large csv file then splitting it causing an OOM error

Hi, I'm creating a Glue job that reads a csv file and then splits it by a particular column; unfortunately, it's causing an OOM (Out of Memory) error. Please see the code below.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import boto3


# yesterday's date
Current_Date = datetime.now() - timedelta(days=1)
now = Current_Date.strftime('%Y-%m-%d')

# the day before yesterday
Previous_Date = datetime.now() - timedelta(days=2)
prev = Previous_Date.strftime('%Y-%m-%d')

# read the csv file whose name contains today's date
filepath = "s3://bucket/file" + now + ".csv.gz"

data = pd.read_csv(filepath, sep='|', header=None, compression='gzip')

# count the number of distinct dates in column 10 (last_update)
loop = 0
for i, x in data.groupby(data[10].str.slice(0, 10)):
    loop += 1

# if the number of distinct values of column 10 (last_update) is greater than or equal to 7
if loop >= 7:
    # loop over the dataframe and split it by distinct values of column 10 (last_update)
    for i, x in data.groupby(data[10].str.slice(0, 10)):
        x.to_csv("s3://bucket/file{}.csv.gz".format(i.lower()), header=None, compression='gzip')

# if the number of distinct values of column 10 (last_update) is less than 7,
# filter the dataframe to the current and previous date; a new dataframe is created
else:
    d = data[(data[10].str.slice(0, 10) == prev) | (data[10].str.slice(0, 10) == now)]
    # loop over the filtered dataframe and split it by distinct values of column 10 (last_update)
    for i, x in d.groupby(d[10].str.slice(0, 10)):
        x.to_csv("s3://bucket/file{}.csv.gz".format(i.lower()), header=None, compression='gzip')

SOLUTION - I resolved this problem by increasing the maximum capacity of the Glue job.
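
For anyone who wants to script that change rather than use the console, here is a minimal sketch with boto3; the job name and DPU value are placeholders, not values from the original job:

import boto3

glue = boto3.client("glue")
JOB_NAME = "my-split-job"  # placeholder job name

# UpdateJob replaces unspecified settings, so start from the current definition
# and only raise MaxCapacity (DPUs). MaxCapacity cannot be combined with
# WorkerType/NumberOfWorkers, so this assumes the job was created without those.
job = glue.get_job(JobName=JOB_NAME)["Job"]
glue.update_job(
    JobName=JOB_NAME,
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "MaxCapacity": 10.0,  # e.g. 10 DPUs for a Spark ETL job
    },
)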

Not sure how big your file is, but if you read the file in chunks you should be able to avoid the error. We have successfully tested a 2.5 GB file using this method. Also, if you are using the Python shell, remember to update your Glue job's maximum capacity to 1.

# read the file lazily in chunks of 1000 rows instead of loading it all at once
data = pd.read_csv(filepath, chunksize=1000, iterator=True)
for i, chunk in enumerate(data):
    # loop through the chunks and process the data
    ...
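
To make that concrete for the split-by-date case above, here is a rough sketch of one way to do it; the chunk size, the /tmp staging directory and the output key names are assumptions, not part of the answer:

import glob
import os

import boto3
import pandas as pd

# Stream the gzipped csv in chunks and append each date's rows to a local
# staging file, so the whole dataframe is never held in memory at once.
# (S3 objects cannot be appended to, hence the local staging step; gzip
# compression of the outputs is omitted here to keep the appends simple.)
for chunk in pd.read_csv(filepath, sep='|', header=None, compression='gzip',
                         chunksize=100_000):
    for date_key, group in chunk.groupby(chunk[10].str.slice(0, 10)):
        group.to_csv("/tmp/file_{}.csv".format(date_key.lower()),
                     mode='a', header=False, index=False)

# upload the per-date files once the whole input has been consumed
s3 = boto3.client("s3")
for path in glob.glob("/tmp/file_*.csv"):
    s3.upload_file(path, "bucket", os.path.basename(path))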
