
How to bulk insert data into an MSSQL database in an AWS Glue Python shell job?

I have large sets of data in S3. In my Python Glue job, I extract the data from those files into a pandas data frame, apply the necessary transformations, and then load it into a Microsoft SQL Server database using the pymssql library. The final data frame contains on average 100-200K rows and 180 columns. Currently I am using pymssql to connect to the database. The problem is that executemany of the cursor class takes too long to load the data: approximately 20 minutes for 100k rows. I checked the logs and it is always the loading step that is slow (screenshot attached). How can I load the data faster? My code is attached here:

import datetime

import numpy as np
import pandas as pd

# s3 (a boto3 S3 client), db_cursor (a pymssql cursor), S3_BUCKET_NAME, each_file
# and all_data (a list) are defined elsewhere in the job script.

# Read the S3 object in 100k-row chunks to keep memory usage bounded
file = s3.get_object(Bucket=S3_BUCKET_NAME, Key=each_file)
for chunk in pd.read_csv(file['Body'], sep=",", header=None, low_memory=False, chunksize=100000):
    all_data.append(chunk)

data_frame = pd.concat(all_data, axis=0)
all_data.clear()

# Strip whitespace from string columns and normalize empty strings to NaN
cols = data_frame.select_dtypes(object).columns
data_frame[cols] = data_frame[cols].apply(lambda x: x.str.strip())
data_frame.replace(to_replace='', value=np.nan, inplace=True)
data_frame.fillna(value=np.nan, inplace=True)
data_frame.insert(0, 'New-column', 1111)

# Convert to a tuple of tuples, mapping NaN to None so pymssql inserts NULL
sql_data_array = data_frame.replace({np.nan: None}).to_numpy()
sql_data_tuple = tuple(map(tuple, sql_data_array))

try:
    # column list and value placeholders elided in the original post
    sql = "insert into [db].[schema].[table](column_names)values(%d,%s,%s,%s,%s,%s...)"
    db_cursor.executemany(sql, sql_data_tuple)
    print("loading completed on {}".format(datetime.datetime.now()))
except Exception as e:
    print(e)

I ended up doing the following, and it gave me much better results (1 million rows in 11 minutes). Use a Glue 2.0 Python job instead of a Python shell job:

  1. Extracted the data from S3.

  2. Transformed it using pandas.

  3. Uploaded the transformed data frame as a CSV to S3.

  4. Created a dynamic frame from a catalog table that was created by a crawler crawling the transformed CSV file. Alternatively, the dynamic frame can be created directly with from_options (see the sketch after the code below).

  5. Synchronized the dynamic frame to the catalog table that was created by a crawler crawling the destination MSSQL table.
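
The snippets below use glueContext and job without showing how they are created. As a minimal sketch (not part of the original post), a Glue 2.0 job script normally starts with the standard boilerplate, 'JOB_NAME' being the usual job argument:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Standard Glue job initialization (sketch)
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)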

    csv_buffer = StringIO()
    s3_resource = boto3.resource("s3", region_name=AWS_REGION)

    file = s3.get_object(Bucket=S3_BUCKET_NAME, Key=each_file)
    for chunk in pd.read_csv(file['Body'], sep=",", header=None, low_memory=False, chunksize=100000):
        all_data.append(chunk)

    data_frame = pd.concat(all_data, axis=0)
    all_data.clear()

    cols = data_frame.select_dtypes(object).columns
    data_frame[cols] = data_frame[cols].apply(lambda x: x.str.strip())
    data_frame.replace(to_replace='', value=np.nan, inplace=True)
    data_frame.fillna(value=np.nan, inplace=True)
    data_frame.insert(0, 'New-column', 1234)

    data_frame.to_csv(csv_buffer)
    result = s3_resource.Object(S3_BUCKET_NAME, 'path in s3').put(Body=csv_buffer.getvalue())
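
Steps 4 and 5 rely on catalog tables built by crawlers, which are normally created and run outside the job (from the console or on a schedule). If you want the job itself to refresh the source table after uploading the CSV, a sketch using the boto3 Glue client follows; the crawler name is a placeholder, and note that start_crawler only kicks the crawl off asynchronously, so the catalog read further down assumes the table already exists:

    # Optional sketch: trigger the crawler that catalogs the transformed CSV (name is a placeholder)
    glue_client = boto3.client("glue", region_name=AWS_REGION)
    glue_client.start_crawler(Name="transformed-csv-crawler")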

    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "source db name", table_name = "source table name", transformation_ctx = "datasource0")

    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [mappings], transformation_ctx = "applymapping1")

    selectfields2 = SelectFields.apply(frame = applymapping1, paths = [column names of destination catalog table], transformation_ctx = "selectfields2")

    resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "destination dbname", table_name = "destination table name", transformation_ctx = "resolvechoice3")

    resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")

    datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "destination db name", table_name = "destination table name", transformation_ctx = "datasink5")

    job.commit()
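
Step 4 notes that the dynamic frame can also be created directly from the transformed CSV in S3 instead of going through a crawler-built catalog table. A minimal sketch of that variant; the S3 path and CSV format options here are assumptions, not values from the original post:

    # Sketch of the from_options variant (path and format options are placeholders/assumptions)
    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://your-bucket/path-to-transformed-csv/"]},
        format="csv",
        format_options={"withHeader": True, "separator": ","},
        transformation_ctx="datasource0",
    )

The rest of the pipeline (ApplyMapping onwards) stays the same.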
