
AWS Glue job: how to merge multiple output .csv files in S3

I created an AWS Glue Crawler and a job. The purpose is to transfer data from a Postgres RDS database table to a single .csv file in S3. Everything is working, but I get a total of 19 files in S3. Every file is empty except three, each of which contains one row of the database table together with the headers. So every row of the database is written to a separate .csv file. What can I do here to specify that I want only one file, where the first row contains the headers and every row of the database follows?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("password", "string", "password", "string"), ("user_id", "string", "user_id", "string"), ("username", "string", "username", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("user_id", "string", "user_id", "string"), ("username", "string", "username", "string"),("password", "string", "password", "string")], transformation_ctx = "applymapping1")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2"]
## @return: datasink2
## @inputs: [frame = applymapping1]
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")
job.commit()

The database looks like this: Database picture

It looks like this in S3: S3 Bucket

One sample .csv file in S3 looks like this:

password,user_id,username
346sdfghj45g,user3,dieter

As I said, there is one file for each table row.

Edit: The multipart upload to S3 doesn't seem to work correctly. It just uploads the parts but doesn't merge them together when finished. Here are the last lines of the job log:

19/04/04 13:26:41 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/04/04 13:26:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/04/04 13:26:41 INFO Executor: Finished task 16.0 in stage 2.0 (TID 18). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:false s3://bucketname/run-1554384396528-part-r-00018
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00017
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:false s3://bucketname/run-1554384396528-part-r-00019
19/04/04 13:26:41 INFO Executor: Finished task 17.0 in stage 2.0 (TID 19). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00018
19/04/04 13:26:41 INFO Executor: Finished task 18.0 in stage 2.0 (TID 20). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00019
19/04/04 13:26:41 INFO Executor: Finished task 19.0 in stage 2.0 (TID 21). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver from 172.31.20.76:39779 disconnected during shutdown
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver from 172.31.20.76:39779 disconnected during shutdown
19/04/04 13:26:41 INFO MemoryStore: MemoryStore cleared
19/04/04 13:26:41 INFO BlockManager: BlockManager stopped
19/04/04 13:26:41 INFO ShutdownHookManager: Shutdown hook called
End of LogType:stderr

Can you try the following?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("user_id", "string", "user_id", "string"), ("username", "string", "username", "string"),("password", "string", "password", "string")], transformation_ctx = "applymapping1")

## Force one partition, so it can save only 1 file instead of 19
repartition = applymapping1.repartition(1)

datasink2 = glueContext.write_dynamic_frame.from_options(frame = repartition, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
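For reference, the same single-file result can also be produced with Spark's own CSV writer. This is only a sketch under the same placeholders as above (the "single-output/" prefix is hypothetical), not part of the original job: convert the mapped DynamicFrame to a Spark DataFrame, coalesce it to one partition, and write a headered CSV.

spark_df = applymapping1.toDF()
## One partition means one output file; the header option writes the column names as the first row
spark_df.coalesce(1).write.mode("overwrite").option("header", "true").csv("s3://BUCKETNAMENOTSHOWN/single-output/")

Note that Spark still names the output part-00000-<uuid>.csv inside that prefix, so if an exact file name is required, the object would have to be copied or renamed afterwards.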

Also, if you want to check how many partitions you currently have, you can try the following code. I'm guessing there are 19, which is why it saves 19 files back to S3:

 ## DynamicFrame is needed for the conversions below
 from awsglue.dynamicframe import DynamicFrame

 ## Convert to a PySpark DataFrame
 dataframe = DynamicFrame.toDF(applymapping1)
 ## Print the number of partitions
 print(dataframe.rdd.getNumPartitions())
 ## Convert back to a DynamicFrame
 datasink2 = DynamicFrame.fromDF(dataframe, glueContext, "datasink2")
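If the printed count is greater than 1, a coalesce on the Spark DataFrame (standard Spark behaviour, shown here only as a sketch and not part of the original answer) collapses it to a single partition without the full shuffle that repartition(1) triggers, and the result can be converted back before writing:

 ## Collapse to one partition without a full shuffle, then convert back for the Glue sink
 coalesced = dataframe.coalesce(1)
 datasink2 = DynamicFrame.fromDF(coalesced, glueContext, "datasink2")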

If these are not big datasets, then you could consider converting the Glue DynamicFrame (glue_dyf) to a Spark DataFrame (spark_df) and then the Spark DataFrame to a pandas DataFrame (pandas_df), as below:

from awsglue.dynamicframe import DynamicFrame

spark_df = DynamicFrame.toDF(glue_dyf)
pandas_df = spark_df.toPandas()
pandas_df.to_csv("s3://BUCKETNAME/subfolder/FileName.csv", index=False)

With this method you need not worry about repartitioning for small volumes of data. For large datasets, it is advisable to follow the previous answers and leverage Glue workers and Spark partitions.
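One caveat with the pandas snippet above: writing directly to an s3:// path from pandas relies on the s3fs package being importable on the workers. If it is not available, a possible workaround (a sketch only; the bucket and key names are placeholders) is to serialize the CSV into an in-memory buffer and upload it with boto3, which is normally available in Glue jobs:

import io
import boto3

## Serialize the pandas DataFrame to an in-memory CSV, then upload it as a single S3 object
csv_buffer = io.StringIO()
pandas_df.to_csv(csv_buffer, index=False)
boto3.client("s3").put_object(Bucket="BUCKETNAME", Key="subfolder/FileName.csv", Body=csv_buffer.getvalue())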
