
Adding a column in AWS Glue dynamic dataframe

I am very new to AWS Glue. I am working on a small project, and the ask is to read a file from an S3 bucket, transpose it, and load it into a MySQL table. The source data in the S3 bucket looks as below:

    +----+----+-------+-----+---+--+--------+
    |cost|data|minutes|name |sms|id|category|
    +----+----+-------+-----+---+--+--------+
    |  5 |1000|  200  |prod1|500|p1|service |
    +----+----+-------+-----+---+--+--------+

The target table structure is Product_id, Parameter, value

I am expecting the target table to have the following values:

p1, cost, 5

p1, data, 1000

I am able to load the target table with ID and Value, but I am not able to populate the parameter column. This column is not present in the input data, and I want to populate it with a string that depends on which column's value I am loading.

Here is the code I used for cost.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## @type: DataSource
## @args: [database = "mainclouddb", table_name = "s3product", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mainclouddb", table_name = "s3product", transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("cost", "long", "value", "int"), ("id", "string", "product_id", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("cost", "long", "value", "int"), ("id", "string", "product_id", "string")], transformation_ctx = "applymapping1")

## @type: SelectFields
## @args: [paths = ["product_id", "parameter", "value"], transformation_ctx = "selectfields2"]
## @return: selectfields2
## @inputs: [frame = applymapping1]
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["product_id", "parameter", "value"], transformation_ctx = "selectfields2")

## @type: ResolveChoice
## @args: [choice = "MATCH_CATALOG", database = "mainclouddb", table_name = "mysqlmaincloud_product_parameter_mapping", transformation_ctx = "resolvechoice3"]
## @return: resolvechoice3
## @inputs: [frame = selectfields2]
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mainclouddb", table_name = "mysqlmaincloud_product_parameter_mapping", transformation_ctx = "resolvechoice3")

## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice4"]
## @return: resolvechoice4
## @inputs: [frame = resolvechoice3]
resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")

## @type: DataSink
## @args: [database = "mainclouddb", table_name = "mysqlmaincloud_product_parameter_mapping", transformation_ctx = "datasink5"]
## @return: datasink5
## @inputs: [frame = resolvechoice4]
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "mainclouddb", table_name = "mysqlmaincloud_product_parameter_mapping", transformation_ctx = "datasink5")

job.commit()

Can somebody help me add this new column to my data frame so that it can be made available in the table?

Thanks

One way to add columns to a DynamicFrame directly, without converting to a Spark DataFrame in between, is to use a Map transformation (note that this is different from ApplyMapping).

So let's assume that your input DynamicFrame (with the data looking like your example row) is called dyf_in.

You can do something like the following to create two separate DynamicFrames, one with the cost entries and another with the data entries:

from awsglue.gluetypes import _create_dynamic_record

def getCosts(rec):
  # Emit one record per input row: the product id, a constant parameter name, and the cost value
  return _create_dynamic_record({
    'Product_id': rec['id'],
    'Parameter': 'cost',
    'value': rec['cost']
  })

def getDatas(rec):
  # Same shape, but for the data column
  return _create_dynamic_record({
    'Product_id': rec['id'],
    'Parameter': 'data',
    'value': rec['data']
  })

dyf_costs = Map.apply(frame=dyf_in, f=getCosts, transformation_ctx='dyf_costs')
dyf_datas = Map.apply(frame=dyf_in, f=getDatas, transformation_ctx='dyf_datas')

And then you can either push those DynamicFrames into the same sink (as sketched below), or use something like Join (after adding an extra column in the Map funcs to use as a unique join key, and then dropping it afterwards) to concatenate the two DynamicFrames into a single one.
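For the "same sink" option, here is a minimal sketch that simply reuses the catalog target from the question and writes both frames to it in turn (the database and table names are the asker's; whether two sequential writes are acceptable for your job is an assumption):

for dyf in (dyf_costs, dyf_datas):
    # Write each per-parameter DynamicFrame to the same catalog-defined MySQL table
    glueContext.write_dynamic_frame.from_catalog(
        frame=dyf,
        database="mainclouddb",
        table_name="mysqlmaincloud_product_parameter_mapping")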

One thing I'm not sure Glue is able to do (at least with Map) is this sort of a transpose directly (which is what you're sort of trying to do?), without running through the same DynamicFrame twice as my example does.
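If converting to a Spark DataFrame is acceptable (see the next answer), one way to unpivot in a single pass is Spark SQL's stack expression; this is only a sketch, and it assumes you are fine casting the unpivoted values to a common type such as string:

from pyspark.sql import functions as F

df = dyf_in.toDF()
# Unpivot the cost/data/minutes/sms columns into (Parameter, value) pairs in one pass
unpivoted = df.select(
    F.col("id").alias("Product_id"),
    F.expr("stack(4, 'cost', cast(cost as string), "
           "'data', cast(data as string), "
           "'minutes', cast(minutes as string), "
           "'sms', cast(sms as string)) as (Parameter, value)"))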

Glue DataBrew seems to have some sort of a transpose function available, but I don't know much about DataBrew, and maybe it's not even applicable to your situation, so I won't comment on that further.

For a smaller dataframe you can do the following:

  1. Convert the dynamic frame to a Spark DataFrame
  2. Add a column
  3. Convert back to a dynamic frame

Step 1

datasource0 = datasource0.toDF()

Step 2

from pyspark.sql.functions import udf, col

getNewValues = udf(lambda val: val + 1)  # do what you need to do here instead of val + 1

datasource0 = datasource0.withColumn('New_Col_Name', getNewValues(col('some_existing_col')))

Step 3

from awsglue.dynamicframe import DynamicFrame

datasource0 = DynamicFrame.fromDF(datasource0, glueContext, 'datasource0')
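Applied to the question, here is a hedged sketch of those three steps that adds the constant parameter string with lit() (column names are taken from the question, and this only covers the cost rows):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, lit

# Step 1: DynamicFrame -> Spark DataFrame
df = datasource0.toDF()

# Step 2: keep id and cost, and add the constant parameter name as a literal column
cost_df = df.select(
    col("id").alias("product_id"),
    lit("cost").alias("parameter"),
    col("cost").alias("value"))

# Step 3: back to a DynamicFrame so it can be written with write_dynamic_frame
cost_dyf = DynamicFrame.fromDF(cost_df, glueContext, "cost_dyf")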

The issue is that when you have a large dataset, the toDF() operation is very expensive!
