
aws glue / pyspark - how to create Athena table programmatically using Glue

I am running a script in AWS Glue that loads data from S3, applies some transformations, and saves the results back to S3. I am trying to add one more step to this routine: creating a new table in an existing Athena database.

I cannot find any similar example in the AWS documentation; in the examples I came across, the results are simply written to S3. Is this possible in Glue?

Here is a sample of the code. How should it be modified so that it also creates an Athena table for the output results?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
from pyspark.sql.types import *


args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


# Read the source table from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dataset", table_name = "table_1", transformation_ctx = "datasource0")
# Keep only the id and description columns, preserving their types
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "long", "id", "long"), ("description", "string", "description", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
# Write the transformed data to S3 as Parquet
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://..."}, format = "parquet", transformation_ctx = "datasink4")


# create Athena table with the output results

job.commit()

I can think of two ways to do this. One is to use the SDK to get a reference to the Athena API and use it to execute a query with the CREATE TABLE statement, as seen in this blog post.
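For illustration, here is a minimal sketch of that first approach using boto3, which is available in the Glue job environment. The output table name, columns, and S3 paths below are placeholders based on the mapping in the question; substitute your own values.

import boto3

athena = boto3.client('athena')

# DDL for the output table; the schema matches the ApplyMapping step above
create_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS dataset.table_1_output (
    id bigint,
    description string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/your-output-path/'
"""

# Athena requires an S3 location for query results, even for DDL statements
athena.start_query_execution(
    QueryString=create_table_sql,
    QueryExecutionContext={'Database': 'dataset'},
    ResultConfiguration={'OutputLocation': 's3://your-bucket/athena-query-results/'}
)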

An alternative, and perhaps more interesting, way is to use the Glue API to create a crawler for your S3 bucket and then run the crawler.
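A minimal sketch of the crawler approach, again with boto3; the crawler name, IAM role, database, and S3 path are placeholders, and the role must be able to read the output bucket.

import boto3

glue = boto3.client('glue')

# Register a crawler that points at the job's output path
try:
    glue.create_crawler(
        Name='output-table-crawler',
        Role='AWSGlueServiceRole-example',  # placeholder IAM role
        DatabaseName='dataset',
        Targets={'S3Targets': [{'Path': 's3://your-bucket/your-output-path/'}]}
    )
except glue.exceptions.AlreadyExistsException:
    pass  # the crawler was already created on a previous run

# Run it; the crawler infers the schema and registers the table in the Data Catalog
glue.start_crawler(Name='output-table-crawler')

Note that start_crawler is asynchronous, so the table appears in the catalog only once the crawler run finishes.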

With the second approach your table is catalogued in the Glue Data Catalog, and you can use it not only from Athena but also from EMR or Redshift Spectrum.
