
Querying Athena tables in AWS Glue Python Shell

Python shell jobs were recently introduced in AWS Glue. The announcement mentioned:

You can now use Python shell jobs, for example, to submit SQL queries to services such as ... Amazon Athena ...

Ok. There is an example of reading data from Athena tables here:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

persons = glueContext.create_dynamic_frame.from_catalog(
             database="legislators",
             table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
# TODO query all persons

However, it uses Spark instead of Python Shell. Those libraries are only available with the Spark job type, and running the example as a Python shell job fails with an error:

ModuleNotFoundError: No module named 'awsglue.transforms'

How can I rewrite the code above to make it executable in the Python Shell job type?

The thing is, the Python Shell job type has its own limited set of built-in libraries.

I only managed to achieve my goal using Boto3 to query the data and Pandas to read it into a dataframe.

Here is the code snippet:

import time

import boto3
import pandas as pd

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
athena_client = boto3.client(service_name='athena', region_name='us-east-1')
bucket_name = 'bucket-with-csv'
print('Working bucket: {}'.format(bucket_name))

def run_query(client, query):
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={ 'Database': 'sample-db' },
        ResultConfiguration={ 'OutputLocation': 's3://{}/fromglue/'.format(bucket_name) },
    )
    return response

def validate_query(client, query_id):
    terminal_states = ["FAILED", "SUCCEEDED", "CANCELLED"]
    response = client.get_query_execution(QueryExecutionId=query_id)
    # wait until the query reaches a terminal state
    while response["QueryExecution"]["Status"]["State"] not in terminal_states:
        time.sleep(1)  # avoid hammering the Athena API in a busy loop
        response = client.get_query_execution(QueryExecutionId=query_id)

    return response["QueryExecution"]["Status"]["State"]

def read(query):
    print('start query: {}\n'.format(query))
    qe = run_query(athena_client, query)
    qstate = validate_query(athena_client, qe["QueryExecutionId"])
    print('query state: {}\n'.format(qstate))

    # Athena writes the result set as a CSV named after the query execution id
    file_name = "fromglue/{}.csv".format(qe["QueryExecutionId"])
    obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    return pd.read_csv(obj['Body'])

# hyphenated identifiers must be double-quoted in Athena SQL
time_entries_df = read('SELECT * FROM "sample-table"')

SparkContext won't be available in Glue Python Shell, hence you need to depend on Boto3 and Pandas to handle the data retrieval. But querying Athena using boto3 and polling the QueryExecutionId to check whether the query has finished adds a lot of overhead.
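If you do stick with boto3, the fixed-interval polling above can be made gentler with exponential backoff. A minimal sketch of that pattern; the `FakeAthenaClient` below is a hypothetical stand-in for `boto3.client('athena')`, used only so the example runs without AWS credentials:

```python
import time

TERMINAL_STATES = {"FAILED", "SUCCEEDED", "CANCELLED"}

def wait_for_query(client, query_id, initial_delay=0.5, max_delay=8.0):
    """Poll get_query_execution with exponential backoff until a terminal state."""
    delay = initial_delay
    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay

# Stand-in for boto3's Athena client: reports RUNNING twice, then SUCCEEDED.
class FakeAthenaClient:
    def __init__(self):
        self.calls = 0

    def get_query_execution(self, QueryExecutionId):
        self.calls += 1
        state = "SUCCEEDED" if self.calls >= 3 else "RUNNING"
        return {"QueryExecution": {"Status": {"State": state}}}

print(wait_for_query(FakeAthenaClient(), "dummy-id", initial_delay=0.01))
```

Swapping the fake for a real `boto3.client('athena')` leaves `wait_for_query` unchanged, since it only relies on the `get_query_execution` response shape.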

Recently awslabs released a new package called AWS Data Wrangler. It extends the power of the Pandas library to AWS, making it easy to interact with Athena and many other AWS services.

Reference links:

  1. https://github.com/awslabs/aws-data-wrangler
  2. https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/006%20-%20Amazon%20Athena.ipynb

Note: the AWS Data Wrangler library won't be available by default inside the Glue Python shell. To include it in a Python shell job, follow the instructions in the following link:

https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs
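With the package attached to the job, the whole start/poll/fetch dance collapses to a single call. A sketch, assuming the same placeholder database and table names as the boto3 snippet above and valid AWS credentials in the job's role (this needs a live Athena setup to run):

```python
import awswrangler as wr

# Runs the query on Athena, waits for it to finish,
# and loads the result set into a Pandas DataFrame.
df = wr.athena.read_sql_query(
    sql='SELECT * FROM "sample-table"',
    database="sample-db",
)
print(df.head())
```

The library handles result staging in S3 and polling internally, which is exactly the overhead the boto3 approach makes you manage by hand.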

I have been using Glue for a few months, and I use:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# read the CSV files that back the Athena table directly from S3
data_frame = spark.read.format("com.databricks.spark.csv")\
    .option("header","true")\
    .load(<PATH TO THE CSVs USED FOR ATHENA - STRING>)
