Athena query returns empty result because of timing issues

I'm trying to create and query an Athena table based on data located in S3, and it seems that there are some timing issues.

How can I know when all the partitions have been loaded into the table?

The following code returns an empty result:

athena_client.start_query_execution(QueryString=app_query_create_table,
                                    ResultConfiguration={'OutputLocation': output_location})

athena_client.start_query_execution(
    QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`".format(
        athena_db=athena_db, athena_db_partition=athena_db_partition),
    ResultConfiguration={'OutputLocation': output_location})

result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)

But when I add a short delay, it works great:

athena_client.start_query_execution(QueryString=app_query_create_table,
                                    ResultConfiguration={'OutputLocation': output_location})

athena_client.start_query_execution(
    QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`".format(
        athena_db=athena_db, athena_db_partition=athena_db_partition),
    ResultConfiguration={'OutputLocation': output_location})

time.sleep(3)

result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)

The following is the query for creating the table:

query_create_table = '''
            CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
                `time` string,
                `user_advertiser_id` string,
                `predictions` float
            ) PARTITIONED BY (
                dt string
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
            WITH SERDEPROPERTIES (
                'serialization.format' = ',',
                'field.delim' = ','
            ) LOCATION 's3://{bucket}/path/'
            '''

app_query_create_table = query_create_table.format(bucket=bucket,
                                                   athena_db=athena_db,
                                                   athena_db_partition=athena_db_partition)

I would love to get some help.

The start_query_execution call only starts the query; it does not wait for it to complete. You must call get_query_execution periodically until the state of the execution is SUCCEEDED (or FAILED or CANCELLED). In your code, the table creation and the MSCK REPAIR TABLE statement may still be running when you issue the next query, which is why the sleep appears to help.
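As a rough illustration of the polling described above, here is a minimal sketch. The helper name run_query_and_wait and its poll_interval and timeout parameters are made up for this example; only start_query_execution and get_query_execution are actual boto3 Athena client calls. It assumes athena_client is a boto3 Athena client like the one in the question.

```python
import time


def run_query_and_wait(athena_client, query, output_location,
                       poll_interval=1.0, timeout=300.0):
    """Start an Athena query and block until it reaches a terminal state.

    Hypothetical helper: the name and the poll_interval/timeout
    parameters are assumptions made for this sketch.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        ResultConfiguration={'OutputLocation': output_location},
    )
    query_execution_id = response['QueryExecutionId']

    deadline = time.monotonic() + timeout
    while True:
        execution = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id)
        status = execution['QueryExecution']['Status']
        state = status['State']

        if state == 'SUCCEEDED':
            return query_execution_id
        if state in ('FAILED', 'CANCELLED'):
            reason = status.get('StateChangeReason', 'no reason given')
            raise RuntimeError(
                'Query {} ended in state {}: {}'.format(
                    query_execution_id, state, reason))
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'Query {} still in state {} after {} seconds'.format(
                    query_execution_id, state, timeout))

        # Still QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_interval)
```

With something like this, the code in the question would call run_query_and_wait(athena_client, app_query_create_table, output_location) first, then do the same for the MSCK REPAIR TABLE statement, and only then run the actual query, instead of relying on time.sleep(3).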


Not related to your problem per se, but if you create a table with CREATE TABLE … AS, there is no need to add partitions with MSCK REPAIR TABLE … afterwards. A table created that way already contains all the partitions produced by the query, so there are no new partitions to discover right after creation.

Also, in general, avoid using MSCK REPAIR TABLE; it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/
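One commonly used alternative is to add each partition explicitly with an ALTER TABLE … ADD PARTITION statement. The sketch below only builds such a statement as a string. The helper name build_add_partition_query is hypothetical, and the dt=<value>/ layout under s3://{bucket}/path/ is an assumption based on the Hive-style layout that MSCK REPAIR TABLE expects; adjust it to how your data is actually laid out.

```python
def build_add_partition_query(athena_db, table, bucket, dt):
    """Build an ALTER TABLE ... ADD PARTITION statement for one dt value.

    Hypothetical helper. Assumes partitions live under
    s3://<bucket>/path/dt=<value>/ (an assumption, not confirmed
    by the question).
    """
    location = "s3://{bucket}/path/dt={dt}/".format(bucket=bucket, dt=dt)
    return (
        "ALTER TABLE `{db}`.`{table}` "
        "ADD IF NOT EXISTS PARTITION (dt = '{dt}') "
        "LOCATION '{location}'"
    ).format(db=athena_db, table=table, dt=dt, location=location)
```

The resulting string can be passed to start_query_execution in place of the MSCK REPAIR TABLE statement. It is still an asynchronous query, so you have to wait for it to finish before querying the table.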
