
Airflow BigQueryGetDataOperator max_results parameter doesn't work

import logging
from datetime import datetime, timedelta

from airflow.utils import dates
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_get_data import BigQueryGetDataOperator




default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': dates.days_ago(2),
}

dag = DAG(
    dag_id='bigQueryPipeline',
    default_args=default_args,
    schedule_interval='0 0 * * *'
)


t1 = BigQueryGetDataOperator(
    task_id='bigquery_test',
    dataset_id='<my-dataset-name>',
    table_id='<my-table-id>',
    max_results='2',
    dag=dag,
)


def print_context(**context):
    import json
    xcom_pull = context['ti'].xcom_pull(task_ids='bigquery_test')
    logging.info('logging %s', json.dumps(xcom_pull))


t2 = PythonOperator(
    task_id='print_result',
    python_callable=print_context,
    provide_context=True,
    dag=dag
)

t1 >> t2

if __name__ == "__main__":
    dag.cli()

So, this is my DAG. I'm testing getting data from a BigQuery table. Everything works except for the max_results argument, which is documented in the docs.

As I can see in the logs:

[2019-11-26 14:46:02,272] {bigquery_get_data.py:92} INFO - Fetching Data from:
[2019-11-26 14:46:02,272] {bigquery_get_data.py:94} INFO - Dataset: <my-dataset> ; Table: <my-table> ; Max Results: 2
[2019-11-26 14:46:02,291] {logging_mixin.py:112} INFO - [2019-11-26 14:46:02,291] {gcp_api_base_hook.py:145} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2019-11-26 14:46:02,309] {logging_mixin.py:112} INFO - [2019-11-26 14:46:02,309] {discovery.py:271} INFO - URL being requested: GET https://www.googleapis.com/discovery/v1/apis/bigquery/v2/rest
[2019-11-26 14:46:02,412] {logging_mixin.py:112} INFO - [2019-11-26 14:46:02,412] {discovery.py:867} INFO - URL being requested: GET https://bigquery.googleapis.com/bigquery/v2/projects/<my-project>/datasets/<my-dataset>tables/<my-table>/data?maxResults=2&alt=json
[2019-11-26 14:46:02,851] {bigquery_get_data.py:106} INFO - Total Extracted rows: 77374

Notice Max Results: 2 in the 2nd line and the ?maxResults=2 query string in the 5th line. Despite that, the last line reports Total Extracted rows: 77374.
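One possible explanation (an assumption on my part, not confirmed in the logs) is that the BigQuery tabledata.list response carries a totalRows field that reports the table's total row count, while the rows actually returned in the response respect maxResults. If the operator logs totalRows, the log line would show 77374 even though only 2 rows came back. A minimal sketch of that distinction, using a made-up response in the shape of the v2 REST API:

```python
# Simulated tabledata.list response. The field names (totalRows, rows,
# f, v) follow the BigQuery v2 REST API; the values are invented.
simulated_response = {
    "totalRows": "77374",           # total rows in the table, as a string
    "rows": [                       # only maxResults rows are returned
        {"f": [{"v": "alice"}, {"v": "1"}]},
        {"f": [{"v": "bob"}, {"v": "2"}]},
    ],
}

def extract_rows(response):
    """Flatten the nested f/v row format into plain lists of cell values."""
    return [[cell["v"] for cell in row["f"]] for row in response.get("rows", [])]

rows = extract_rows(simulated_response)
print("totalRows reported:", simulated_response["totalRows"])   # 77374
print("rows actually returned:", len(rows))                     # 2
```

In other words, the log line and the payload can disagree without either being wrong: one counts the table, the other counts the page of results.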

I'm guessing it might be a BigQuery API bug?

Does anyone know how I can report this to Airflow? And to Google?

Edit: found where to submit bug reports for Airflow.

I don't think it's a BigQuery error; the figure you are seeing is the total number of rows processed by BigQueryGetDataOperator, not the number of rows it returned.

Therefore, if you want to see the result of the second step, t2, look for a completed task (dark green) in the Airflow UI.

[screenshot]

Now, click on the task with Task Id print_result in your DAG.

[screenshot]

This way you will be able to see the logs of this step, including the result.

In addition, you can check the result of the first step by selecting a completed task, as in the previous step, but this time with Task Id bigquery_test.

[screenshot]

Then click on the XCom button, where you will see the two rows returned.
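To double-check the same thing programmatically, the downstream callable can log how many rows were actually pulled from XCom instead of dumping the raw payload. A hedged sketch, written as a plain function so it can run outside Airflow (in the real DAG the argument would come from context['ti'].xcom_pull(task_ids='bigquery_test')):

```python
import json
import logging

def count_pulled_rows(xcom_value):
    """Log the pulled XCom payload and return how many rows it contains.

    xcom_value stands in for the result of xcom_pull(); a hypothetical
    payload shape (a list of row value lists) is assumed here.
    """
    rows = xcom_value or []
    logging.info('pulled %d rows: %s', len(rows), json.dumps(rows))
    return len(rows)

# Simulated payload: two rows, matching max_results=2.
print(count_pulled_rows([['alice', '1'], ['bob', '2']]))
```

If max_results is honored, this prints 2 rather than the table's total row count, which matches what the XCom view shows.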

Here are the results from my test:

Processed rows:

[screenshot]

Results:

[screenshot]
