
How to run a BigQuery query and then send the output CSV to Google Cloud Storage in Apache Airflow?

I need to run a BigQuery script in Python that outputs the result as a CSV in Google Cloud Storage. Currently, my script triggers the BigQuery query and saves the result directly to my computer.

However, I need to run it in Airflow, so I can't have any local dependencies.

My current script saves the output to my local machine, and then I have to move it to GCS. I've searched online but couldn't figure it out. (I'm very new to Python, so I apologize in advance if this has been asked before!)

import datetime

import pandas as pd
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def run_script():
    # Pull the query result into a DataFrame, then write it to a local CSV.
    df = pd.read_gbq('SELECT * FROM `table/veiw` LIMIT 15000',
                     project_id='PROJECT',
                     dialect='standard'
                     )

    df.to_csv('XXX.csv', index=False)

def copy_to_gcs(filename, bucket, destination_filename):
    # Upload the local file to GCS via the JSON API.
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('storage', 'v1', credentials=credentials)

    body = {'name': destination_filename}
    req = service.objects().insert(bucket=bucket, body=body, media_body=filename)
    resp = req.execute()

current_date = datetime.date.today()
filename = (r"C:\Users\LOCALDRIVE\ETC\ETC\ETC.csv")
bucket = 'My GCS BUCKET'

str_prefix_datetime = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
destfile = 'XXX' + str_prefix_datetime + '.csv'

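One way to drop the local-disk dependency in the script above is to serialize the DataFrame to CSV in memory and upload the bytes directly. This is a minimal sketch, assuming the `google-cloud-storage` client library is available; the bucket and object names are placeholders:

```python
# Sketch: upload a query result to GCS without writing a local file.
# Bucket/object names below are placeholders, not from the original post.
import io

import pandas as pd

def dataframe_to_csv_bytes(df):
    """Serialize a DataFrame to CSV bytes in memory (no local file)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    return buf.getvalue().encode("utf-8")

def upload_csv(df, bucket_name, object_name):
    # Imported here so the serialization helper stays usable offline.
    from google.cloud import storage

    client = storage.Client()  # uses application-default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(dataframe_to_csv_bytes(df),
                            content_type="text/csv")
```

With this approach the worker never touches its local filesystem, which is what Airflow-hosted execution needs.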

Airflow provides several operators for working with BigQuery.

You can see an example in the Cloud Composer code samples that runs a query and then exports the result to CSV:

# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Query recent StackOverflow questions.

bq_recent_questions_query = bigquery_operator.BigQueryOperator(
    task_id='bq_recent_questions_query',
    sql="""
    SELECT owner_display_name, title, view_count
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE creation_date < CAST('{max_date}' AS TIMESTAMP)
        AND creation_date >= CAST('{min_date}' AS TIMESTAMP)
    ORDER BY view_count DESC
    LIMIT 100
    """.format(max_date=max_query_date, min_date=min_query_date),
    use_legacy_sql=False,
    destination_dataset_table=bq_recent_questions_table_id)

# Export query result to Cloud Storage.
export_questions_to_gcs = bigquery_to_gcs.BigQueryToCloudStorageOperator(
    task_id='export_recent_questions_to_gcs',
    source_project_dataset_table=bq_recent_questions_table_id,
    destination_cloud_storage_uris=[output_file],
    export_format='CSV')
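The sample above references names defined elsewhere in the Composer example (`max_query_date`, `min_query_date`, `bq_recent_questions_table_id`, `output_file`). A minimal sketch of one way to define them, with placeholder project and bucket names:

```python
# Sketch only: the project, dataset, and bucket names are placeholders.
import datetime

# Query window: the week ending yesterday (mirrors the "recent questions"
# idea in the Composer sample; the exact window is an assumption here).
yesterday = datetime.date.today() - datetime.timedelta(days=1)
max_query_date = yesterday.strftime('%Y-%m-%d')
min_query_date = (yesterday - datetime.timedelta(days=7)).strftime('%Y-%m-%d')

# Intermediate BigQuery table and the CSV destination in GCS.
bq_recent_questions_table_id = 'my-project.my_dataset.recent_questions'
output_file = 'gs://my-bucket/recent_questions.csv'
```

Note that the operator module paths shown (`bigquery_operator`, `bigquery_to_gcs`) are from Airflow 1.x-era Composer samples; in current Airflow releases the equivalents live in the Google provider package (e.g. `airflow.providers.google.cloud.transfers.bigquery_to_gcs.BigQueryToGCSOperator`).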
