Using to_csv after performing some ETL on a Google Cloud bucket

Question

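I was wondering if anyone could help. I am trying to take a CSV from a GCP bucket, load it into a dataframe, and then output the file to another bucket in the same project. With my approach, however, the output does not end up in that bucket, and my DAG just takes a very long time to run. Any insight into this problem?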

import gcsfs
from airflow.operators import python_operator
from airflow import models
import pandas as pd
import logging
import csv
import datetime


fs = gcsfs.GCSFileSystem(project='project-goes-here')
with fs.open('gs://path/file.csv') as f:
    gas_data = pd.read_csv(f)


def make_csv():
    # Creates the CSV file with a datetime with no index, and adds the map, collection and collection address to the CSV
    # Calisto changed their mind on the position of where the conversion factor and multiplication factor should go
    gas_data['Asset collection'] = 'Distribution'
    gas_data['Asset collection address 1'] = 'Distribution'
    gas_data['Asset collection address 2'] = 'Units1+2 Central City'
    gas_data['Asset collection address 3'] = 'ind Est'
    gas_data['Asset collection city'] = 'Coventry'
    gas_data['Asset collection postcode'] = 'CV6 5RY'
    gas_data['Multiplication Factor'] = '1.000'
    gas_data['Conversion Factor'] = '1.022640'
    gas_data.to_csv('gs://path/'
                'Clean_zenos_data_' + datetime.datetime.today().strftime('%m%d%Y%H%M%S''.csv'), index=False,
                quotechar='"', sep=',', quoting=csv.QUOTE_NONNUMERIC)
                logging.info('Added Map, Asset collection, Asset collection address and Saved CSV')

    make_csv_function = python_operator.PythonOperator(
    task_id='make_csv',
    python_callable=make_csv
)

Answer 1

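I'm not sure I understand correctly, but you seem to have nested the creation of your PythonOperator inside make_csv, the very function it depends on, which as far as I can tell is an infinite loop. Maybe try moving it outside the function and see what happens?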
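Whether or not it actually loops, one part of this can be checked in plain Python: statements inside a function body run only when the function is called, so an operator created inside the callable does not exist when the file is merely parsed. The sketch below uses `FakeOperator` and `registered_tasks`, which are hypothetical stand-ins and not part of Airflow's API; they only record which operators get created at import time.

```python
# Stand-in for the set of tasks a DAG file exposes once it has been parsed.
registered_tasks = []


class FakeOperator:
    """Hypothetical stand-in for an operator: records its task_id on creation."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable
        registered_tasks.append(task_id)


def nested_make_csv():
    # Operator created inside the callable: this line runs only if
    # something calls nested_make_csv(), never at parse time.
    FakeOperator(task_id='nested', python_callable=nested_make_csv)


def make_csv():
    pass  # the ETL work would go here


# Operator created at module level: runs as soon as the file is imported.
make_csv_task = FakeOperator(task_id='make_csv', python_callable=make_csv)

tasks_after_parse = list(registered_tasks)
print(tasks_after_parse)  # ['make_csv']
```

Only the module-level operator is registered after parsing, which is why the operator definition in the question should be dedented out of make_csv.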

Answer 2

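Another problem is that you are reading the CSV file outside of any task or Python callable. Airflow will read that file on every heartbeat (every minute, I believe), which is not good. Perhaps you can move the CSV read inside the make_csv() function. I can also see some indentation errors in your code.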
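A minimal sketch of that change, assuming the same placeholder project and paths as the question. `output_path`, `SOURCE_PATH`, `OUTPUT_PREFIX` and `CONSTANT_COLUMNS` are names introduced here for illustration. Writing through `fs.open(..., 'w')` is one way to make the destination bucket explicit; it is not necessarily the only fix for the missing output.

```python
import csv
import datetime
import logging

SOURCE_PATH = 'gs://path/file.csv'  # placeholder path from the question
OUTPUT_PREFIX = 'gs://path/'        # placeholder prefix from the question

# Constant columns the question adds to every row, in the same order.
CONSTANT_COLUMNS = {
    'Asset collection': 'Distribution',
    'Asset collection address 1': 'Distribution',
    'Asset collection address 2': 'Units1+2 Central City',
    'Asset collection address 3': 'ind Est',
    'Asset collection city': 'Coventry',
    'Asset collection postcode': 'CV6 5RY',
    'Multiplication Factor': '1.000',
    'Conversion Factor': '1.022640',
}


def output_path(now):
    """Build the timestamped output path (hypothetical helper)."""
    return OUTPUT_PREFIX + 'Clean_zenos_data_' + now.strftime('%m%d%Y%H%M%S') + '.csv'


def make_csv():
    # Imported inside the callable so that parsing the DAG file stays cheap;
    # these load only when the task actually runs.
    import gcsfs
    import pandas as pd

    fs = gcsfs.GCSFileSystem(project='project-goes-here')

    # The read now happens at task run time, not at DAG parse time.
    with fs.open(SOURCE_PATH) as f:
        gas_data = pd.read_csv(f)

    for column, value in CONSTANT_COLUMNS.items():
        gas_data[column] = value

    target = output_path(datetime.datetime.today())
    with fs.open(target, 'w') as f:
        gas_data.to_csv(f, index=False, quotechar='"', sep=',',
                        quoting=csv.QUOTE_NONNUMERIC)
    logging.info('Saved CSV to %s', target)
```

With this in place, the PythonOperator from the question (task_id='make_csv', python_callable=make_csv) would sit at module level, below the function rather than inside it.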

Answer 1
0 2021-11-18 16:19:15

Answer 2
0 2021-11-18 16:30:54
