Rewriting a CSV file from Cloud Storage with a Cloud Function and sending it to BigQuery

Question

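I'm writing a small Cloud Function in Python that rewrites a CSV file from Cloud Storage (skipping some columns) and sends it to BigQuery.

The BigQuery part of my script looks like this: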

def bq_import(request):
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = "gs://url.appspot.com/fil.csv"

    load_job = bq_client.load_table_from_uri(
        uri, dataset_ref.table('table'), job_config=job_config
    )  # API request

    load_job.result()  # Waits for table load to complete.

    destination_table = bq_client.get_table(dataset_ref.table('table'))

I found this script, which lets me rewrite a CSV while skipping some columns:

def remove_csv_columns(input_csv, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)

remove_csv_columns('in.csv', 'out.csv', (3, 4))

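So I basically need to make these two scripts work together in my Cloud Function. However, I'm not sure what to do with the remove_csv_columns function, especially the output_csv variable. Should I create an empty dummy CSV file? Or an array, or something like that? How can I rewrite this CSV file on the fly?

I think my final script should look like this, but something is missing...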

uri = "gs://url.appspot.com/fil.csv"

def remove_csv_columns(uri, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
    reader = csv.reader(file_in)
    writer = csv.writer(file_out)
    writer.writerows(
            [col for idx, col in enumerate(row)
            if idx not in exclude_column_indices]
            for row in reader)

def bq_import(request):
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    csv_file = remove_csv_columns('in.csv', 'out.csv', (3, 4))

    load_job = bq_client.load_table_from_uri(
        csv_file, dataset_ref.table('table'), job_config=job_config
    )  # API request

    load_job.result()  # Waits for table load to complete.

    destination_table = bq_client.get_table(dataset_ref.table('table'))

Basically, I think I need to define my CSV file in the bq_import function via remove_csv_columns, but I'm not sure how.

By the way, I'm learning Python, but I'm not a development expert. Thanks.

Answer 1

Your code has several problems. I'll try to make each one clear in my corrections.

uri = "gs://url.appspot.com/fil.csv"

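I don't know how your function is triggered, but for events coming from GCS, the file to process is usually contained in the request object. Build your uri dynamically from the bucket and the name.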
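Here is a minimal sketch of building the URI dynamically. It assumes the function is a background Cloud Function triggered by a Cloud Storage event, in which case the first argument is a dict that carries the bucket and object name under the keys "bucket" and "name". The helper name build_gcs_uri and the sample values are hypothetical.

```python
def build_gcs_uri(event):
    """Build a gs:// URI from a Cloud Storage event payload.

    Assumes `event` is a dict with "bucket" and "name" keys, as delivered
    to a storage-triggered background function.
    """
    return "gs://{}/{}".format(event["bucket"], event["name"])


# Hypothetical event, for illustration only.
sample_event = {"bucket": "url.appspot.com", "name": "fil.csv"}
print(build_gcs_uri(sample_event))
```

If your function is triggered differently (an HTTP call, for example), the bucket and file name would have to come from wherever the caller puts them.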

def remove_csv_columns(uri, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:

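Be careful: you use uri as the function parameter name, but you open your input file in read mode with input_csv. Your code crashes here because input_csv doesn't exist!

One more remark here. uri is the name of a function parameter: it is known only inside the function and has no relationship with the outside, except through the caller who supplies its value. It has no link at all with the global variable you defined earlier with uri = "gs://url.appspot.com/fil.csv".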
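To illustrate the point about parameter names, here is a small standalone example. The function describe is hypothetical and only exists to show that the parameter hides the global of the same name.

```python
uri = "gs://url.appspot.com/fil.csv"  # global variable


def describe(uri):
    # Inside the function, `uri` is the parameter: it holds whatever the
    # caller passed, not the global defined above.
    return "processing " + uri


result = describe("in.csv")
print(result)  # built from the argument "in.csv"
print(uri)     # the global is unchanged
```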

    reader = csv.reader(file_in)
    writer = csv.writer(file_out)
    writer.writerows(
            [col for idx, col in enumerate(row)
            if idx not in exclude_column_indices]
            for row in reader)

def bq_import(request):
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    csv_file = remove_csv_columns('in.csv', 'out.csv', (3, 4))

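Your input file is static. See my remark about building the uri dynamically.

Look at the function remove_csv_columns: it returns nothing, it simply writes a new file to out.csv. So your csv_file doesn't represent anything here (it is None). Also, what does this function do? It reads the in.csv file and writes the out.csv file (with the columns removed). You have to pass files to this function.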
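If you want csv_file to hold something useful, one option is to have the function return the output path. This is only a sketch of that idea, not the only way to do it. The temporary directory and the sample rows are made up for the demonstration.

```python
import csv
import os
import tempfile


def remove_csv_columns(input_csv, output_csv, exclude_column_indices):
    """Copy input_csv to output_csv without the excluded columns.

    Returns output_csv so the caller gets a usable value back.
    """
    with open(input_csv, newline="") as file_in, \
            open(output_csv, "w", newline="") as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)
    return output_csv


# Small self-contained demonstration using a temporary directory.
work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, "in.csv")
out_path = os.path.join(work_dir, "out.csv")
with open(in_path, "w", newline="") as f:
    f.write("a,b,c,d,e\r\n1,2,3,4,5\r\n")

csv_file = remove_csv_columns(in_path, out_path, (3, 4))
with open(csv_file, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```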

By the way, you have to download the file from Cloud Storage and store it locally. In a Cloud Function, only /tmp is writable. So your code should look something like this:

from google.cloud import storage

# create the storage client
storage_client = storage.Client()
# get the bucket by name
bucket = storage_client.get_bucket('<your bucket>')
# get the object as a blob
blob = bucket.get_blob('<your full file name, path included>')
# download the blob straight to a local file
# (download_as_string() returns bytes, which cannot be written to a text-mode file)
blob.download_to_filename('/tmp/input.csv')
# rewrite the file without columns 3 and 4
remove_csv_columns('/tmp/input.csv', '/tmp/out.csv', (3, 4))

Continuing with your code:

    load_job = bq_client.load_table_from_uri(
        csv_file, dataset_ref.table('table'), job_config=job_config
    )  # API request

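The function load_table_from_uri loads a file from Cloud Storage into BigQuery. That is not what you want here: you want to load the file out.csv that you created locally in the function. The correct call is load_job = bq_client.load_table_from_file(open('/tmp/out.csv', 'rb'), dataset_ref.table('table'), job_config=job_config)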

    load_job.result()  # Waits for table load to complete.

    destination_table = bq_client.get_table(dataset_ref.table('table'))
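On the "on the fly" part of the question: as an alternative to writing files under /tmp, the rewrite can also be done entirely in memory. The sketch below assumes the file content is already available as bytes (for example from the downloaded blob) and is UTF-8 encoded. The helper name filter_csv_bytes is hypothetical.

```python
import csv
import io


def filter_csv_bytes(data, exclude_column_indices, encoding="utf-8"):
    """Return a BytesIO holding the CSV in `data` without the excluded columns."""
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    text_out = io.StringIO(newline="")
    writer = csv.writer(text_out)
    writer.writerows(
        [col for idx, col in enumerate(row)
         if idx not in exclude_column_indices]
        for row in reader)
    return io.BytesIO(text_out.getvalue().encode(encoding))


raw = b"a,b,c,d,e\r\n1,2,3,4,5\r\n"  # stands in for the downloaded blob
filtered = filter_csv_bytes(raw, (3, 4))
print(filtered.getvalue())
```

The returned object is a binary file-like object, so it could be passed to load_table_from_file in place of open('/tmp/out.csv', 'rb'). It still holds the whole file in memory, so the size limits discussed in this answer apply in the same way.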

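Then, consider cleaning up the /tmp directory to free memory, watch out for the Cloud Function timeout, list the right libraries in your requirements.txt file (at least the Cloud Storage and BigQuery dependencies), and finally check the roles granted to your Cloud Function.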
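For the cleanup step, a try/finally block is one simple way to make sure the temporary files are removed even when the processing fails. This is a sketch: the helper remove_if_exists is hypothetical, and the demonstration uses a throwaway directory where a Cloud Function would use paths such as /tmp/input.csv and /tmp/out.csv.

```python
import os
import tempfile


def remove_if_exists(*paths):
    """Delete each path that exists; ignore the ones that do not."""
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass


# Demonstration with throwaway files.
work_dir = tempfile.mkdtemp()
paths = [os.path.join(work_dir, name) for name in ("input.csv", "out.csv")]
try:
    for path in paths:
        with open(path, "w") as f:
            f.write("a,b\n")
    # ... process and load the files here ...
finally:
    # Runs whether or not the processing above raised an exception.
    remove_if_exists(*paths)

print([os.path.exists(p) for p in paths])
```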

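However, all of this is only to improve your Python code and skills. In any case, this function is unnecessary.

Indeed, with a Cloud Function, as said before, you can only write to the /tmp directory. It is an in-memory file system, and a Cloud Function is limited to 2 GB of memory (file contents and code execution footprint included). As a result, the input file can't be larger than about 800 MB. For small files, there is an easier way:

1. Load the raw file into a temporary table in BigQuery.
2. Run an INSERT SELECT query in BigQuery on the columns you want, like this:

INSERT INTO `<dataset>.<table>`
SELECT * EXCEPT (<column to ignore>) FROM `<dataset>.<temporary table>`

3. Delete the temporary table.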
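The three steps above could be sketched in Python roughly as follows. This is not a definitive implementation: the function names, the table names, the column names, and the use of schema autodetection for the temporary table are all assumptions made for illustration. The client is expected to be a google.cloud.bigquery.Client.

```python
def build_insert_select_sql(destination, temporary, ignored_columns):
    """Build the INSERT ... SELECT * EXCEPT (...) statement."""
    return (
        "INSERT INTO `{dest}`\n"
        "SELECT * EXCEPT ({cols}) FROM `{tmp}`"
    ).format(dest=destination, cols=", ".join(ignored_columns), tmp=temporary)


def load_without_columns(client, uri, destination, temporary, ignored_columns):
    """Load `uri` into a temporary table, copy the kept columns, drop the table."""
    # Imported here so the SQL helper above stays usable without the library.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # assumption: let BigQuery infer the temporary schema
    )
    try:
        # Step 1: load the raw file into the temporary table.
        client.load_table_from_uri(uri, temporary, job_config=job_config).result()
        # Step 2: copy everything except the ignored columns.
        sql = build_insert_select_sql(destination, temporary, ignored_columns)
        client.query(sql).result()
    finally:
        # Step 3: delete the temporary table, even if a previous step failed.
        client.delete_table(temporary, not_found_ok=True)


# Hypothetical table and column names, for illustration only.
print(build_insert_select_sql(
    "my_dataset.table", "my_dataset.tmp_table", ["col_d", "col_e"]))
```

With this approach the Cloud Function never downloads the file; it only asks BigQuery to do the work.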

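Because your file is small (less than 1 GB) and the BigQuery free tier includes 1 TB of scanned data per month (you pay only for the data scanned, not for the processing, so you can do all the SQL transformations you want at no extra cost), it is easier to process the data in BigQuery than in Python.

With the function, the processing time will be longer, and you could end up paying for the function's processing time rather than for BigQuery.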

Solution 1
2 Accepted 2019-12-17 05:22:07
