Python: reading JSON files from GCS into a pandas DataFrame in parallel

Question

TL;DR: I'm looking for asyncio, multiprocessing, threading, or some other solution to parallelize a loop that reads files from GCS, appends the data together into a pandas DataFrame, and then writes it to BigQuery...

I'd like to parallelize a Python function that reads hundreds of thousands of small .json files from a GCS directory, converts them into pandas DataFrames, and then writes those DataFrames to a BigQuery table.

Here is the non-parallel version of the function:

import json

import gcsfs
import pandas as pd
from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):

    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Collect per-file DataFrames in a list (DataFrame.append was removed in pandas 2.0)
    frames = []
    fs = gcsfs.GCSFileSystem() # Google Cloud Storage (GCS) File System (FS)
    for counter, file in enumerate(files, start=1):

        # read files from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
        # a file may hold a single object or a list of objects
        data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
        frames.append(pd.DataFrame(data))

        # Write to BigQuery after every 5K files
        # (my_id is the GCP project ID, assumed to be defined elsewhere)
        if counter % 5000 == 0:
            pd.concat(frames).to_gbq(bq_table, project_id=my_id, if_exists='append')
            frames = [] # and reset the batch

    # Write remaining rows to BigQuery
    if frames:
        pd.concat(frames).to_gbq(bq_table, project_id=my_id, if_exists='append')

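The function is simple:

Get ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
Loop over each file name, and:
- read the file from GCS
- convert the data to a pandas DF
- append it to the main pandas DF
- every 5K iterations, write to BigQuery (because appending gets slower as the DF grows)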

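I have to run this function on several GCS directories, each with ~500K files. Because of the bottleneck of reading and writing so many small files, the process takes about 24 hours for a single directory... It would be great if I could make it more parallel to speed things up, since this seems like a task well suited to parallelization.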

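Edit: The solutions below are helpful, but I'm particularly interested in running in parallel from within the Python script. Pandas is handling some data cleaning, and using bq load throws errors. There are asyncio and gcloud-aio-storage, both of which seem potentially useful for this task and may be better options than threading or multiprocessing...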
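For reference, one possible shape of the in-script approach is a thread pool, since the per-file work is mostly waiting on network reads. This is only a sketch under assumptions: read_records_parallel and its read_text parameter are hypothetical names, and the reader is passed in so the example runs without GCS access.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_json_records(text):
    """Return a list of records from one JSON document (object or array)."""
    data = json.loads(text)
    return [data] if isinstance(data, dict) else data


def read_records_parallel(files, read_text, max_workers=32):
    """Read and parse many files concurrently.

    read_text(path) -> str is supplied by the caller (hypothetical hook);
    max_workers is an arbitrary default, not a tuned value.
    """
    def load_one(path):
        return parse_json_records(read_text(path))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths
        batches = list(pool.map(load_one, files))

    # Flatten the per-file lists into one list of records
    return [record for batch in batches for record in batch]
```

In the function above, read_text could be a small wrapper that opens the path with fs.open(path, 'r') and returns f.read(), and pd.DataFrame(records) would then build one DataFrame per batch of 5K files. Whether a single GCSFileSystem instance can safely be shared across threads is not checked here.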

Answer 1

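Rather than adding parallel processing to your Python code, consider invoking your Python program multiple times in parallel. This trick is easiest to apply to a program that takes its list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:

Your line: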

# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory) #

New line:

files = sys.argv[1:]  # ok, import sys, too

Now you can invoke your program this way:

PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program

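xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each command line. I believe the number of file names is limited only by the maximum command size the shell allows. If 100 processes are not enough to handle all the files, xargs will invoke your_program again (and again) until all the file names it read from standard input have been processed. xargs ensures that no more than 100 invocations of your_program run at the same time. You can vary the number of processes according to the resources available on your host.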
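To make the batching behavior concrete, here is a simplified Python model of how file names get packed onto command lines. It is an illustration only, not how xargs is implemented: pack_command_lines and max_chars are hypothetical names, and the real size limit depends on the system.

```python
def pack_command_lines(names, max_chars):
    """Greedily group names so each space-joined group fits in max_chars.

    A name longer than max_chars still gets a group of its own.
    """
    batches, current, length = [], [], 0
    for name in names:
        # one extra character for the separating space, except for the first name
        extra = len(name) + (1 if current else 0)
        if current and length + extra > max_chars:
            # current group is full: start a new one with this name
            batches.append(current)
            current, length = [], 0
            extra = len(name)
        current.append(name)
        length += extra
    if current:
        batches.append(current)
    return batches


batches = pack_command_lines(["a.json", "b.json", "c.json"], max_chars=13)
print(batches)  # [['a.json', 'b.json'], ['c.json']]
```

Each inner list corresponds to one invocation of your_program, which would see those names in sys.argv[1:].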

Answer 2

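Instead of doing this, you can use the bq command directly.

The bq command-line tool is a Python-based command-line tool for BigQuery.

When you use this command, the load happens inside Google's network, which is much faster than creating a dataframe ourselves and loading it into the table.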

    bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/my_json_folder/*.json

For more information: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
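Note that NEWLINE_DELIMITED_JSON expects one JSON object per line. The function in the question accepts files holding either a single object or an array of objects, so such files may need converting first. Whether that is the cause of the bq load errors mentioned in the question is not known. A minimal sketch of the conversion, where to_ndjson is a hypothetical helper name:

```python
import json


def to_ndjson(text):
    """Convert one JSON document (object or array of objects) to newline-delimited JSON."""
    data = json.loads(text)
    records = [data] if isinstance(data, dict) else data
    # json.dumps without indent emits each record on a single line
    return "".join(json.dumps(record) + "\n" for record in records)


print(to_ndjson('[{"a": 1}, {"a": 2}]'), end="")
# {"a": 1}
# {"a": 2}
```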

Answer 1
2 2020-07-23 02:03:18

Answer 2
2 2020-07-23 04:13:05
