Python: read .json files from GCS into a pandas DF in parallel

Question

TL;DR: asyncio, multiprocessing, threading, or some other solution to parallelize a loop that reads files from GCS, appends the data to a pandas dataframe, and then writes it to BigQuery...

I'd like to parallelize a Python function that reads hundreds of thousands of small .json files from a GCS directory, converts them into pandas dataframes, and then writes the dataframes to a BigQuery table.

Here is a non-parallel version of the function:

import json

import gcsfs
import pandas as pd
from my.helpers import get_gcs_file_list


def load_gcs_to_bq(gcs_directory, bq_table):

    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new table
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem()  # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:

        # read files from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.load(f)
        # a file holds either a single record (dict) or a list of records
        data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
        this_df = pd.DataFrame(data)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        output_df = pd.concat([output_df, this_df])

        # Write to BigQuery after every 5K files
        counter += 1
        if counter % 5000 == 0:
            # my_id is the GCP project ID, defined elsewhere
            output_df.to_gbq(bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame()  # and reset the dataframe

    # Write remaining rows to BigQuery, if any
    if not output_df.empty:
        output_df.to_gbq(bq_table, project_id=my_id, if_exists='append')

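This function is straightforward:

Grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
Loop over each file name, and:
- read the file from GCS
- convert the data into a pandas DF
- append to a main pandas DF
- every 5K loops, write to BigQuery (since the appends get slower as the DF gets larger)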
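
The "every 5K loops" step in the list above can be isolated into a small helper that cuts the file list into fixed-size batches. This is only a minimal sketch, not the asker's code: `batched` is a hypothetical name, and the file names in the demo are placeholders.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder names standing in for the real GCS listing.
    names = [f"gcs_dir/file{i}.json" for i in range(1, 13)]
    for batch in batched(names, 5):
        print(len(batch), batch[0], batch[-1])
```

Each batch can then be read, converted to one dataframe, and written to BigQuery as a unit, which also makes the batches natural units of work for any parallel scheme.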

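I have to run this function on a few GCS directories, each with ~500K files. Because of the bottleneck of reading and writing this many small files, the process takes ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.

Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the Python script. Pandas is handling some data cleaning, and using bq load throws errors. There is asyncio and this gcloud-aio-storage, both of which seem potentially useful for this task, maybe as better options than threading or multiprocessing...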
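
For reference, one way to parallelize the reads inside the script is a thread pool from the standard library's `concurrent.futures`. Note that this is a threading variant, not the asyncio or gcloud-aio-storage approach mentioned above. It is a rough sketch under assumptions: `read_records`, `read_all`, and `load_directory` are hypothetical names, `max_workers=32` is an arbitrary starting value, and it assumes a single `gcsfs.GCSFileSystem` instance can be shared across threads, which is worth verifying for the gcsfs version in use.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Callable, Iterable, List


def read_records(open_file: Callable, path: str) -> List[dict]:
    """Read one JSON file and return its contents as a list of records."""
    with open_file(path, "r") as f:
        data = json.load(f)
    return [data] if isinstance(data, dict) else data


def read_all(open_file: Callable, paths: Iterable[str], max_workers: int = 32) -> List[dict]:
    """Read many files concurrently; threads fit here because the work is I/O bound."""
    records: List[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises any worker exception.
        for rows in pool.map(partial(read_records, open_file), paths):
            records.extend(rows)
    return records


def load_directory(paths: Iterable[str], max_workers: int = 32):
    """Build one DataFrame from many GCS files (needs gcsfs and pandas installed)."""
    import gcsfs  # third-party, imported lazily so the helpers above stay stdlib-only
    import pandas as pd

    fs = gcsfs.GCSFileSystem()
    return pd.DataFrame(read_all(fs.open, paths, max_workers=max_workers))
```

Collecting plain records and building the dataframe once per batch avoids repeated dataframe appends; the BigQuery write can then happen once per batch of files, as in the original loop.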

Answer 1

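Rather than add parallel processing to your Python code, consider invoking your Python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:

Your line: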

# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory) #

New line:

files = sys.argv[1:]  # ok, import sys, too

Now, you can invoke your program this way:

PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program

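xargs will now take the file names output by get_gcs_file_list.py and invoke your_program in parallel up to 100 times, fitting as many file names as it can on each command line. I believe the number of file names is limited to the maximum command size allowed by the shell. If 100 processes are not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.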
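
To make the one-line change concrete, here is a rough sketch of what your_program might look like once it takes its file names from the command line. Everything beyond `sys.argv[1:]` is an assumption for illustration: `collect_rows` and `main` are hypothetical names, `mydataset.mytable` and `my-project` are placeholders, and the write uses `pandas_gbq.to_gbq` rather than the question's exact call.

```python
import json
import sys
from typing import Callable, List, Sequence


def collect_rows(paths: Sequence[str], open_file: Callable = open) -> List[dict]:
    """Read each JSON file and flatten the contents into one list of records."""
    rows: List[dict] = []
    for path in paths:
        with open_file(path, "r") as f:
            data = json.load(f)
        rows.extend([data] if isinstance(data, dict) else data)
    return rows


def main(argv: Sequence[str]) -> int:
    paths = list(argv[1:])  # xargs appends a batch of file names after the program name
    if not paths:
        return 0  # some xargs versions run the command once even on empty input

    # Third-party imports kept local so collect_rows works without them.
    import gcsfs
    import pandas as pd
    import pandas_gbq

    fs = gcsfs.GCSFileSystem()
    df = pd.DataFrame(collect_rows(paths, fs.open))
    # Placeholder table and project names; replace with real ones.
    pandas_gbq.to_gbq(df, "mydataset.mytable", project_id="my-project", if_exists="append")
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Since every invocation appends to the table on its own, it may help to cap how many file names each one receives with the `-n` option of xargs, for example `xargs -n 5000 -P $PROCESSES your_program`.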

Answer 2

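Instead of doing this, you can use the bq command directly.

The bq command-line tool is a Python-based command-line tool for BigQuery.

When you use this command, the load takes place within Google's network, which is much faster than creating a dataframe and loading it into the table.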

    bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/my_json_folder/*.json

For more info - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
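
The same kind of load job can also be started from Python with the google-cloud-bigquery client, which may fit better into a script. This is a minimal sketch under assumptions: `load_json_from_gcs` and `to_ndjson` are hypothetical helper names, and the URI and table names in the usage note are the placeholders from the command above. The NEWLINE_DELIMITED_JSON format expects one JSON object per line, so files that hold a pretty-printed object or a top-level array would need converting first; `to_ndjson` shows one way to do that.

```python
import json
from typing import List, Union


def to_ndjson(data: Union[dict, List[dict]]) -> str:
    """Serialize one record or a list of records as newline-delimited JSON."""
    records = [data] if isinstance(data, dict) else data
    return "".join(json.dumps(record) + "\n" for record in records)


def load_json_from_gcs(uri: str, table_id: str) -> int:
    """Run a BigQuery load job for newline-delimited JSON in GCS; return rows loaded."""
    from google.cloud import bigquery  # third-party: google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        # Append to the table, like if_exists='append' in the question.
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the job finishes; raises on failure
    return load_job.output_rows
```

A call would look like `load_json_from_gcs("gs://mybucket/my_json_folder/*.json", "mydataset.mytable")`, with the table given as `dataset.table` or `project.dataset.table`.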
