
Python read .json files from GCS into pandas DF in parallel

TL;DR: asyncio vs. multiprocessing vs. threading vs. some other solution to parallelize a for loop that reads files from GCS, appends the data into a pandas dataframe, and then writes to BigQuery...

I'd like to parallelize a Python function that reads hundreds of thousands of small .json files from a GCS directory, converts those .jsons into pandas dataframes, and then writes the pandas dataframes to a BigQuery table.

Here is a non-parallel version of the function:

import gcsfs
import json
import pandas as pd
from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):

    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new table
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem() # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:

        # read each file from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
            data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
            this_df = pd.DataFrame(data)
            output_df = output_df.append(this_df)

        # Write to BigQuery for every 5K rows of data (my_id is my GCP project id)
        counter += 1
        if (counter % 5000 == 0):
            pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame() # and reset the dataframe

    # Write remaining rows to BigQuery
    pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')

This function is straightforward:

  • grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
  • loop over each file name, and:
    • read the file from GCS
    • convert the data into a pandas DF
    • append it to a main pandas DF
    • every 5K loops, write to BigQuery (since the appends get much slower as the DF gets larger)

I have to run this function on a few GCS directories, each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process takes ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.

Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the Python script. Pandas is handling some data cleaning, and using bq load will throw errors. There is asyncio and this gcloud-aio-storage that both seem potentially useful for this task, maybe as better options than threading or multiprocessing...
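To make that concrete, here is a minimal sketch of the simplest of those in-script options (a thread pool around the blocking gcsfs reads); an asyncio version built on gcloud-aio-storage would follow the same read-then-batch pattern with async downloads. The helper names and the workers/batch_size parameters are made up for illustration, and it assumes pandas-gbq is installed and that sharing one GCSFileSystem across threads is acceptable:

import json
from concurrent.futures import ThreadPoolExecutor

import gcsfs
import pandas as pd

def read_one_file(fs, path):
    # Read a single .json object from GCS and return it as a small dataframe
    with fs.open(path, 'r') as f:
        gcs_data = json.loads(f.read())
    data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
    return pd.DataFrame(data)

def load_gcs_to_bq_threaded(files, bq_table, project_id, workers=32, batch_size=5000):
    # Parallelize only the I/O-bound GCS reads; keep the BigQuery writes batched
    fs = gcsfs.GCSFileSystem()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(files), batch_size):
            batch = files[start:start + batch_size]
            dfs = list(pool.map(lambda p: read_one_file(fs, p), batch))
            output_df = pd.concat(dfs, ignore_index=True)
            output_df.to_gbq(bq_table, project_id=project_id, if_exists='append')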

Rather than add parallel processing to your Python code, consider invoking your Python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:

Your line:

# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory)

New line:

files = sys.argv[1:]  # ok, import sys, too

Now, you can invoke your program this way:

PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program

xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each command line. I believe the number of file names per invocation is limited to the maximum command size allowed by the shell. If 100 processes are not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.
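For completeness, here is a sketch of what get_gcs_file_list.py could look like in this pipeline (the directory argument and the .json filter are assumptions; any script that prints one file name per line to stdout will do):

#!/usr/bin/env python3
# Hypothetical get_gcs_file_list.py: print one GCS object path per line,
# so the output can be piped into `xargs -P $PROCESSES your_program`.
import sys
import gcsfs

def main():
    gcs_directory = sys.argv[1]  # e.g. "my-bucket/gcs_dir" (placeholder)
    fs = gcsfs.GCSFileSystem()
    for path in fs.ls(gcs_directory):
        if path.endswith('.json'):
            print(path)

if __name__ == '__main__':
    main()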

Instead of doing this, you can directly use the bq command.

The bq command-line tool is a Python-based command-line tool for BigQuery.

When you use this command, the load runs inside Google's network, which is much faster than building a dataframe locally and loading it into the table.

    bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/my_json_folder/*.json

For more information - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
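If the load has to be kicked off from inside a Python script rather than from the shell, the same server-side load can be started with the google-cloud-bigquery client library. A minimal sketch, with the project, bucket, and table names as placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")  # placeholder project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema, like --autodetect above
)

# Start a load job that reads the JSON files directly from GCS
load_job = client.load_table_from_uri(
    "gs://mybucket/my_json_folder/*.json",  # placeholder bucket/path
    "mydataset.mytable",                    # placeholder dataset.table
    job_config=job_config,
)
load_job.result()  # block until the load job completes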
