
Python read .json files from GCS into pandas DF in parallel

TL;DR: asyncio vs. multiprocessing vs. threading vs. some other solution for parallelizing a for loop that reads files from GCS, appends the data into a pandas dataframe, and then writes to BigQuery...

I'd like to parallelize a Python function that reads hundreds of thousands of small .json files from a GCS directory, converts those .json files into pandas dataframes, and then writes the dataframes to a BigQuery table.

Here is the non-parallel version of the function:

import gcsfs
import json
import pandas as pd
from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):

    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new table
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem() # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:

        # read files from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
            data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
            this_df = pd.DataFrame(data)
            output_df = output_df.append(this_df)

        # Write to BigQuery every 5K files
        counter += 1
        if (counter % 5000 == 0):
            # my_id is my GCP project id, defined elsewhere
            pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame() # and reset the dataframe


    # Write remaining rows to BigQuery
    pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')

This function is straightforward:

  • grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
  • loop over each file name, and:
    • read the file from GCS
    • convert the data into a pandas DF
    • append it to a main pandas DF
    • every 5K files, write to BigQuery (since the appends get much slower as the DF gets larger)

I have to run this function on a few GCS directories, each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process takes ~24 hours for a single directory... It would be great if I could parallelize it to speed things up, since it seems like a task that lends itself to parallelization.

Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the Python script. Pandas is handling some data cleaning, so using bq load would throw errors. There are asyncio and this gcloud-aio-storage package, both of which seem potentially useful for this task, maybe as better options than threading or multiprocessing...
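For concreteness, here is a rough, untested sketch of the kind of in-script parallelism I mean, using a plain thread pool from concurrent.futures. The read_one helper, the worker count, and the chunk size are just placeholders, not a final design:

# A sketch only: read files in parallel threads, then write one batch per
# chunk to BigQuery. workers, chunk_size, and read_one are illustrative.
import json
from concurrent.futures import ThreadPoolExecutor

import gcsfs
import pandas as pd

def read_one(fs, path):
    # read a single .json file from GCS and return it as a small dataframe
    with fs.open(path, 'r') as f:
        gcs_data = json.loads(f.read())
    data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
    return pd.DataFrame(data)

def load_gcs_to_bq_threaded(files, bq_table, project_id, workers=32, chunk_size=5000):
    fs = gcsfs.GCSFileSystem()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # process the file list in chunks of chunk_size files;
        # one BigQuery append per chunk, as in the loop above
        for start in range(0, len(files), chunk_size):
            chunk = files[start:start + chunk_size]
            dfs = pool.map(lambda p: read_one(fs, p), chunk)
            output_df = pd.concat(list(dfs), ignore_index=True)
            output_df.to_gbq(bq_table, project_id=project_id, if_exists='append')

Whether these gcsfs reads are best parallelized with threads, asyncio, or multiple processes is exactly what I'm unsure about.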

Rather than adding parallel processing to your Python code, consider invoking your Python program multiple times in parallel. This trick lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:

Your line:

# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory)

New line:

files = sys.argv[1:]  # ok, import sys, too

Now, you can invoke your program this way:

PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program

xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can onto each command line. I believe the number of file names per invocation is limited to the maximum command length allowed by the system. If 100 processes are not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program run simultaneously. You can vary the number of processes based on the resources available to your host.
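The listing script on the left of the pipe just has to print one file name per line for xargs to consume. The question doesn't show the asker's actual helper, so this is only a guess at what get_gcs_file_list.py could look like, using a gcsfs listing and taking the GCS directory as its first argument:

#!/usr/bin/env python
# Hypothetical stand-in for the asker's helper: list the .json files under a
# GCS directory and print one path per line.
# usage: get_gcs_file_list.py my-bucket/gcs_dir | xargs -P 100 your_program
import sys
import gcsfs

fs = gcsfs.GCSFileSystem()
for path in fs.ls(sys.argv[1]):
    if path.endswith('.json'):
        print(path)

If the listing itself is large, adding xargs -n to cap the number of file names per invocation keeps each run of your_program to a predictable batch size.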

Instead of doing this, you can use the bq command directly.

The bq command-line tool is a Python-based command-line tool for BigQuery.

When you use this command, the load takes place inside Google's network, which is much faster than building a dataframe locally and loading it into the table.

    bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/my_json_folder/*.json

For more information - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
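If you want to stay inside Python rather than shelling out to the bq CLI, the same server-side load can be started with the google-cloud-bigquery client. A minimal sketch, reusing the bucket and table names from the command above (your project and credentials come from the default client configuration):

from google.cloud import bigquery

client = bigquery.Client()  # picks up the default GCP project and credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

# the load job runs inside BigQuery, so no data passes through your machine
load_job = client.load_table_from_uri(
    "gs://mybucket/my_json_folder/*.json",
    "mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

The same caveat as the bq command applies: the files must be acceptable to BigQuery as newline-delimited JSON, so this only helps if the pandas cleaning step can be dropped or moved into BigQuery afterwards.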
