Python: read .json files from GCS into a pandas DataFrame in parallel

TL;DR: asyncio vs. multiprocessing vs. threading vs. some other solution to parallelize a for loop that reads files from GCS, then appends this data together into a pandas DataFrame, then writes to BigQuery...
I'd like to parallelize a python function that reads hundreds of thousands of small .json files from a GCS directory, converts those .jsons into pandas dataframes, and then writes the dataframes to a BigQuery table.
Here is a non-parallel version of the function:
import json
import gcsfs
import pandas as pd
from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):
    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new table
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem()  # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:

        # read files from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
        data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
        this_df = pd.DataFrame(data)
        output_df = output_df.append(this_df)

        # Write to BigQuery for every 5K rows of data
        counter += 1
        if (counter % 5000 == 0):
            output_df.to_gbq(bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame()  # and reset the dataframe

    # Write remaining rows to BigQuery
    output_df.to_gbq(bq_table, project_id=my_id, if_exists='append')
This function is straightforward: get_gcs_file_list returns ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS, and the loop reads each file and appends the results to BigQuery in batches. I have to run this function on a few GCS directories, each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process will take ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.
Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the python script. Pandas is handling some data cleaning, and using bq load will throw errors. There is asyncio and this gcloud-aio-storage, both of which seem potentially useful for this task, maybe as better options than threading or multiprocessing...
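Since the per-file work here is I/O-bound (network reads), a plain thread pool is often the simplest in-script option: the GIL is released while waiting on the network, so threads overlap the reads. Below is a minimal sketch along those lines, reusing the question's get_gcs_file_list helper; the worker count and batch size are illustrative assumptions, not tuned values:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def to_records(gcs_data):
    """Normalize a parsed JSON payload (dict or list) to a list of records."""
    return [gcs_data] if isinstance(gcs_data, dict) else gcs_data

def chunked(seq, size):
    """Yield successive slices of `seq` with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def load_gcs_to_bq_parallel(gcs_directory, bq_table, project_id,
                            workers=32, batch_size=5000):
    """Read GCS files with a thread pool, concat each batch, append to BigQuery."""
    # third-party / project imports kept inside the function so the pure
    # helpers above stay importable without GCS access
    import gcsfs
    import pandas as pd
    from my.helpers import get_gcs_file_list

    fs = gcsfs.GCSFileSystem()
    files = get_gcs_file_list(directory=gcs_directory)

    def read_one(path):
        with fs.open(path, 'r') as f:
            return pd.DataFrame(to_records(json.loads(f.read())))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for group in chunked(files, batch_size):
            # pool.map overlaps the network reads; concat replaces
            # the slow repeated DataFrame.append
            batch_df = pd.concat(pool.map(read_one, group), ignore_index=True)
            batch_df.to_gbq(bq_table, project_id=project_id, if_exists='append')
```

Batching per 5K files is kept from the original code; pd.concat over a whole batch also avoids the quadratic cost of appending row-by-row.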
Rather than add parallel processing to your python code, consider invoking your python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:
Your line:

    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

New line:

    files = sys.argv[1:]  # ok, import sys, too
Now, you can invoke your program this way:
PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program
xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each line. I believe the number of file names is limited to the maximum command size allowed by the shell. If 100 processes is not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.
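If a shell-limit-sized argument list per invocation is too large (each process would buffer that many files in memory), xargs' -n flag caps how many file names each invocation receives. A sketch, with the batch size as an illustrative assumption; the demo line uses echo as a stand-in for your_program so it can run anywhere:

```shell
PROCESSES=100
FILES_PER_CALL=500   # cap file names per invocation to bound per-process memory
# real pipeline (sketch):
#   get_gcs_file_list.py | xargs -n $FILES_PER_CALL -P $PROCESSES your_program

# runnable demo of the batching: 3 names, at most 2 per invocation
printf 'f1.json\nf2.json\nf3.json\n' | xargs -n 2 -P 2 echo
```

With -n 2 the three names are split across two invocations; without -n, xargs packs as many names as the shell's command-size limit allows.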
Instead of doing this, you can directly use the bq command.

The bq command-line tool is a Python-based command-line tool for BigQuery.

When you use this command, the loading takes place in Google's network, which is much faster than creating a dataframe and loading it into the table.
bq load \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable \
gs://mybucket/my_json_folder/*.json
For more information: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
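The same load can also be started from inside a Python script via the google-cloud-bigquery client library, avoiding the shell entirely. A sketch, assuming the package is installed and credentials are configured; the table and bucket names mirror the example values in the bq command above:

```python
def json_uri(bucket, folder):
    """Build the gs:// wildcard URI for all .json files under a folder."""
    return f"gs://{bucket}/{folder}/*.json"

def load_json_folder(table_id, uri):
    """Start a BigQuery load job for newline-delimited JSON files on GCS and
    wait for it to finish; mirrors `bq load --autodetect` above."""
    # imported lazily: requires GCP credentials to construct a client
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    return load_job.result()  # blocks until the load job completes

# usage (sketch):
# load_json_folder("mydataset.mytable", json_uri("mybucket", "my_json_folder"))
```

Like bq load, the actual parsing and loading run server-side in BigQuery, so the script only submits and waits on the job.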