
Writing files concurrently with other cpu-bound tasks with multiprocessing or ray

I have a workstation with 72 cores (actually 36 multithreaded CPUs, showing as 72 cores by multiprocessing.cpu_count()).

I tried both multiprocessing and ray for concurrent processing, in batches of millions of small files, and I would like to write some output files concurrently during that processing.

I am confused by the blocking of the .get() methods associated with e.g. apply_async() (in multiprocessing) and ray.get().
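
For reference, a minimal self-contained sketch of that blocking behaviour with a dummy slow_square task (not part of my actual code): .remote() returns an ObjectRef immediately and does not block; only ray.get() blocks, and only until that particular result is available.

import ray

ray.init()

@ray.remote
def slow_square(x):
    # Dummy stand-in for process_group()
    return x * x

ref = slow_square.remote(3)    # returns immediately with an ObjectRef (non-blocking)
# ... other work can happen here while the task runs in the background ...
result = ray.get(ref)          # this is the only call that blocks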

With ray, I have a remote function (process_group()) that processes groups of data in parallel within a loop. In what follows, the version of the code using the multiprocessing module is also given as comments.

import ray
import pandas as pd
# from multiprocessing import Pool

ray.init(num_cpus=60)
# with Pool(processes=n_workers) as pool:
for data_list in many_data_lists:
   ##-----------------------
   ## With ray :
   df_list = ray.get([process_group.remote(data) for data in data_list])
   ##-----------------------
   ## With multiprocessing :
   #f_list = pool.map(process_group, list_of_indices_into_data_list)
   ##
   ##      data are both known from the parent process
   ##      and I use copy-on-write semantic to avoid having 60 copies.
   ##      All the function needs are a list of indices
   ##      of where to fetch slices of the read-only data.  
   ##
   very_big_df = pd.concat(df_list)
   ##-----------------------
   ## Write to file :
   very_big_df.to_parquet(outputfile)

So in each loop iteration, I have to collect the output of the many process_group() calls, which were computed concurrently, as a list of dataframes df_list for concatenation into one bigger very_big_df dataframe. The latter needs to be written to disk (typically the sizes are ~1 to ~3 GB). Writing one such file takes about 10-30 [s], while it takes about 180 [s] for the process_group remotes to get processed. There are thousands of loop iterations, so this will take several days to complete.

Is it possible to have the file written to disk in a non-blocking manner, while the loop continues, in order to save about 10% of the time (that would save about one day of computation)?

By the time the concurrent processes of the next loop iteration finish, there is enough time for the output from the previous iteration to be written. The cores involved here all appear to run at near 100%, so the threading module is probably not recommended either. multiprocessing.apply_async() is even more frustrating, as it does not accept my non-picklable output very_big_df dataframe, which I would have to share with some more sophistication that may cost me the time I am trying to save; I was hoping ray would handle something like that efficiently.
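
For illustration, here is one way the write could be made non-blocking: a sketch only, reusing the names from the snippet above (many_data_lists, process_group, outputfile) and a hypothetical write_df helper. The heavy computation runs in separate worker processes, so a single writer thread in the parent process mostly sits in disk I/O and barely competes with them for CPU.

from concurrent.futures import ThreadPoolExecutor

def write_df(df, filename):
    # Plain pandas write, executed in a background thread while the
    # next iteration's remote tasks are already running.
    df.to_parquet(filename)

writer = ThreadPoolExecutor(max_workers=1)
pending = None
for data_list in many_data_lists:
    df_list = ray.get([process_group.remote(data) for data in data_list])
    very_big_df = pd.concat(df_list)
    if pending is not None:
        pending.result()       # make sure the previous file is fully written
    pending = writer.submit(write_df, very_big_df, outputfile)
if pending is not None:
    pending.result()           # flush the last write before exiting
writer.shutdown()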

[UPDATE] For the sake of simplicity, I did not mention that there is a big shared variable among all the processes (which is why I had called it a parallel process, as well as concurrent writing of the file). My title question was edited as a result. So actually, there's this bit of code before the ray parallel jobs:

shared_array_id = ray.put(shared_array)
df_list = ray.get([process_group.remote(shared_array_id, data) for data in data_list])

Not sure though whether that makes it more like a "parallel" execution and not just concurrent operations.

[UPDATE 2] The shared array is a lookup table, i.e. read-only as far as the parallel workers are concerned.
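
For what it is worth, a minimal, self-contained sketch of that sharing pattern (with a toy lookup table and a toy process_group, not my real ones): the reference returned by ray.put() is what gets passed to the remote function, and numpy arrays arrive in the workers as read-only, zero-copy views of the shared-memory object store rather than per-worker copies.

import numpy as np
import ray

ray.init(num_cpus=60)

@ray.remote
def process_group(lookup, row):
    # `lookup` is a read-only numpy view backed by Ray's object store,
    # not a ~600 MB copy per worker.
    return float(lookup[row].sum())

shared_array = np.random.rand(10_000, 9)   # stand-in for the big lookup table
shared_array_id = ray.put(shared_array)    # put once, outside the loop
results = ray.get([process_group.remote(shared_array_id, i) for i in range(7)])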

[UPDATE 3] I tried both proposed solutions: threading, and Ray / compute(). For the latter, it was suggested to use the writing function as a remote and to send the writing operation asynchronously within the for loop, which I originally thought was only possible through .get(), which would be blocking.

So with Ray, this shows both solutions:

@ray.remote
def write_to_parquet(df_list, filename):
    df = pd.concat(df_list)
    df.to_parquet(filename, engine='pyarrow', compression=None)

# Shared array created outside the loop, read-only (big lookup table).
# About 600 MB
shared_array_id = ray.put(shared_array)

for data_list in many_data_lists:

   new_df_list = ray.get([process_group.remote(shared_array_id, data) for data in data_list])
   write_to_parquet.remote(new_df_list, my_filename)

   ## Using threading, one would remove the ray decorator:
   # write_thread = threading.Thread(target=write_to_parquet, args=(new_df_list, tinterval.left))
   # write_thread.start()

For the Ray solution, this required however increasing object_store_memory; the default was not enough: 10% of node memory ~ 37 GB (I have 376 GB of RAM), which is then capped at 20 GB, while the only objects stored total about 22 GB: one list of dataframes df_list (about 11 GB), and the result of their concatenation inside the writing function (another ~11 GB), assuming there is a copy during concatenation. If not, then this memory issue does not make sense, and I wonder if I could pass numpy views, which I thought was happening by default. This is a rather frustrating aspect of Ray, as I cannot really predict how much memory each df_list is going to take; it can vary from 1x to 3x...
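
For reference, the object store size can be set explicitly at startup instead of relying on the default cap (the 60 GB below is just an example figure, not a recommendation):

import ray

# Reserve ~60 GB of shared memory for the object store (value in bytes).
ray.init(num_cpus=60, object_store_memory=60 * 1024**3)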

In the end, sticking to multiprocessing with threading happens to be the most efficient solution, as the processing part (without I/O) is faster:

import threading
from multiprocessing import Pool

import pandas as pd

# Create the shared array in the parent process & exploit copy-on-write (fork) semantics
shared_array = create_lookup_table(my_inputs)

def process_group(my_data):
   # Process a new dataframe here using my_data and some other data inside shared_array
   ...
   return my_df

def write_to_parquet(df_list, filename):
   # Same writer as above, but without the ray decorator (runs in a thread)
   df = pd.concat(df_list)
   df.to_parquet(filename, engine='pyarrow', compression=None)

n_workers = 60
with Pool(processes=n_workers) as pool:
   for data_list in many_data_lists:
      # data_list contains thousands of elements. I choose a chunksize of 10
      df_list = pool.map(process_group, data_list, 10)
      write_thread = threading.Thread(target=write_to_parquet, args=(df_list, tinterval.left))
      write_thread.start()
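
One caveat with this pattern, not measured above but worth noting: if a write ever takes longer than the next iteration's processing, writer threads would pile up and several writes would hit the same disk at once. A small variation (a sketch using the same names as the block above) keeps a handle on the previous thread and joins it before launching the next write:

write_thread = None
with Pool(processes=n_workers) as pool:
   for data_list in many_data_lists:
      df_list = pool.map(process_group, data_list, 10)
      if write_thread is not None:
         write_thread.join()           # wait for the previous file to finish
      write_thread = threading.Thread(target=write_to_parquet, args=(df_list, tinterval.left))
      write_thread.start()
if write_thread is not None:
   write_thread.join()                 # flush the last file before exiting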

At each loop iteration, typically len(many_data_lists) = 7000 and each list contains 7 numpy arrays of size (3, 9092). So these 7000 lists are sent across the 60 workers:

Time for all parallel process_group per loop iteration:

Ray: 250 [s]

Multiprocessing: 233 [s]

I/O: it takes about 35 s for a 5 GB parquet file to be written to an external USB 3 spinning disk, and about 10 s to an internal spinning disk.

Ray: ~5 s overhead for creating the future with write_to_parquet.remote(), which blocks the loop. That is still 50% of the time it would take to write to the spinning disk. This is not ideal.

Multiprocessing: 0 s overhead measured.

Total wall times:

Ray: 486 [s]

Multiprocessing: 436 [s]

I iterated this a few times; the differences between Ray and multiprocessing consistently show multiprocessing being faster by ~50 s. This is a significant difference, which is also puzzling, as Ray advertises higher efficiency.

I will run this for a larger number of iterations and report back on stability (memory, potential garbage-collection issues, ...).

Have you considered assigning 1 core to a ray task that writes data into a file?

[UPDATE] Prototype

import ray
import pandas as pd
# from multiprocessing import Pool

ray.init(num_cpus=60)

@ray.remote
def write_to_parquet(data, filename):
    # Keep writing until it succeeds; record failed writes somewhere.
    # I assume failure to write is uncommon. You can probably just
    # ray.put() the failed data and have one background process that
    # keeps retrying the failed writes.
    data.to_parquet(filename)

# with Pool(processes=n_workers) as pool:
for data_list in many_data_lists:
   ##-----------------------
   ## With ray :
   df_list = ray.get([process_group.remote(data) for data in data_list])
   ##-----------------------
   ## With multiprocessing :
   #f_list = pool.map(process_group, list_of_indices_into_data_list)
   ##
   ##      data are both known from the parent process
   ##      and I use copy-on-write semantic to avoid having 60 copies.
   ##      All the function needs are a list of indices
   ##      of where to fetch slices of the read-only data.  
   ##
   very_big_df = pd.concat(df_list)
   ##-----------------------
   ## Write to file :

   write_to_parquet.remote(very_big_df, filename)
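
One possible refinement of this prototype (my own assumption, not something stated above): bound the number of writes in flight with ray.wait(), so that pending very_big_df objects cannot fill up the object store if the disk falls behind:

write_refs = []
for data_list in many_data_lists:
   df_list = ray.get([process_group.remote(data) for data in data_list])
   very_big_df = pd.concat(df_list)
   write_refs.append(write_to_parquet.remote(very_big_df, filename))
   # Allow at most 2 writes in flight; block only if we get ahead of the disk.
   if len(write_refs) > 2:
      _, write_refs = ray.wait(write_refs, num_returns=len(write_refs) - 2)
ray.get(write_refs)   # wait for the remaining writes before exiting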
