
Writing files concurrently with other CPU-bound tasks with multiprocessing or ray

I have a workstation with 72 cores (actually 36 multithreaded cores, reported as 72 by multiprocessing.cpu_count()).

I tried both multiprocessing and ray for concurrent processing of batches of millions of small files, and I would like to write some output files concurrently during that processing.

I am confused by the blocking behaviour of the .get() methods, e.g. the .get() associated with apply_async() (in multiprocessing) and ray.get().
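
For reference, both APIs block at the same point; a minimal illustration using the names from the code below:

# multiprocessing: apply_async() returns an AsyncResult immediately,
# but .get() on that result blocks until the worker is done
async_result = pool.apply_async(process_group, (data,))
df = async_result.get()

# ray: .remote() returns an ObjectRef (a future) immediately,
# but ray.get() blocks until the object is available
ref = process_group.remote(data)
df = ray.get(ref)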

With ray, I have a remote function, process_group(), that processes groups of data in parallel within a loop. In what follows, the version of the code using the multiprocessing module is also given as comments.

import ray
import pandas as pd
# from multiprocessing import Pool

ray.init(num_cpus=60)
# with Pool(processes=n_workers) as pool:
for data_list in many_data_lists:
   ##-----------------------
   ## With ray :
   df_list = ray.get([process_group.remote(data) for data in data_list])
   ##-----------------------
   ## With multiprocessing :
   #df_list = pool.map(process_group, list_of_indices_into_data_list)
   ##
   ##      The data are known to the parent process and I use
   ##      copy-on-write semantics to avoid having 60 copies.
   ##      All the function needs is a list of indices
   ##      of where to fetch slices of the read-only data.
   ##
   very_big_df = pd.concat(df_list)
   ##-----------------------
   ## Write to file :
   very_big_df.to_parquet(outputfile)

So in each loop iteration I have to collect the output of the many process_group() calls, which are computed concurrently, as a list of dataframes df_list for concatenation into one bigger very_big_df dataframe. The latter needs to be written to disk (typical sizes are ~1 to ~3 GB). Writing one such file takes about 10-30 [s], while it takes about 180 [s] for the process_group remotes to be processed. There are thousands of loop iterations, so the whole run will take several days to complete.

Is it possible to have the file written to disk in a non-blocking manner, while the loop continues in order to save about 10% of the time (that would save about one day of computation)?

By the time the concurrent processes of the next loop iteration finish, there is enough time for the output of the previous iteration to be written. The cores involved all appear to run at near 100%, so the threading module is probably not recommended either. multiprocessing.apply_async() is even more frustrating, as it does not want my non-picklable output very_big_df dataframe, which I would have to share with some more sophistication that might cost me the time I am trying to save; I was hoping ray would handle something like that efficiently.
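
For context, the apply_async() pattern looks roughly like this (a sketch only; write_df and io_pool are illustrative names, and the DataFrame still has to be pickled and shipped to the writer process, which is the cost discussed above):

from multiprocessing import Pool

def write_df(df, path):
    # illustrative helper: a plain top-level function so it can be pickled
    df.to_parquet(path)

io_pool = Pool(processes=1)   # dedicated writer process
# apply_async() itself returns immediately; the blocking cost is the
# pickling/transfer of very_big_df to the writer process
io_pool.apply_async(write_df, (very_big_df, outputfile))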

[UPDATE] For the sake of simplicity, I did not mention that there is a big variable shared among all the processes (which is why I had called it parallel processing, as well as concurrent writing of the file). My question title was edited as a result. So, actually, there is this bit of code before the ray parallel jobs:

shared_array_id = ray.put(shared_array)
df_list = ray.get([process_group.remote(shared_array_id, data) for data in data_list])

I am not sure, though, whether that makes it more like "parallel" execution rather than just concurrent operations.

[UPDATE 2] The shared array is a lookup table, i.e. read-only as far as the parallel workers are concerned.

[UPDATE 3] I tried both proposed solutions: threading, and a Ray remote writing task. For the latter, it was suggested to use the writing function as a remote and launch the writing operation asynchronously within the for loop, which I originally thought was only possible through .get(), which would be blocking.

So, with Ray, the following shows both solutions (the threading variant is given in the comments):

@ray.remote
def write_to_parquet(df_list, filename):
    df = pd.concat(df_list)
    df.to_parquet(filename, engine='pyarrow', compression=None)

# Shared array created outside the loop, read-only (big lookup table).
# About 600 MB
shared_array_id = ray.put(shared_array)

for data_list in many_data_lists:

   new_df_list = ray.get([process_group.remote(shared_array_id, data) for data in data_list])
   write_to_parquet.remote(new_df_list, my_filename)

   ## Using threading, one would remove the ray decorator:
   # write_thread = threading.Thread(target=write_to_parquet, args=(new_df_list, tinterval.left))
   # write_thread.start()

For the Ray solution, however, this required increasing object_store_memory; the default was not enough. The default is 10% of node memory, ~37 GB (I have 376 GB of RAM), which is then capped at 20 GB, while the only objects stored total about 22 GB: one list of dataframes, df_list (about 11 GB), and the result of their concatenation inside the writing function (another ~11 GB), assuming there is a copy during concatenation. If not, then this memory issue does not make sense, and I wonder whether I could pass numpy views, which I thought was happening by default. This is a rather frustrating aspect of Ray, as I cannot really predict how much memory each df_list is going to take; it can vary by a factor of 1x to 3x...
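
For reference, the object store size can be set explicitly when initialising Ray; a minimal sketch (the 50 GB value is only an example, not a recommendation):

import ray

# object_store_memory is given in bytes
ray.init(num_cpus=60, object_store_memory=50 * 1024**3)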

In the end, sticking to multiprocessing combined with threading happens to be the most efficient solution, as the processing part (without I/O) is also faster:

from multiprocessing import Pool
import threading

# Create the shared array in the parent process & exploit copy-on-write (fork) semantics
shared_array = create_lookup_table(my_inputs)

def process_group(my_data):
   # Process a new dataframe here using my_data and some other data inside shared_array
   ...
   return my_df


n_workers = 60
with Pool(processes=n_workers) as pool:
   for data_list in many_data_lists:
      # data_list contains thousands of elements. I choose a chunksize of 10
      df_list = pool.map(process_group, data_list, 10)
      write_thread = threading.Thread(target=write_to_parquet, args=(df_list, tinterval.left))
      write_thread.start()
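
For completeness, the write_to_parquet used by the thread is simply the plain (non-remote) version of the function from the Ray snippet above (a sketch):

import pandas as pd

def write_to_parquet(df_list, filename):
    # same body as the Ray remote above, just without the decorator
    df = pd.concat(df_list)
    df.to_parquet(filename, engine='pyarrow', compression=None)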

At each loop iteration, data_list typically contains 7000 elements, each a list of 7 numpy arrays of shape (3, 9092), so these 7000 lists are spread across the 60 workers (assuming float64, that is roughly 7000 × 7 × 3 × 9092 × 8 bytes ≈ 10.7 GB of raw data per iteration, consistent with the ~11 GB df_list mentioned above):

Time for all parallel process_group calls per loop iteration:

Ray: 250 [s]

Multiprocessing: 233 [s]

I/O: it takes about 35 [s] for a 5 GB parquet file to be written to an external USB 3 spinning disk, and about 10 [s] on an internal spinning disk.

Ray: ~5 [s] of overhead for creating the future with write_to_parquet.remote(), which blocks the loop. That is still 50% of the time it would take to write to the internal spinning disk. This is not ideal.

Multiprocessing: 0 [s] of overhead measured.

Total wall times:

Ray: 486 [s]

Multiprocessing: 436 [s]

I iterated this a few times; the differences between Ray and multiprocessing consistently show multiprocessing to be faster by ~50 [s]. This is a significant difference, and also puzzling, as Ray advertises higher efficiency.

I will run this for a larger number of iterations and report back on stability (memory, potential garbage collection issues, ...).
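
Regarding the ~5 [s] overhead of write_to_parquet.remote(): a plausible explanation is that .remote() has to serialise the already-fetched df_list (about 11 GB) back into the object store. A possible variant, not benchmarked here, is to hand the writer task the ObjectRefs produced by process_group.remote() instead of the fetched dataframes, so the driver never copies the data back; a hedged sketch:

@ray.remote
def write_refs_to_parquet(refs, filename):
    # ObjectRefs nested inside a list are not resolved automatically,
    # so the writer task fetches them from the object store itself
    df = pd.concat(ray.get(refs))
    df.to_parquet(filename, engine='pyarrow', compression=None)

pending = []
for data_list in many_data_lists:
    refs = [process_group.remote(shared_array_id, data) for data in data_list]
    pending.append(write_refs_to_parquet.remote(refs, my_filename))
    # without a ray.get() in the loop there is no backpressure, so wait
    # for older writes before letting too many iterations pile up
    if len(pending) > 2:
        _, pending = ray.wait(pending, num_returns=len(pending) - 2)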

Have you considered assigning 1 core to a Ray task that writes the data into a file?
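
A Ray task requests 1 CPU by default; the reservation can also be made explicit through the decorator, a minimal illustration:

# explicitly reserve one CPU for the writer task (1 is also the default)
@ray.remote(num_cpus=1)
def write_to_parquet(data, filename):
    ...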

[UPDATE] Prototype

import ray
import pandas as pd
# from multiprocessing import Pool

ray.init(num_cpus=60)

@ray.remote
def write_to_parquet(data, filename):
    # Write until it succeeds; record failed writes somewhere.
    # I assume failure to write is uncommon. You can probably just
    # ray.put() the failed payload and have one background process
    # that keeps retrying the failed writes.
    data.to_parquet(filename)

# with Pool(processes=n_workers) as pool:
for data_list in many_data_lists:
   ##-----------------------
   ## With ray :
   df_list = ray.get([process_group.remote(data) for data in data_list])
   ##-----------------------
   ## With multiprocessing :
   #df_list = pool.map(process_group, list_of_indices_into_data_list)
   ##
   ##      The data are known to the parent process and I use
   ##      copy-on-write semantics to avoid having 60 copies.
   ##      All the function needs is a list of indices
   ##      of where to fetch slices of the read-only data.
   ##
   very_big_df = pd.concat(df_list)
   ##-----------------------
   ## Write to file :

   write_to_parquet.remote(very_big_df, filename)
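
One detail worth adding (not part of the original prototype): keep the ObjectRefs returned by write_to_parquet.remote() and wait on them once at the end, so the driver does not exit while writes are still in flight:

write_refs = []
for data_list in many_data_lists:
    ...
    write_refs.append(write_to_parquet.remote(very_big_df, filename))

# block only once, at the very end, to make sure every file is on disk
ray.get(write_refs)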
