
Parallelizing operations for pandas dataframe

I have a very large pandas dataframe similar to the following:

╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ 0      ║ user_1      ║    304 ║
║ 1      ║ user_2      ║    299 ║
║ ...    ║             ║        ║
║ 399999 ║ user_400000 ║    542 ║
╚════════╩═════════════╩════════╝

(There are a few more columns needed for some calculations.)

So, for every client I have to apply lots and lots of operations (shifts, sums, subtractions, conditions, etc.), so I believe it is impossible to do everything with boolean masking (I have already tried). My question is whether it is possible to divide the pandas dataframe into chunks as follows, e.g.:

# chunk 1
╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ 0      ║ user_1      ║    304 ║
║ 1      ║ user_2      ║    299 ║
║ ...    ║             ║        ║
║ 19999  ║ user_20000  ║    432 ║
╚════════╩═════════════╩════════╝

# chunk 2
╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ 20000  ║ user_20000  ║    199 ║
║ 20001  ║ user_20001  ║    412 ║
║ ...    ║             ║        ║
║ 39999  ║ user_40000  ║    725 ║
╚════════╩═════════════╩════════╝

# chunk K 
╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ ...    ║ user_...    ║    ... ║
║ ...    ║ user_...    ║    ... ║
║ ...    ║             ║        ║
║ ...    ║ user_...    ║    ... ║
╚════════╩═════════════╩════════╝

And apply all the operations to all those chunks in parallel.

You can use a multiprocessing pool to accomplish some of those tasks. However, multiprocessing is also an expensive operation, so you need to test whether parallelizing is actually faster; it depends on the kind of functions you are running and on the data. For example, I create a sample df:

import pandas as pd
import numpy as np
from random import randint
from multiprocessing import Pool, cpu_count
from timeit import timeit


def f(df: pd.DataFrame):
    # A few arbitrary row-wise operations to stand in for the real per-user work
    df['Something'] = df['Users'].apply(lambda name: len(name))
    df['Other stuff'] = df['Income'].apply(lambda income: 'Senior' if income > 200 else 'Junior')
    df['Some other stuff'] = df['Users'].apply(lambda name: name.count('1'))
    return df


if __name__ == '__main__':
    samples = 5000000
    df = pd.DataFrame(
        [
            ['user_' + str(i), randint(0, 500)] for i in range(1, samples)
        ], columns=['Users', 'Income']
    )

If we time this f function with multiprocessing, I get 38.189200899999996 seconds on my old laptop:

    parallelized = timeit("""
cores = cpu_count()
df_in_chunks = np.array_split(df, cores)
pool = Pool(cores)
result_df = pd.concat(pool.map(f, df_in_chunks))
pool.close()
pool.join()
    """, 
    "from __main__ import pd, np, df, Pool, cpu_count, f", 
    number=5
    )
    print(parallelized)

Running the same f without multiprocessing, I get 25.0754394 seconds, so in this case the overhead of multiprocessing is bigger than what it saves and the single-core run is faster:

    not_parallelized = timeit("""
result_df = f(df)
    """, 
    "from __main__ import pd, df, f", 
    number=5
    )
    print(not_parallelized)

However, if we add more complexity to the f function, there is a point where broadcasting the df to each process becomes cheaper than running the whole thing on a single core.
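
As a rough illustration of where that crossover can happen, here is a minimal sketch with a hypothetical heavier function (the name f_heavy, its workload, and the smaller sample size are all made up for this example); timed the same way as above, the parallel version should eventually win on a multi-core machine:

import pandas as pd
import numpy as np
from random import randint
from multiprocessing import Pool, cpu_count
from timeit import timeit


def f_heavy(df: pd.DataFrame):
    # CPU-heavy, Python-level work per row, so the computation dominates
    # the cost of pickling the chunks and collecting the results
    df['Something'] = df['Users'].apply(lambda name: sum(ord(c) for c in name) % 97)
    df['Other stuff'] = df['Income'].apply(lambda income: sum(i * i for i in range(income)))
    df['Some other stuff'] = df['Users'].apply(lambda name: name[::-1].count('1'))
    return df


if __name__ == '__main__':
    samples = 500000  # fewer rows, but much more work per row
    df = pd.DataFrame(
        [
            ['user_' + str(i), randint(0, 500)] for i in range(1, samples)
        ], columns=['Users', 'Income']
    )

    parallelized = timeit("""
cores = cpu_count()
df_in_chunks = np.array_split(df, cores)
pool = Pool(cores)
result_df = pd.concat(pool.map(f_heavy, df_in_chunks))
pool.close()
pool.join()
    """,
    "from __main__ import pd, np, df, Pool, cpu_count, f_heavy",
    number=5
    )

    not_parallelized = timeit(
        "result_df = f_heavy(df)",
        "from __main__ import pd, df, f_heavy",
        number=5
    )
    print(parallelized, not_parallelized)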

From my knowledge, the pandas GroupBy split-apply-combine pattern may address your issue: divide your DataFrame into multiple chunks (groups), and then apply a function to each chunk (group). We can talk code if you have any further issues.
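
For example, a minimal sketch of that split-apply-combine pattern (the column names and the per-group calculation are made up for illustration, with repeated users so that each group has more than one row):

import pandas as pd

df = pd.DataFrame({
    'Users': ['user_1', 'user_1', 'user_2', 'user_2'],
    'Income': [304, 299, 199, 412],
})


def per_user(group: pd.DataFrame) -> pd.DataFrame:
    # Example per-group operations: shift the income within the group and
    # take a running difference, the kind of per-client logic described above
    group = group.copy()
    group['prev_income'] = group['Income'].shift(1)
    group['income_diff'] = group['Income'] - group['prev_income']
    return group


result = df.groupby('Users', group_keys=False).apply(per_user)
print(result)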

Hope this helps!
