Efficient way of finding the maximum absolute value, for many columns

I have the following DataFrame:

import random
import pandas as pd
random.seed(2)

n_observations_per_user = 3
n_users = 3
n_dimensions = 2
ids = []
for i in range(n_users):
    ids += [i]*n_observations_per_user

data = {"id": ids}
for idim in range(n_dimensions):
    data[f"dim{idim}"] = [random.uniform(-10, 10) for i in range(n_observations_per_user*n_users)]
    
df = pd.DataFrame(data)
df

    id  dim0        dim1
0   0   9.120685    2.136035
1   0   8.956550    1.624080
2   0   -8.868973   -6.832343
3   1   -8.302560   -1.386607
4   1   6.709978    -2.129364
5   1   4.719400    4.460242
6   2   3.394608    9.896391
7   2   -3.837271   8.987909
8   2   2.118883    0.883541

I need to compute the maximum absolute value for each dimension (column), grouped by id. To do this, I use .agg() as follows:

# For each column in a group, return the element whose absolute value is largest (sign preserved)
abs_max_fun = lambda x: x[x.abs().idxmax()]
agg_dict_absmax = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_absmax[f"dim{idim}"] = abs_max_fun

df.groupby("id").agg(agg_dict_absmax)
    id  dim0        dim1
id          
0   0   9.120685    -6.832343
1   1   -8.302560   4.460242
2   2   -3.837271   9.896391

which is correct. However, when n_observations_per_user, n_users and n_dimensions become large, this method of aggregation becomes slow compared with, for example, max (without abs), as can be seen:

# Create a new, large df by re-running the construction above with:
n_observations_per_user = 100
n_users = 1000
n_dimensions = 100

ids = []
for i in range(n_users):
    ids += [i]*n_observations_per_user

data = {"id": ids}
for idim in range(n_dimensions):
    data[f"dim{idim}"] = [random.uniform(-10, 10) for i in range(n_observations_per_user*n_users)]

df = pd.DataFrame(data)

# Measure time for max-abs
import time
abs_max_fun = lambda x: x[x.abs().idxmax()]
agg_dict_absmax = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_absmax[f"dim{idim}"] = abs_max_fun

start = time.time()
df.groupby("id").agg(agg_dict_absmax)
end = time.time()
print(end - start)

Output: 27.204503297805786

In comparison with max:

import time
agg_dict_max = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_max[f"dim{idim}"] = "max"
    
start = time.time()
df.groupby("id").agg(agg_dict_max)
end = time.time()
print(end - start)

Output: 0.10446596145629883

My use case has an even larger DataFrame (more users), so I am looking for a way to make the max-abs aggregation faster, ideally as fast as max, unless the theoretical time complexity of finding the max absolute value prohibits this.

Any ideas how this could be done?

Rather than doing the (inefficient) absolute max value calculation for each group during the groupby, you can get the max and min values per group using the optimized built-in operations, and only then figure out which absolute value is higher.

import pandas as pd
import numpy as np

# Test frame: 1,000 value columns centred around 0, plus a "group" column with 400 groups.
n_rows = 1_000_000
n_cols = 1_000
df = pd.DataFrame(np.random.random((n_rows, n_cols)) - 0.5)
df["group"] = np.random.randint(0, 400, (n_rows))

# Fast built-in per-group reductions.
df_max = df.groupby("group").max()
df_min = df.groupby("group").min()

# For each cell, keep whichever of max/min has the larger absolute value,
# preserving its original sign.
df_absmax = pd.DataFrame(
    np.where(df_max > -df_min, df_max, df_min),
    index=df_max.index,
    columns=df_max.columns
)

The above example takes just over twice as long to run as df.groupby("group").max().
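Applied to the DataFrame from the question (an id column plus dim0, dim1, ... columns), a minimal sketch of the same idea might look like this (here id ends up as the index of the result, whereas the .agg version also kept it as a column):

import numpy as np
import pandas as pd

# Cheap built-in reductions per group; "id" becomes the index of the result.
grouped = df.groupby("id")
df_max = grouped.max()
df_min = grouped.min()

# Keep whichever of max/min has the larger magnitude, preserving its sign.
df_absmax = pd.DataFrame(
    np.where(df_max > -df_min, df_max, df_min),
    index=df_max.index,
    columns=df_max.columns,
)
print(df_absmax)

For the small example at the top of the question this reproduces the .agg(abs_max_fun) output shown earlier, apart from the redundant id column.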
