Efficient way of finding the maximum absolute value, for many columns

I have the following DataFrame:

import random
import pandas as pd
random.seed(2)

n_observations_per_user = 3
n_users = 3
n_dimensions = 2
ids = []
for i in range(n_users):
    ids += [i]*n_observations_per_user

data = {"id": ids}
for idim in range(n_dimensions):
    data[f"dim{idim}"] = [random.uniform(-10, 10) for i in range(n_observations_per_user*n_users)]
    
df = pd.DataFrame(data)
df

    id  dim0        dim1
0   0   9.120685    2.136035
1   0   8.956550    1.624080
2   0   -8.868973   -6.832343
3   1   -8.302560   -1.386607
4   1   6.709978    -2.129364
5   1   4.719400    4.460242
6   2   3.394608    9.896391
7   2   -3.837271   8.987909
8   2   2.118883    0.883541

I need to compute the maximum absolute value for each dimension (column), grouped by id. To do this, I use .agg() as follows:

# For each column in a group, return the element whose absolute value is largest (sign preserved)
abs_max_fun = lambda x: x[x.abs().idxmax()]
agg_dict_absmax = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_absmax[f"dim{idim}"] = abs_max_fun

df.groupby("id").agg(agg_dict_absmax)
    id  dim0        dim1
id          
0   0   9.120685    -6.832343
1   1   -8.302560   4.460242
2   2   -3.837271   9.896391

which is correct. However, when n_observations_per_user, n_users and n_dimensions become large, this method of aggregation becomes slow compared with, for example, max (without abs), as can be seen:

# Create a new, large df by re-running the construction above with:
n_observations_per_user = 100
n_users = 1000
n_dimensions = 100

ids = []
for i in range(n_users):
    ids += [i]*n_observations_per_user

data = {"id": ids}
for idim in range(n_dimensions):
    data[f"dim{idim}"] = [random.uniform(-10, 10) for i in range(n_observations_per_user*n_users)]

df = pd.DataFrame(data)

# Measure time for max-abs
import time
abs_max_fun = lambda x: x[x.abs().idxmax()]
agg_dict_absmax = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_absmax[f"dim{idim}"] = abs_max_fun

start = time.time()
df.groupby("id").agg(agg_dict_absmax)
end = time.time()
print(end - start)

Output: 27.204503297805786

In comparison with max:

import time
agg_dict_max = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_max[f"dim{idim}"] = "max"
    
start = time.time()
df.groupby("id").agg(agg_dict_max)
end = time.time()
print(end - start)

Output: 0.10446596145629883

My use case has an even larger DataFrame (more users), so I am looking for a way to make the max-abs aggregation faster, ideally as fast as max, unless the theoretical time complexity of finding the max absolute value prohibits this.

Any ideas how this could be done?

Rather than doing the (inefficient) absolute max value calculation for each group during the groupby, you can get the max and min values per group using the optimized built-in operations, and only then figure out which absolute value is higher.

import pandas as pd
import numpy as np

# Test frame: 1,000 value columns centred around 0, plus a "group" column with 400 groups.
n_rows = 1_000_000
n_cols = 1_000
df = pd.DataFrame(np.random.random((n_rows, n_cols)) - 0.5)
df["group"] = np.random.randint(0, 400, (n_rows))

# Fast built-in per-group reductions.
df_max = df.groupby("group").max()
df_min = df.groupby("group").min()

# For each cell, keep whichever of max/min has the larger absolute value,
# preserving its original sign.
df_absmax = pd.DataFrame(
    np.where(df_max > -df_min, df_max, df_min),
    index=df_max.index,
    columns=df_max.columns
)

The above example takes just over twice as long to run as df.groupby("group").max().
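Applied to the DataFrame from the question (an id column plus dim0, dim1, ... columns), a minimal sketch of the same idea might look like this (here id ends up as the index of the result, whereas the .agg version also kept it as a column):

import numpy as np
import pandas as pd

# Cheap built-in reductions per group; "id" becomes the index of the result.
grouped = df.groupby("id")
df_max = grouped.max()
df_min = grouped.min()

# Keep whichever of max/min has the larger magnitude, preserving its sign.
df_absmax = pd.DataFrame(
    np.where(df_max > -df_min, df_max, df_min),
    index=df_max.index,
    columns=df_max.columns,
)
print(df_absmax)

For the small example at the top of the question this reproduces the .agg(abs_max_fun) output shown earlier, apart from the redundant id column.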
