简体   繁体   中英

MultiIndex pandas DataFrame - how to add "Other" column from remaining rows

I have a 2-level index DataFrame and I'd like to leave only 2 top rows for second level and add a row "Other" for the rest of rows. This is my DataFrame:

import pandas as pd

df = pd.DataFrame()
df["Idx1"] = ["A", "A", "A", "A", "B", "B", "B", "B"]
df["Idx2"] = ["X", "Y", "Z", "Q", "X", "Y", "Z", "Q"]
df["Values"] = [1,2,3,4, 1,2,3,4]
df = df.set_index(["Idx1", "Idx2"])
df

and this is how I can extract top 2 rows for first level:

res = df.groupby(level = 0).head(2)
res

However I have problem with adding "Other" column. My desired output would be:

enter image description here

How to do it in the most elegant way?

One option using groupby.apply to get andconcat the first two values with head , and the sum of the rest:

(df.groupby(level='Idx1')
   .apply(lambda d: pd.concat([d.droplevel(0).head(2),
                               d.droplevel(0).tail(-2).sum().to_frame(name='Other').T,
                               ]).rename_axis('Idx2'))
)

Output:

            Values
Idx1 Idx2         
A    X           1
     Y           2
     Other       7
B    X           1
     Y           2
     Other       7

I think you can convert second level of MultiIndex by Other by counter by GroupBy.cumcount and then aggregate sum :

idx = df.index.get_level_values(1).where(df.groupby(level = 0).cumcount().lt(2), 'Other')

df = df.groupby(['Idx1',idx], sort=False).sum()
print (df)
            Values
Idx1 Idx2         
A    X           1
     Y           2
     Other       7
B    X           1
     Y           2
     Other       7

Performance for 1k rows in 10k rows of data:

np.random.seed(2023)

N = 10000
L = list('abcdefghijklmno')

df = pd.DataFrame({'Idx1': np.random.randint(1000, size=N),
                   'Idx2': np.random.choice(L, size=N),
                   'Values':np.random.randint(1000, size=N)}).set_index(["Idx1", "Idx2"]).sort_index()
print (df)


In [230]: %%timeit
     ...: idx = df.index.get_level_values(1).where(df.groupby(level = 0).cumcount().lt(2), 'Other')
     ...: 
     ...: df.groupby(['Idx1',idx], sort=False).sum()
     ...: 
5.62 ms ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another solution from mozway is 382 times slowier:

In [231]: %%timeit
     ...: (df.groupby(level='Idx1')
     ...:    .apply(lambda d: pd.concat([d.droplevel(0).head(2),
     ...:                                d.droplevel(0).tail(-2).sum().to_frame(name='Other').T,
     ...:                                ]).rename_axis('Idx2'))
     ...: )
     ...: 
2.15 s ± 139 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM