Pandas: get the three most common values for every column in a groupby

Question

I have a table like this:

  colour number letter
0 red    one    a
1 red    two    b
2 red    two    c
3 blue   two    a
4 blue   two    b
5 green  one    a
6 green  two    b
7 green  three  c

I created it with:

import pandas as pd

df = pd.DataFrame([
    ('red', 'one', 'a'),
    ('red', 'two', 'b'),
    ('red', 'two', 'c'),
    ('blue', 'two', 'a'),
    ('blue', 'two', 'b'),
    ('green', 'one', 'a'),
    ('green', 'two', 'b'),
    ('green', 'three', 'c')
], columns=['colour', 'number', 'letter'])

I want to group the table by colour and then, for every remaining column, get the three most common values. If a column has fewer than three unique values, the last one can be repeated or the slot can be NaN; either works. The output would look like:

       colour  red  blue  green  
number 1       two  two   one
       2       one  two   two
       3       one  two   three
letter 1       a    a     a
       2       b    b     b
       3       c    b     c

Or:

       colour  red  blue  green  
number 1       two  two   one
       2       one  NaN   two
       3       NaN  NaN   three
letter 1       a    a     a
       2       b    b     b
       3       c    NaN   c

I have already done this for a single column:

(
    df.groupby('colour').number
      .value_counts()
      .groupby(level=0)
      .head(3)
)

Output:

colour  number  
blue    two     2
green   one     1
        two     1
        three   1
red     two     2
        one     1

However, I would like to do this for all columns in my DataFrame and get output like the example above. I am completely stuck.

Answer 1

Try:

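import numpy as np
import pandas as pd


def fn(x):
    # Take up to three of the most frequent values and pad with NaN
    # so that every group returns exactly three rows.
    return pd.Series(
        (x.value_counts().index[:3].tolist() + [np.nan, np.nan])[:3],
        index=range(1, 4),
    )


# Every column from "number" onwards, i.e. everything except "colour".
cols = df.loc[:, "number":].columns

out = pd.concat(
    [
        # One block per column: ranks 1-3 as rows, colours as columns.
        # ffill() repeats the last value; drop it to keep NaN instead.
        df.groupby("colour")[col].apply(fn).unstack(level=0).ffill()
        for col in cols
    ],
    keys=cols,
)
print(out)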

Prints:

colour   blue  green  red
number 1  two  three  two
       2  two    two  one
       3  two    one  one
letter 1    b      b    b
       2    a      a    a
       3    a      c    c
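
To see what fn does in isolation, here is a minimal sketch on a single group (the red rows of the number column). The helper name top_n and its n parameter are my own and not part of the answer above; the padding trick is the same.

```python
import numpy as np
import pandas as pd


def top_n(values, n=3):
    """Hypothetical helper: the n most frequent values, padded with NaN."""
    # value_counts() sorts by frequency, so its index holds the values
    # from most to least common.
    most_common = values.value_counts().index[:n].tolist()
    # Append n NaNs and cut back to n items, so the result always has
    # exactly n entries even when there are fewer unique values.
    padded = (most_common + [np.nan] * n)[:n]
    return pd.Series(padded, index=range(1, n + 1))


# The "number" values of the red rows in the question's table.
red_numbers = pd.Series(["one", "two", "two"])

result = top_n(red_numbers)   # NaN in the unused slot
repeated = result.ffill()     # last value repeated instead of NaN
print(result)
print(repeated)
```

Calling .ffill() on the result gives the "repeat the last value" variant from the question; leaving it out keeps the NaN.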

Answer 2

Not pretty, but it gets the job done:

def analyze_col(df, col, grpby):
    top3: pd.Series = df.groupby(grpby)[col].value_counts().groupby(level=0).head(3)

    gg = pd.DataFrame({
        g[0]: g[1].index.get_level_values(1).to_series(index=range(1, len(g[1]) + 1)).reindex(range(1, 4))
        for g in top3.groupby(level=0)
    })

    return pd.concat({col: gg}, names=[grpby])


def analyze_df(df, grpby):
    return pd.concat([analyze_col(df, col, grpby) for col in df.columns if col != grpby])


print(analyze_df(df, 'colour'))

Output:

         blue  green  red
colour                   
number 1  two    one  two
       2  NaN  three  one
       3  NaN    two  NaN
letter 1    a      a    a
       2    b      b    b
       3  NaN      c    c
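
Note that the answers above list tied values in different orders (compare the green column in each output), and neither matches the order shown in the question. If a stable order matters, one option is to break ties by first appearance. The sketch below assumes that rule; top_n_first_seen and top_n_table are hypothetical names, and the group columns are kept in first-seen order as well.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ('red', 'one', 'a'),
    ('red', 'two', 'b'),
    ('red', 'two', 'c'),
    ('blue', 'two', 'a'),
    ('blue', 'two', 'b'),
    ('green', 'one', 'a'),
    ('green', 'two', 'b'),
    ('green', 'three', 'c'),
], columns=['colour', 'number', 'letter'])


def top_n_first_seen(values, n=3):
    """Hypothetical helper: n most frequent values, ties broken by first appearance."""
    # Map each value to the row positions where it occurs.
    positions = pd.Series(range(len(values)), index=values.to_numpy())
    # Per distinct value: how often it occurs and where it first occurs.
    stats = positions.groupby(level=0).agg(["count", "min"])
    # Most frequent first; among equal counts, earliest occurrence first.
    ordered = stats.sort_values(["count", "min"], ascending=[False, True])
    top = ordered.index[:n].tolist()
    # Pad with NaN so every group yields exactly n rows.
    return pd.Series((top + [np.nan] * n)[:n], index=range(1, n + 1))


def top_n_table(frame, by, n=3):
    """Hypothetical helper: one block of n rows per remaining column."""
    groups = frame[by].drop_duplicates().tolist()  # first-seen group order
    blocks = {}
    for col in frame.columns.drop(by):
        blocks[col] = pd.DataFrame(
            {g: top_n_first_seen(frame.loc[frame[by] == g, col], n) for g in groups}
        )
    # The dict keys become the outer level of the row index.
    return pd.concat(blocks)


table = top_n_table(df, "colour")
print(table)
```

Appending .groupby(level=0).ffill() to the table would repeat the last value within each block instead of leaving NaN, if that variant is preferred.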

Answer 3

# normalize=True returns relative frequencies rather than raw counts.
k = df.groupby(['colour', 'letter']).number.value_counts(normalize=True).groupby(level=0).head(3)

Output:

colour  letter  number
blue    a       two       1.0
        b       two       1.0
green   a       one       1.0
        b       two       1.0
        c       three     1.0
red     a       one       1.0
        b       two       1.0
        c       two       1.0
Name: number, dtype: float64

3 answers

Answer 1: score 1, accepted, 2021-04-25 23:07:41
Answer 2: score 1, 2021-04-25 23:56:07
Answer 3: score -1, 2021-04-25 22:25:21
