Pandas get three most common values for every column in groupby

I have a table like this:

  colour number letter
0 red    one    a
1 red    two    b
2 red    two    c
3 blue   two    a
4 blue   two    b
5 green  one    a
6 green  two    b
7 green  three  c

Which I made with:

import pandas as pd

df = pd.DataFrame([
    ('red', 'one', 'a'),
    ('red', 'two', 'b'),
    ('red', 'two', 'c'),
    ('blue', 'two', 'a'),
    ('blue', 'two', 'b'),
    ('green', 'one', 'a'),
    ('green', 'two', 'b'),
    ('green', 'three', 'c')
], columns=['colour', 'number', 'letter'])

I want to group the table by colour and then, for every remaining column, get the three most common values. If a column has fewer than three unique values, the last value can be repeated or the slot can be NaN; either works. The output would look like:

       colour  red  blue  green  
number 1       two  two   one
       2       one  two   two
       3       one  two   three
letter 1       a    a     a
       2       b    b     b
       3       c    b     c

Or:

       colour  red  blue  green  
number 1       two  two   one
       2       one  NaN   two
       3       NaN  NaN   three
letter 1       a    a     a
       2       b    b     b
       3       c    NaN   c
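As a small illustration of how the two acceptable outputs relate (a sketch using hand-typed values for the number rows of red and blue, not code from the original post): forward filling the NaN variant down the rows gives the repeated-value variant.

```python
import numpy as np
import pandas as pd

# Hand-typed excerpt of the NaN variant (number rows for red and blue only).
with_nan = pd.DataFrame(
    {"red": ["two", "one", np.nan], "blue": ["two", np.nan, np.nan]},
    index=[1, 2, 3],
)

# Forward fill copies the last seen value down each column.
repeated = with_nan.ffill()
print(repeated)
```

So a solution only needs to produce the NaN variant; the repeated variant is one `ffill()` away.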

I have already done this for a single column:

(
    df.groupby('colour').number
      .value_counts()
      .groupby(level=0)
      .head(3)
)

Output:

colour  number  
blue    two     2
green   one     1
        two     1
        three   1
red     two     2
        one     1

However, I would like to do this for all columns in my dataframe and get an output like the example. I am completely stuck.

Try:

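import numpy as np
import pandas as pd


def fn(x):
    # Take up to three of the most frequent values, then pad with NaN to length 3.
    return pd.Series(
        (x.value_counts().index[:3].tolist() + [np.nan, np.nan])[:3],
        index=range(1, 4),
    )


# Build one block per non-grouping column and stack the blocks under the column name.
out = pd.concat(
    [
        # unstack moves colour into the columns; ffill repeats the last value downwards
        df.groupby("colour")[col].apply(fn).unstack(level=0).ffill()
        for col in df.loc[:, "number":]
    ],
    keys=df.loc[:, "number":],
)
print(out)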

Prints:

colour   blue  green  red
number 1  two  three  two
       2  two    two  one
       3  two    one  one
letter 1    b      b    b
       2    a      a    a
       3    a      c    c
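The padding step inside `fn` is easier to see in isolation. The sketch below applies the same idea to a single Series; the name `top3_padded` is only illustrative and does not appear in the answer.

```python
import numpy as np
import pandas as pd


def top3_padded(values):
    """Return up to three most common values, padded with NaN to length 3."""
    top = values.value_counts().index[:3].tolist()
    # Appending two NaN and slicing to 3 guarantees length 3 for any non-empty input.
    return pd.Series((top + [np.nan, np.nan])[:3], index=range(1, 4))


# Only one distinct value: positions 2 and 3 are padded with NaN.
few = top3_padded(pd.Series(["two", "two"]))

# More than three distinct values: only the three most common are kept.
many = top3_padded(pd.Series(list("aabbbcd")))

print(few)
print(many)
```

When several values have the same count, their relative order depends on how `value_counts` breaks ties, so the result can differ from the order in which the values appear in the data.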

Not pretty, but I got it done:

def analyze_col(df, col, grpby):
    top3: pd.Series = df.groupby(grpby)[col].value_counts().groupby(level=0).head(3)

    gg = pd.DataFrame({
        g[0]: g[1].index.get_level_values(1).to_series(index=range(1, len(g[1]) + 1)).reindex(range(1, 4))
        for g in top3.groupby(level=0)
    })

    return pd.concat({col: gg}, names=[grpby])


def analyze_df(df, grpby):
    return pd.concat([analyze_col(df, col, grpby) for col in df.columns if col != grpby])


print(analyze_df(df, 'colour'))

Output:

         blue  green  red
colour                   
number 1  two    one  two
       2  NaN  three  one
       3  NaN    two  NaN
letter 1    a      a    a
       2    b      b    b
       3  NaN      c    c

k = df.groupby(['colour', 'letter']).number.value_counts(normalize=True).groupby(level=0).head(3)

Output:
colour  letter  number
blue    a       two       1.0
        b       two       1.0
green   a       one       1.0
        b       two       1.0
        c       three     1.0
red     a       one       1.0
        b       two       1.0
        c       two       1.0
Name: number, dtype: float64
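
For comparison, here is a sketch of a different technique from the ones above: reshape to long form with `melt`, rank values by frequency within each (colour, column) group, and pivot back. It is not taken from any of the answers; the function name `top_n_per_group` and the intermediate column names `column`, `value`, `count` and `rank` are assumptions made for illustration, and `pivot` with a list as `index` assumes pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame([
    ('red', 'one', 'a'),
    ('red', 'two', 'b'),
    ('red', 'two', 'c'),
    ('blue', 'two', 'a'),
    ('blue', 'two', 'b'),
    ('green', 'one', 'a'),
    ('green', 'two', 'b'),
    ('green', 'three', 'c'),
], columns=['colour', 'number', 'letter'])


def top_n_per_group(frame, by, n=3):
    # Long form: one row per (group, original column, value).
    long = frame.melt(id_vars=by, var_name="column", value_name="value")
    counts = (
        long.groupby([by, "column"])["value"]
        .value_counts()
        .rename("count")  # consistent name across pandas versions
        .reset_index()
        .sort_values([by, "column", "count"], ascending=[True, True, False])
    )
    # Rank 1 is the most frequent value within each (group, column) pair.
    counts["rank"] = counts.groupby([by, "column"]).cumcount() + 1
    top = counts[counts["rank"] <= n]
    # Ranks with no value for a given group come out as NaN after the pivot.
    return top.pivot(index=["column", "rank"], columns=by, values="value")


out = top_n_per_group(df, "colour")
print(out)
```

This produces the NaN variant of the requested table; tied values may appear in any order within their group.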
