
Find the most common values in each row when the rows consist of lists

I have a pd.DataFrame in which one column contains lists as values. I want to create another column that contains only the most common value(s) from each list in that column. Example dataframe:

    col_1
0   [1, 2, 3, 3]
1   [2, 2, 8, 8, 7]
2   [3, 4]

And the expected dataframe is:

    col_1           col_2
0   [1, 2, 3, 3]    [3]
1   [2, 2, 8, 8, 7] [2, 8]
2   [3, 4]          [3, 4]

I tried:

from statistics import mode
df['col_1'].apply(lambda x: mode(x)) 

But it is showing the most common list in that column.

I also tried using the pandas mode function directly on that column, but that did not help either. Is there any way to find the most common value(s)?

Or just use multimode from the statistics module (available in Python 3.8 and later):

from statistics import multimode

df['col_2'] = df['col_1'].apply(lambda x: multimode(x))
             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
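
To show why multimode fits this question better than mode, here is a minimal sketch on one of the example lists. It assumes Python 3.8 or later, where statistics.mode returns the first mode it encounters instead of raising an error on ties.

```python
from statistics import mode, multimode

values = [2, 2, 8, 8, 7]

# mode returns a single value; on Python 3.8+ a tie resolves to the
# first mode encountered in the data.
single = mode(values)

# multimode returns every value that shares the highest count,
# in the order the values first appear.
tied = multimode(values)

print(single)  # 2
print(tied)    # [2, 8]
```

So mode silently drops the tied value 8, while multimode keeps both.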

Use Series.mode, but it is slow:

df['new'] = df['col_1'].apply(lambda x: pd.Series(x).mode().tolist()) 
print (df)
             col_1     new
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]

Or use statistics.multimode if performance is important:

from statistics import multimode

df['col_2'] = df['col_1'].apply(multimode) 
print (df)
             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
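
One difference between the two approaches that may matter: Series.mode returns the modes in sorted order, while multimode returns them in the order they first appear. The example data happens to give the same result either way, so the sketch below uses a made-up list (not from the question) where the two differ.

```python
import pandas as pd
from statistics import multimode

# Hypothetical list chosen so that the tied values are not already sorted.
values = [8, 8, 2, 2, 7]

# multimode keeps the modes in order of first appearance.
by_multimode = multimode(values)

# Series.mode returns the modes in sorted order.
by_series_mode = pd.Series(values).mode().tolist()

print(by_multimode)    # [8, 2]
print(by_series_mode)  # [2, 8]
```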

Performance:

#[3000 rows x 4 columns]
df = pd.concat([df] * 1000, ignore_index=True)

In [195]: %timeit (df['col_1'].explode().groupby(level=0).apply(lambda x: x.mode().tolist()))
537 ms ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [196]: %timeit df['col_1'].apply(lambda x: pd.Series(x).mode().tolist())
699 ms ± 77.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [197]: %timeit df['col_1'].apply(multimode)
13.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
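
The timings above use IPython's %timeit. Outside IPython, a rough comparison could be sketched with the standard timeit module as below. The function names and the number=3 repeat count are arbitrary choices for illustration, and the absolute numbers will vary by machine and library version.

```python
import timeit

import pandas as pd
from statistics import multimode

# Rebuild a 3000-row frame from the three example rows.
base = pd.DataFrame({'col_1': [[1, 2, 3, 3], [2, 2, 8, 8, 7], [3, 4]]})
big = pd.concat([base] * 1000, ignore_index=True)

def with_series_mode():
    # Builds one temporary Series per row, which is where the overhead comes from.
    return big['col_1'].apply(lambda x: pd.Series(x).mode().tolist())

def with_multimode():
    # Works directly on the Python lists.
    return big['col_1'].apply(multimode)

# number=3 is an arbitrary repeat count for a quick comparison.
t_series = timeit.timeit(with_series_mode, number=3)
t_multi = timeit.timeit(with_multimode, number=3)
print(f"Series.mode: {t_series:.3f} s, multimode: {t_multi:.3f} s")
```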

Using mode per group:

df['col_2'] = (df['col_1']
               .explode()
               .groupby(level=0)
               .apply(lambda x: x.mode().tolist())
              )

Output:

             col_1   col_2
0     [1, 2, 3, 3]     [3]
1  [2, 2, 8, 8, 7]  [2, 8]
2           [3, 4]  [3, 4]
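
To make the chain easier to follow, here is the same approach split into its two steps on the example data. This is only an illustrative breakdown; the intermediate variable names are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'col_1': [[1, 2, 3, 3], [2, 2, 8, 8, 7], [3, 4]]})

# Step 1: explode gives every list element its own row and repeats
# the original row label for each element.
exploded = df['col_1'].explode()
print(exploded.index.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]

# Step 2: grouping on index level 0 gathers the elements of each
# original row again, and mode is computed per group.
col_2 = exploded.groupby(level=0).apply(lambda x: x.mode().tolist())
print(col_2.tolist())  # [[3], [2, 8], [3, 4]]
```

Because the grouped result keeps the original row labels, assigning it back to df['col_2'] aligns on the index.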

Try this:

from collections import Counter

import pandas as pd

col_1 = [[1, 2, 3, 3],[2, 2, 8, 8, 7],[3, 4]]
df = pd.DataFrame({'col_1':col_1})

def common(row):
    # Count each value in the list, then keep the values with the highest count
    c = Counter(row)
    c = pd.Series(c)
    return c[c==c.max()].index.tolist()

df['col_2'] = df.col_1.map(common)

Output:

     col_1            col_2
0    [1, 2, 3, 3]     [3]
1    [2, 2, 8, 8, 7]  [2, 8]
2    [3, 4]           [3, 4]
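
The same Counter idea can also be sketched without the intermediate pd.Series, which avoids building a Series for every row. The helper name most_common_values is hypothetical, and returning an empty list for an empty input is an assumption, since the question does not cover that case.

```python
from collections import Counter

def most_common_values(values):
    """Return every value that shares the highest count, in first-seen order."""
    counts = Counter(values)
    if not counts:
        # Assumption: an empty list has no most common value.
        return []
    top = max(counts.values())
    # Counter preserves insertion order, so ties keep their original order.
    return [value for value, count in counts.items() if count == top]

print(most_common_values([1, 2, 3, 3]))     # [3]
print(most_common_values([2, 2, 8, 8, 7]))  # [2, 8]
print(most_common_values([3, 4]))           # [3, 4]
```

It can be applied per row in the same way as above, for example df['col_2'] = df['col_1'].map(most_common_values).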

