How to find the most frequent word in a column of a pandas DataFrame

import pandas as pd

df = pd.DataFrame({'City': ['Pune', 'Mumbai', 'Pune', 'Mumbai', 'Pune'],
                   'Name': ['John', 'Boby', 'John', 'Boby', 'Nicky'],
                   'Competition': ['Chess,Drawing,Chess', 'Table Tennis,Table Tennis,Chess,Carrom', 'Chess,Carrom',
                                   'Table Tennis,Chess,Chess,Chess', 'Carrom']})
     City   Name    Competition
0   Pune    John    Chess,Drawing,Chess
1   Mumbai  Boby    Table Tennis,Table Tennis,Chess,Carrom
2   Pune    John    Chess,Carrom
3   Mumbai  Boby    Table Tennis,Chess,Chess,Chess
4   Pune    Nicky   Carrom
Required output
    City    Name    Competition                                  Most Frequent
0   Pune    John    Chess,Drawing,Chess                            Chess
1   Mumbai  Boby    Table Tennis,Table Tennis,Chess,Carrom         Table Tennis
2   Pune    John    Chess,Carrom                                   Carrom,Chess
3   Mumbai  Boby    Table Tennis,Chess,Chess,Chess                 Chess
4   Pune    Nicky   Carrom                                         Carrom

If several words are tied for the highest count, list all of them; otherwise return just the most frequent word.

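First split the values in the column and use DataFrame.explode to put each word on its own row. Then Series.mode can be computed per original row, and all the top values joined: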

f = lambda x: ','.join(x.mode())
df['Most Frequent'] = (df.assign(Competition = df['Competition'].str.split(','))
                         .explode('Competition')
                         .groupby(level=0)['Competition'] 
                         .agg(f))
print(df)
     City   Name                             Competition Most Frequent
0    Pune   John                     Chess,Drawing,Chess         Chess
1  Mumbai   Boby  Table Tennis,Table Tennis,Chess,Carrom  Table Tennis
2    Pune   John                            Chess,Carrom  Carrom,Chess
3  Mumbai   Boby          Table Tennis,Chess,Chess,Chess         Chess
4    Pune  Nicky                                  Carrom        Carrom
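
To see why this works, and why row 2 comes out as "Carrom,Chess" rather than "Chess,Carrom", here is a minimal sketch of the intermediate steps on a two-row frame. The small frame and the variable names `exploded` and `modes` are made up for illustration:

```python
import pandas as pd

# Two sample rows only, to keep the intermediate result short.
df = pd.DataFrame({'Competition': ['Chess,Drawing,Chess', 'Chess,Carrom']})

# After split + explode there is one row per word, and the original
# row label is repeated in the index (0, 0, 0, 1, 1).
exploded = df['Competition'].str.split(',').explode()
print(exploded)

# Grouping on that index label brings the words of each original row
# back together. Series.mode returns every tied value, in sorted order.
modes = exploded.groupby(level=0).agg(lambda s: ','.join(s.mode()))
print(modes.tolist())
```

Because Series.mode sorts its result, tied words are joined alphabetically here, regardless of their order in the original string.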

Use statistics.multimode (Python 3.8+) inside apply. Note that the [:2] slice keeps at most two of the tied words:

import pandas as pd
from statistics import multimode

df = pd.DataFrame({'City': ['Pune', 'Mumbai', 'Pune', 'Mumbai', 'Pune'],
                   'Name': ['John', 'Boby', 'John', 'Boby', 'Nicky'],
                   'Competition': ['Chess,Drawing,Chess', 'Table Tennis,Table Tennis,Chess,Carrom', 'Chess,Carrom',
                                   'Table Tennis,Chess,Chess,Chess', 'Carrom']})

df["Most Frequent"] = df["Competition"].apply(lambda x: ",".join(multimode(x.split(","))[:2]))
print(df)

Output

     City   Name                             Competition Most Frequent
0    Pune   John                     Chess,Drawing,Chess         Chess
1  Mumbai   Boby  Table Tennis,Table Tennis,Chess,Carrom  Table Tennis
2    Pune   John                            Chess,Carrom  Chess,Carrom
3  Mumbai   Boby          Table Tennis,Chess,Chess,Chess         Chess
4    Pune  Nicky                                  Carrom        Carrom
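
The effect of the [:2] slice only shows up when more than two words are tied, which does not happen in the sample data. A small sketch with a made-up three-way tie (the input string is an assumption, not taken from the question):

```python
from statistics import multimode

# Hypothetical cell value in which all three words occur exactly once.
words = "Chess,Carrom,Drawing".split(",")

# multimode returns every tied value, in the order first encountered.
all_tied = multimode(words)

# The slice used in the answer keeps only the first two of them.
first_two = all_tied[:2]

print(",".join(all_tied))   # all tied words
print(",".join(first_two))  # truncated to two
```

If every tied word should be listed, dropping the slice is probably what you want.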

Here is a simple and complete solution using Counter.

from collections import Counter

def keywithmaxval(d):
    itemMaxValue = max(d.values())
    return ','.join([k for k, v in d.items() if v == itemMaxValue])

df["Most Frequent"] = df['Competition'].str.split(',').apply(Counter).apply(keywithmaxval)

This gives us:

df
     City   Name                             Competition Most Frequent
0    Pune   John                     Chess,Drawing,Chess         Chess
1  Mumbai   Boby  Table Tennis,Table Tennis,Chess,Carrom  Table Tennis
2    Pune   John                            Chess,Carrom  Chess,Carrom
3  Mumbai   Boby          Table Tennis,Chess,Chess,Chess         Chess
4    Pune  Nicky                                  Carrom        Carrom
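
None of the answers above deal with missing values or with spaces after the commas. If your real data has either, a variant along these lines may help. This is only a sketch: the helper name `most_frequent`, its `sep` parameter, and the choice to return an empty string for missing cells are all assumptions, not part of the original answer.

```python
from collections import Counter

import pandas as pd


def most_frequent(cell, sep=","):
    """Hypothetical helper: join all words tied for the highest count."""
    # Missing values (None / NaN) are not strings; return "" for them.
    if not isinstance(cell, str):
        return ""
    # Strip surrounding whitespace and skip empty pieces.
    words = [w.strip() for w in cell.split(sep) if w.strip()]
    if not words:
        return ""
    counts = Counter(words)
    top = max(counts.values())
    # Counter keeps insertion order, so ties appear in first-seen order.
    return sep.join(w for w, c in counts.items() if c == top)


# Made-up sample with a stray space and a missing value.
s = pd.Series(["Chess, Chess,Carrom", None, "Carrom,Chess"])
result = s.apply(most_frequent)
print(result.tolist())
```

Usage on the question's frame would be df["Most Frequent"] = df["Competition"].apply(most_frequent).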
