Counting unique words in a pandas column

Question

I'm having some difficulty with the following data (from a pandas DataFrame):

Text
0   Selected moments from Fifa game t...
1   What I learned is that I am ...
3   Bill Gates kept telling us it was comi...
5   scenario created a month before the...
... ...
1899    Events for May 19 – October 7 - October CTOvision.com
1900    Office of Event Services and Campus Center Ope...
1901    How the CARES Act May Affect Gift Planning in ...
1902    City of Rohnert Park: Home
1903    iHeartMedia, Inc.

I need to extract the number of unique words in each row (after removing punctuation). So, for example:

I tried this:

df["Unique"]=df['Text'].str.lower()
df["Unique"]==Counter(word_tokenize('\n'.join( file["Unique"])))

But I'm not getting any counts, only a list of words (without their frequency in that row).

Can you tell me what's wrong?

Answer 1

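If you don't need per-word frequencies, first remove all punctuation, then make use of sets. str.split().map(set) gives you one set of words per row; count the elements in that set. A set keeps only one copy of each element, so repeated words are counted once.

Chained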

df['Text'].str.replace(r'[^\w\s]+', '', regex=True).str.split().map(set).str.len()

Step by step

df['Text'] = df['Text'].str.replace(r'[^\w\s]+', '', regex=True)
df['New Text'] = df['Text'].str.split().map(set).str.len()
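
To check the set-based approach end to end, here is a minimal sketch on a small made-up DataFrame. The sample strings are assumptions, and the lowercasing step is added only because the question's attempt lowercased the text first. The pattern is passed with regex=True explicitly, since pandas 2.0 and later treat it as a literal string by default.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real DataFrame.
df = pd.DataFrame({"Text": ["Bill Gates, Bill Gates and Larry!", "Home sweet home."]})

cleaned = (
    df["Text"]
    .str.lower()                               # so "Home" and "home" count as one word
    .str.replace(r"[^\w\s]+", "", regex=True)  # drop punctuation
)

# One set of words per row, then the size of each set.
df["Unique"] = cleaned.str.split().map(set).map(len)
print(df["Unique"].tolist())  # -> [4, 2]
```

If case should matter for your data, drop the str.lower() call; the rest stays the same.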

Answer 2

So, I'm just updating this based on the comments. This solution also takes punctuation into account.

import string
df['Unique'] = df['Text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation))).str.split().apply(lambda words: len(set(words)))
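
With a str.translate-based cleanup, two details affect the result: whether you split on a literal space or on any whitespace, and whether you take the length of the word list or of a set built from it. The sketch below contrasts them on one made-up string (the sample text and the variable names are assumptions for illustration).

```python
import string

import pandas as pd

# Hypothetical sample row with a repeated word and a double space.
s = pd.Series(["Bill Gates, again.  Bill Gates"])

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
no_punct = s.map(lambda text: text.translate(table))

total_words = no_punct.str.split().map(len)                             # every token
unique_words = no_punct.str.split().map(lambda words: len(set(words)))  # distinct tokens
with_empty = no_punct.str.split(" ").map(len)                           # split(" ") keeps "" tokens

print(total_words.tolist(), unique_words.tolist(), with_empty.tolist())  # -> [5] [3] [6]
```

Note that string.punctuation covers ASCII punctuation only, so a character such as the en dash in "May 19 – October 7" from the question's data would survive as its own token; the regex `[^\w\s]+` used in the other answers removes it.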

Answer 3

Try this

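from collections import Counter

import pandas as pd

# Sample data; named `data` to avoid shadowing the built-in dict
data = {'A': {0:'John', 1:'Bob'},
        'Desc': {0:'Bill ,Gates Started Microsoft at 18 Bill', 1:'Bill Gates, Again .Bill Gates  and Larry Ellison'}}

df = pd.DataFrame(data)
# Remove punctuation (pass regex=True explicitly; pandas 2.0+ defaults to a literal match)
df['Desc'] = df['Desc'].str.replace(r'[^\w\s]+', '', regex=True)
print(df.loc[:, "Desc"])

# Word frequencies for the first row, then the number of unique words in it
print(Counter(" ".join(df.loc[0:0, "Desc"]).split()).items())
print(len(Counter(" ".join(df.loc[0:0, "Desc"]).split())))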
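
The Counter idea can also be applied to every row at once, which gives both the per-row frequencies the question mentions and the unique-word count. This is a sketch rather than part of the original answer; it reuses the answer's two sample strings, and the column names Freq and Unique are assumptions.

```python
from collections import Counter

import pandas as pd

# Sample rows taken from the answer's own example strings.
df = pd.DataFrame({"Desc": ["Bill ,Gates Started Microsoft at 18 Bill",
                            "Bill Gates, Again .Bill Gates  and Larry Ellison"]})

# Strip punctuation, then split each row on whitespace into a list of words.
words = df["Desc"].str.replace(r"[^\w\s]+", "", regex=True).str.split()

df["Freq"] = words.map(lambda ws: Counter(ws))  # word -> occurrences within that row
df["Unique"] = df["Freq"].map(len)              # number of distinct words in that row

print(df["Unique"].tolist())   # -> [6, 6]
print(df["Freq"][0]["Bill"])   # -> 2
```

Counting per row avoids joining the whole column into one string, which would mix the words of all rows into a single tally.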

3 answers

Answer 1
2 Accepted 2020-10-21 00:01:13

Answer 2
1 2020-10-21 00:05:37

Answer 3
0 2020-10-21 00:42:49
