Find trigrams for all groupby clusters in a Pandas DataFrame and return them in a new column
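I am trying to return the most frequent trigram for each group of keywords in a new column of a pandas DataFrame. (Essentially a groupby with a transform that returns the top trigram in a new column.)
Example DataFrame with dummy data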
   cluster_name  keyword
0  summer        summer dresses size 10
1  summer        summer dresses size 12
2  summer        large summer dresses
3  summer        summer dresses size 14
4  strappy       ladies strappy summer dresses
5  strappy       strappy summer dresses uk 2022
6  strappy       strappy summer dress
7  strappy       strappy summer dresses
8  strappy       thin strap summer dresses
Desired output
   cluster_name  trigram
0  summer        summer dresses size
4  strappy       strappy summer dresses
Minimal reproducible example
import pandas as pd
data = [
    ["summer", "summer dresses size 10"],
    ["summer", "summer dresses size 12"],
    ["summer", "large summer dresses"],
    ["summer", "summer dresses size 14"],
    ["strappy", "ladies strappy summer dresses"],
    ["strappy", "strappy summer dresses uk 2022"],
    ["strappy", "strappy summer dress"],
    ["strappy", "strappy summer dresses"],
    ["strappy", "thin strap summer dresses"],
]
df = pd.DataFrame(data, columns=['cluster_name', 'keyword'])
print(df)
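What I've tried
I have working code that finds bigrams, but it is a bit hacky. It is fast, though (much faster than iterrows, which I would like to avoid). It is adapted from this solution: How to get group-by and get most recent words and bigrams for each group pandas
The ideal result would be a generic solution that I can tweak slightly, changing a single value to return unigrams, bigrams, trigrams, and so on.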
from collections import Counter

def bigram(row):
    # Build the list of adjacent word pairs for one keyword.
    lst = row['keyword'].split(' ')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

df['parent_cluster'] = df.apply(lambda row: bigram(row), axis=1)
# Summing lists concatenates them, giving one list of bigrams per cluster.
df2 = df.groupby('cluster_name').agg({'parent_cluster': 'sum'})
df3 = df2.parent_cluster.apply(lambda row: Counter(row)).to_frame().astype(str)
df3["parent_cluster"] = (df3["parent_cluster"].str.split(',').str[0])
# clean up the column to remove the string of the Counter library.
df3["parent_cluster"] = df3["parent_cluster"].str.replace(r"Counter\(\{\('", '', regex=True)
df3["parent_cluster"] = df3["parent_cluster"].str.replace("'", '')
You can use nltk.ngrams in combination with explode / groupby / mode:
from nltk import ngrams # or use a custom function
out = (df
   .assign(keyword=[list(ngrams(s.split(), n=3)) for s in df['keyword']])
   .explode('keyword')
   .groupby('cluster_name')['keyword'].apply(lambda g: g.mode()[0])
)
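The comment in the snippet above mentions a custom function as an alternative to nltk. A minimal sketch of what such a helper might look like follows; the name `ngrams` is reused here only so it could stand in for the import, and the implementation is an assumption, not nltk's own code.

```python
def ngrams(words, n):
    """Return an iterator over successive n-word tuples from a list of words.

    Hypothetical stand-in for nltk.ngrams: it zips the word list with
    copies of itself shifted by 1, 2, ..., n-1 positions.
    """
    return zip(*(words[i:] for i in range(n)))


# A keyword with fewer than n words simply yields no n-grams.
print(list(ngrams("strappy summer dresses uk 2022".split(), n=3)))
print(list(ngrams("summer".split(), n=3)))
```

Because `zip` stops at the shortest input, no explicit length check is needed.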
output:
cluster_name
strappy    (strappy, summer, dresses)
summer        (summer, dresses, size)
Name: keyword, dtype: object
As strings:
out = (df
   .assign(keyword=[[' '.join(x) for x in ngrams(s.split(), n=3)]
                    for s in df['keyword']])
   .explode('keyword')
   .groupby('cluster_name')['keyword'].apply(lambda g: g.mode()[0])
   .reset_index(name='trigram')
)
output:
  cluster_name                 trigram
0      strappy  strappy summer dresses
1       summer     summer dresses size
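Since the question asks for a version where a single value switches between unigrams, bigrams and trigrams, the same explode / groupby / mode pipeline can be wrapped in a function that takes `n`. This is only a sketch: the function name `top_ngram`, its parameters and the output column name are hypothetical, and it builds the n-grams with plain slicing rather than nltk.

```python
import pandas as pd


def top_ngram(df, n, text_col="keyword", group_col="cluster_name"):
    """Return the most frequent n-gram per group (hypothetical helper)."""
    def ngram_strings(text):
        words = text.split()
        # Empty list when the text has fewer than n words.
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    return (df
            .assign(ngram=[ngram_strings(s) for s in df[text_col]])
            .explode("ngram")
            .dropna(subset=["ngram"])  # rows too short to produce an n-gram
            .groupby(group_col)["ngram"].apply(lambda g: g.mode()[0])
            .reset_index(name=f"{n}-gram"))


data = [
    ["summer", "summer dresses size 10"],
    ["summer", "summer dresses size 12"],
    ["summer", "large summer dresses"],
    ["summer", "summer dresses size 14"],
    ["strappy", "ladies strappy summer dresses"],
    ["strappy", "strappy summer dresses uk 2022"],
    ["strappy", "strappy summer dress"],
    ["strappy", "strappy summer dresses"],
    ["strappy", "thin strap summer dresses"],
]
df = pd.DataFrame(data, columns=["cluster_name", "keyword"])

print(top_ngram(df, 3))  # change 3 to 1 or 2 for unigrams or bigrams
```

When several n-grams tie for the top count within a group, `mode()` returns all of them and `[0]` keeps only the first, so ties are resolved silently. In the dummy data this happens for `n=1` and `n=2` in some clusters.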