Count match instances then sum values
I would like to get this result from these two DataFrames:
df1 = pd.DataFrame({'url': [
'http://google.com/men',
'http://google.com/women',
'http://google.com/men-shoes',
'http://google.com/women-shoes',
'http://google.com/not-important',
], 'click': [3, 4, 6, 5, 8]})
df2 = pd.DataFrame({'keyword': ['men','women','shoes', 'kids']})
Result:
keyword instances clicks
0 men 2 9.0
1 women 2 9.0
2 shoes 2 11.0
3 kids 0 0.0
Which is basically counting how many times each df2 keyword appears in any df1 url, then summing the click values of those matching df1 rows.
I am struggling to get this result, thanks.
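For comparison, a minimal exact-token sketch of the desired computation (no fuzzy matching; it assumes keywords only ever appear as whole `-`-separated tokens in the last path segment of the url):

```python
import pandas as pd

df1 = pd.DataFrame({'url': [
    'http://google.com/men',
    'http://google.com/women',
    'http://google.com/men-shoes',
    'http://google.com/women-shoes',
    'http://google.com/not-important',
], 'click': [3, 4, 6, 5, 8]})
df2 = pd.DataFrame({'keyword': ['men', 'women', 'shoes', 'kids']})

# Tokenize the last path segment of each url on '-'
tokens = df1['url'].str.rsplit('/', n=1).str[-1].str.split('-')

# For each keyword, count the matching rows and sum their clicks
df2['instances'] = df2['keyword'].apply(lambda k: sum(k in t for t in tokens))
df2['clicks'] = df2['keyword'].apply(
    lambda k: df1.loc[[k in t for t in tokens], 'click'].sum()
)
```

Exact token membership avoids the substring pitfall where 'men' would otherwise also match inside 'women'.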
You can use the fuzzy_merge function I wrote, combining it with explode and groupby; we get quite close to your result. Note this is still fuzzy matching, which is why there's a difference. You can try to play with the threshold argument to get your desired result:
mrg = (
    fuzzy_merge(df1, df2, 'url', 'keyword')
    .explode('matches')
    .groupby('matches')
    # named aggregation: 'matches' is now the group index, so size/sum
    # must be computed from a remaining column such as 'click'
    .agg(matches=('click', 'size'), click=('click', 'sum'))
)
df2['instances'] = df2['keyword'].map(mrg['matches']).fillna(0)
df2['clicks'] = df2['keyword'].map(mrg['click']).fillna(0)
keyword instances clicks
0 men 2.0 7.0
1 women 2.0 9.0
2 shoes 2.0 11.0
3 kids 0.0 0.0
Function used from linked answer:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold])
    df_1['matches'] = m2
    return df_1
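If fuzzywuzzy is not installed, the matching step can be roughly approximated with the standard library's difflib. Note that SequenceMatcher.ratio() (0.0 to 1.0) is a different similarity measure than fuzzywuzzy's Levenshtein-based 0 to 100 score, so the threshold does not transfer directly; this is only a dependency-free sketch of the same idea:

```python
import difflib

def fuzzy_matches(text, candidates, threshold=0.6):
    # Keep candidates whose similarity ratio to `text` meets the threshold
    return [c for c in candidates
            if difflib.SequenceMatcher(None, text, c).ratio() >= threshold]

# 'shoes' scores ~0.71 against 'men-shoes'; 'men' scores exactly 0.5
matches = fuzzy_matches('men-shoes', ['men', 'women', 'shoes', 'kids'],
                        threshold=0.5)
```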
You can try this: it extracts the last part of the URL after the /
and splits it on -
(maybe it will be enough for your case):
df1['keyword'] = df1['url'].str.extract(r'/([^/]+?)$')[0].str.split(r'-')
print( pd.merge(df1.explode('keyword'), df2, how='right')
         .groupby('keyword').agg({'click': 'sum', 'url': lambda x: x[~x.isna()].count()})
         .rename(columns={'click': 'clicks', 'url': 'instances'})
         .reset_index() )
Prints:
keyword clicks instances
0 kids 0.0 0
1 men 9.0 2
2 shoes 11.0 2
3 women 9.0 2
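The how='right' merge is what keeps 'kids' in the result: its row survives with NaN url and click, so the sum yields 0.0 and the non-NaN count yields 0. The intermediate can be inspected like this (a sketch reusing the data from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'url': [
    'http://google.com/men',
    'http://google.com/women',
    'http://google.com/men-shoes',
    'http://google.com/women-shoes',
    'http://google.com/not-important',
], 'click': [3, 4, 6, 5, 8]})
df2 = pd.DataFrame({'keyword': ['men', 'women', 'shoes', 'kids']})

# Tokenize the last path segment and explode to one row per (url, token)
df1['keyword'] = df1['url'].str.extract(r'/([^/]+?)$')[0].str.split(r'-')
merged = pd.merge(df1.explode('keyword'), df2, how='right')

# Tokens absent from df2 ('not', 'important') are dropped by the merge;
# 'kids' is kept as a single row with NaN url and click
print(merged)
```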