Count match instances then sum values

I would like to get this result from these two DataFrames:

df1 = pd.DataFrame({'url': [
  'http://google.com/men', 
  'http://google.com/women', 
  'http://google.com/men-shoes',
  'http://google.com/women-shoes',
  'http://google.com/not-important',
], 'click': [3, 4, 6, 5, 8]})

df2 = pd.DataFrame({'keyword': ['men','women','shoes', 'kids']})

Result:

  keyword  instances  clicks
0     men          2     9.0
1   women          2     9.0
2   shoes          2     11.0
3    kids          0     0.0

Which is basically counting how many times each df2 keyword appears in df1's url column, then merging to get the sum of the click column over the matching df1 rows.

I am struggling to get this result, thanks.
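If fuzzy matching is not required, the result can be computed directly with whole-word substring matching. A minimal sketch (the `summarize` helper is an illustrative name, not from the original post); the `\b` word boundaries are what keep 'men' from also matching 'women':

```python
import re
import pandas as pd

df1 = pd.DataFrame({'url': [
    'http://google.com/men',
    'http://google.com/women',
    'http://google.com/men-shoes',
    'http://google.com/women-shoes',
    'http://google.com/not-important',
], 'click': [3, 4, 6, 5, 8]})
df2 = pd.DataFrame({'keyword': ['men', 'women', 'shoes', 'kids']})

def summarize(kw):
    # match the keyword as a whole word so 'men' does not hit 'women'
    mask = df1['url'].str.contains(rf'\b{re.escape(kw)}\b')
    return pd.Series({'instances': int(mask.sum()),
                      'clicks': float(df1.loc[mask, 'click'].sum())})

result = pd.concat([df2, df2['keyword'].apply(summarize)], axis=1)
print(result)
```

This yields exactly the table above, including the zero row for 'kids', because every keyword is evaluated whether or not it matches any url.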

You can use the fuzzy_merge function I wrote; combining it with explode and groupby gets quite close to your result. Note this is still fuzzy matching, which is why there is a difference.

You can try playing with the threshold argument to get your desired result:

mrg = (
    fuzzy_merge(df1, df2, 'url', 'keyword')
     .explode('matches')
     .groupby('matches').agg({'matches':'size',
                              'click':'sum'})
)

df2['instances'] = df2['keyword'].map(mrg['matches']).fillna(0)
df2['clicks'] = df2['keyword'].map(mrg['click']).fillna(0)

  keyword  instances  clicks
0     men        2.0     7.0
1   women        2.0     9.0
2   shoes        2.0    11.0
3    kids        0.0     0.0
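The explode/groupby/map/fillna mechanics above can be isolated from the fuzzy step. A sketch assuming a precomputed matches column stands in for fuzzy_merge's output (the list values here mirror the expected matches, not actual fuzzywuzzy scores):

```python
import pandas as pd

# 'matches' plays the role of the column fuzzy_merge would add to df1
df = pd.DataFrame({'matches': [['men'], ['women'], ['men', 'shoes'],
                               ['women', 'shoes'], []],
                   'click': [3, 4, 6, 5, 8]})

# one row per (url, matched keyword) pair, then aggregate per keyword
agg = (df.explode('matches')
         .groupby('matches')
         .agg(instances=('matches', 'size'), clicks=('click', 'sum')))

# map the aggregates back onto the full keyword list; unmatched -> 0
keywords = pd.Series(['men', 'women', 'shoes', 'kids'])
out = pd.DataFrame({'keyword': keywords,
                    'instances': keywords.map(agg['instances']).fillna(0),
                    'clicks': keywords.map(agg['clicks']).fillna(0)})
print(out)
```

The map/fillna step is what reinstates keywords like 'kids' that never matched: groupby only produces rows for keywords that occurred, so mapping back onto df2's keyword list and filling the gaps with 0 restores them.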

Function used from the linked answer:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold])
    df_1['matches'] = m2

    return df_1

You can try this: it will extract the last part of the URL after the / and split it on - (maybe that will be enough for your case):

df1['keyword'] = df1['url'].str.extract(r'/([^/]+?)$')[0].str.split(r'-')
print( pd.merge(df1.explode('keyword'), df2, how='right')
         .groupby('keyword').agg({'click': 'sum', 'url': lambda x: x[~x.isna()].count()  })
         .rename(columns={'click': 'clicks', 'url':'instances'})
         .reset_index() )

Prints:

  keyword  clicks  instances
0    kids     0.0          0
1     men     9.0          2
2   shoes    11.0          2
3   women     9.0          2
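To see what the extract/split step produces before the merge, here is a minimal sketch with two of the sample urls (variable names are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'url': ['http://google.com/men-shoes',
                            'http://google.com/not-important'],
                    'click': [6, 8]})

# capture the last path segment, then split it into candidate keywords
df1['keyword'] = df1['url'].str.extract(r'/([^/]+?)$')[0].str.split(r'-')

# 'men-shoes' becomes ['men', 'shoes']; explode gives one row per keyword,
# each carrying the url's click value
exploded = df1.explode('keyword')
print(exploded[['keyword', 'click']])
```

The subsequent `merge(..., how='right')` keeps every df2 keyword even when it never appears in a url, which is why 'kids' survives with 0 clicks; counting non-null url values then gives the instance count.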
