Count occurrences of a word/character in a field

Question

I have website visitor data that looks like the following example:

ID	pages
001	/ice-cream, /bagels, /bagels/flavors
002	/pizza, /pizza/flavors, /pizza/recipe

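I would like to transform it into the tables below, where I can count how many times each visitor viewed the parts of my site that deal with specific content. An overall count of all page views, which are delimited by commas, would be nice as well.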

ID	bagel_count
001	2
002	0

ID	pizza_count
001	0
002	3

ID	total_pages_count
001	3
002	3

I have the option of doing this in either SQL or Python, but I am not sure which is easier, hence the question.

I have looked at the following questions, but they do not serve my purpose:

Answer 1

We can split, then explode, and use crosstab to get your result:

df['pages'] = df.pages.str.split(r'[/, ]')
s = df.explode('pages')
out = pd.crosstab(s['id'], s['pages']).drop('', axis=1)
out
Out[427]: 
pages  bagels  flavors  ice-cream  pizza  recipe
id                                              
1           2        1          1      0       0
2           0        1          0      3       1
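For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split / explode / crosstab idea. The sample DataFrame and its column names (id, pages) are assumptions copied from the question, and the empty tokens are filtered out before the crosstab rather than dropped as a column afterwards.

```python
import pandas as pd

# Sample data copied from the question; the column names are assumed.
df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": [
        "/ice-cream, /bagels, /bagels/flavors",
        "/pizza, /pizza/flavors, /pizza/recipe",
    ],
})

# Split on "/", "," or a space, then give every token its own row.
# Adjacent separators (", /") leave empty tokens behind.
tokens = df.assign(pages=df["pages"].str.split(r"[/, ]")).explode("pages")

# Discard the empty tokens before counting.
tokens = tokens[tokens["pages"] != ""]

# One row per id, one column per distinct token.
counts = pd.crosstab(tokens["id"], tokens["pages"])
print(counts)
```

Filtering first avoids a KeyError from drop('', axis=1) in the case where no empty token exists in the data.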

Answer 2

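If you prefer SQL, this is the route I would go. I usually leave the pivoting to the reporting application, but if you really insist, Snowflake has good documentation on it that you can take from here.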

with cte (id, pages) as

(select '001', '/ice-cream, /bagels, /bagels/flavors' union all
 select '002', '/pizza, /pizza/flavors, /pizza/recipe')
  
  
select id, 
       t2.value, 
       count(*) as word_count,
       length(pages)-length(replace(pages,',',''))+1 as user_page_count
from cte, lateral split_to_table(translate(cte.pages, '- ,','/'),'/') as t2 -- normalize word delimiters using translate (similar to replace)
where t2.value in ('bagels','pizza') --your list goes here
group by id, pages, t2.value;
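The user_page_count expression in the query above is just "number of commas plus one". A rough pandas equivalent, assuming the same sample data and the column names id and pages from the question, might look like this:

```python
import pandas as pd

# Sample data copied from the question; the column names are assumed.
df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": [
        "/ice-cream, /bagels, /bagels/flavors",
        "/pizza, /pizza/flavors, /pizza/recipe",
    ],
})

# A comma-separated list with n commas holds n + 1 pages.
df["total_pages_count"] = df["pages"].str.count(",") + 1
print(df[["id", "total_pages_count"]])
```

Note that this reports 1 for an empty string, so rows without any page views would need separate handling.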

Answer 3

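Personally, I like to use regex with groups, then explode the result into a df that I merge back into the main one. This has several advantages over the split method, mainly that it avoids excessive memory use, which can improve performance significantly.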

import re
from typing import List, Dict
import pandas as pd

my_words = [
    'bagels',
    'flavors',
    'ice-cream', 
    'pizza',
    'recipe'
]

def count_words(string:str, words:List[str]=my_words) -> Dict[str, int]:
    """
    Returns a dictionary of summated values
    for selected words contained in string
    """
    
    # Create a dictionary to return values
    match_dict = {x:0 for x in words}
    
    # Numbered capture groups with word boundaries
    # Note this will not allow pluralities, unless specified
    # Also: cache (or frontload) this value to improve performance
    my_regex_string = '|'.join((fr'\b({x})\b' for x in words))
    my_pattern = re.compile(my_regex_string)
    
    for match in my_pattern.finditer(string):
        value = match.group()
        match_dict[value] +=1
    
    return match_dict


# Create a new df with values from function
new_df = df['pages'].apply(count_words).apply(pd.Series)

    bagels  flavors ice-cream   pizza   recipe
0   2   1   1   0   0
1   0   1   0   3   1


# Merge back to the main df
df[['id']].merge(new_df, left_index=True, right_index=True)

id  bagels  flavors ice-cream   pizza   recipe
0   1   2   1   1   0   0
1   2   0   1   0   3   1
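The comment in count_words suggests caching the compiled pattern. A minimal sketch of that variant is below, under a few assumptions: the names WORDS, PATTERN and count_words_cached are made up for illustration, the sample DataFrame is copied from the question, and re.escape is added so that words containing regex metacharacters are matched literally.

```python
import re
from collections import Counter
from typing import Dict, List

import pandas as pd

# Hypothetical word list, same entries as my_words above.
WORDS: List[str] = ["bagels", "flavors", "ice-cream", "pizza", "recipe"]

# Compiled once at import time instead of on every call.
PATTERN = re.compile("|".join(rf"\b{re.escape(w)}\b" for w in WORDS))


def count_words_cached(text: str) -> Dict[str, int]:
    """Count whole-word matches of WORDS in text, reusing the compiled pattern."""
    found = Counter(m.group() for m in PATTERN.finditer(text))
    # Return every word, including those that were never matched (count 0).
    return {w: found[w] for w in WORDS}


# Sample data copied from the question; the column names are assumed.
df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": [
        "/ice-cream, /bagels, /bagels/flavors",
        "/pizza, /pizza/flavors, /pizza/recipe",
    ],
})

# Build the counts frame from a list of dicts, then join on the shared index.
new_df = pd.DataFrame(df["pages"].map(count_words_cached).tolist(), index=df.index)
result = df[["id"]].join(new_df)
print(result)
```

Building the frame from a list of dicts sidesteps .apply(pd.Series), which constructs one Series per row.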

Answer 4

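I marked @BENY's answer as correct because of its elegance, but I also found a way to do this in Python that focuses on a specific keyword, assuming df looks like my original table: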

df['bagel_count'] = df["pages"].str.count('bagel')
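The same one-liner extends to several keywords with a loop. This is only a sketch: the keywords list is hypothetical and the sample data is copied from the question. Keep in mind that str.count treats its argument as a regular expression and counts substring matches, so 'bagel' also matches inside 'bagels'; re.escape is used here so the keyword is taken literally.

```python
import re

import pandas as pd

# Sample data copied from the question; the column names are assumed.
df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": [
        "/ice-cream, /bagels, /bagels/flavors",
        "/pizza, /pizza/flavors, /pizza/recipe",
    ],
})

keywords = ["bagel", "pizza"]  # hypothetical list of keywords to track

for kw in keywords:
    # Substring count per row; "bagel" matches inside "bagels" as well.
    df[f"{kw}_count"] = df["pages"].str.count(re.escape(kw))

print(df[["id", "bagel_count", "pizza_count"]])
```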
