
Count occurrences of word/character in a field

I have website visitor data that resembles the example below:

id   pages
001  /ice-cream, /bagels, /bagels/flavors
002  /pizza, /pizza/flavors, /pizza/recipe

I would like to transform this into the tables below, where I count the number of times each visitor has viewed a part of my site that deals with specific content. A general count of all pageviews (the comma-delimited entries) would be great as well.

id   bagel_count
001  2
002  0

id   pizza_count
001  0
002  3

id   total_pages_count
001  3
002  3

I have the option to do this in SQL or Python, but I am not sure which is easier, hence the question.

I have referenced similar questions, but they do not serve my purpose.
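For reference, the Python answers below operate on a DataFrame named df holding the sample above; a minimal sketch of that setup, assuming pandas:

import pandas as pd

# Sample visitor data from the question
df = pd.DataFrame({
    'id': ['001', '002'],
    'pages': ['/ice-cream, /bagels, /bagels/flavors',
              '/pizza, /pizza/flavors, /pizza/recipe'],
})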

We can split, then explode, and then get your result with crosstab:

# Split on '/', ',' or space so each path becomes a list of tokens
df['pages'] = df.pages.str.split(r'[/, ]')

# One row per token
s = df.explode('pages')

# Count tokens per id; drop the empty-string column produced by
# leading '/' and back-to-back delimiters
out = pd.crosstab(s['id'], s['pages']).drop('', axis=1)
out
Out[427]: 
pages  bagels  flavors  ice-cream  pizza  recipe
id                                              
1           2        1          1      0       0
2           0        1          0      3       1
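To reshape this into the exact columns the question asks for, the keyword counts can be read straight off the crosstab, and the total pageview count derived from the raw comma-delimited strings. A sketch; note the split above overwrote df['pages'] in place, so the totals here work from a fresh copy of the raw column (pages_raw is my own name):

# Fresh copy of the raw comma-delimited strings, since the split
# above replaced df['pages'] with lists of tokens
pages_raw = pd.Series(
    ['/ice-cream, /bagels, /bagels/flavors',
     '/pizza, /pizza/flavors, /pizza/recipe'],
    index=pd.Index(['001', '002'], name='id'),
)

bagel_count = out['bagels'].rename('bagel_count')   # per-id hits on 'bagels'
pizza_count = out['pizza'].rename('pizza_count')    # per-id hits on 'pizza'

# Comma-delimited, so total pageviews = number of commas + 1
total_pages_count = (pages_raw.str.count(',') + 1).rename('total_pages_count')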

I would go this route if you prefer SQL. I typically leave pivoting to reporting applications, but if you really insist, Snowflake has good documentation on it; you can take it from here:

with cte (id, pages) as
(
    select '001', '/ice-cream, /bagels, /bagels/flavors' union all
    select '002', '/pizza, /pizza/flavors, /pizza/recipe'
)
select id,
       t2.value,
       count(*) as word_count,
       length(pages) - length(replace(pages, ',', '')) + 1 as user_page_count
-- normalize word delimiters using translate (similar to replace)
from cte, lateral split_to_table(translate(cte.pages, '- ,', '/'), '/') as t2
where t2.value in ('bagels', 'pizza') -- your list goes here
group by id, pages, t2.value;
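If you want one row per id with a column per keyword, rather than one row per keyword, conditional aggregation gets there without PIVOT. A sketch over the same CTE, using Snowflake's iff (the output column names follow the question):

select id,
       sum(iff(t2.value = 'bagels', 1, 0)) as bagel_count,
       sum(iff(t2.value = 'pizza', 1, 0)) as pizza_count,
       length(pages) - length(replace(pages, ',', '')) + 1 as total_pages_count
from cte, lateral split_to_table(translate(cte.pages, '- ,', '/'), '/') as t2
group by id, pages;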

I personally like to use regular expressions with groups, then explode the results into a DataFrame which I merge back into the main one. This has several advantages over the split method; mainly, it avoids excessive memory usage, which can yield a dramatic performance improvement.

import re
from typing import List, Dict
import pandas as pd

my_words = [
    'bagels',
    'flavors',
    'ice-cream', 
    'pizza',
    'recipe'
]

def count_words(string: str, words: List[str] = my_words) -> Dict[str, int]:
    """
    Returns a dictionary of summed counts
    for the selected words contained in string
    """

    # Create a dictionary to hold the return values
    match_dict = {x: 0 for x in words}

    # Numbered capture groups with word boundaries
    # Note this will not allow pluralities, unless specified
    # Also: cache (or frontload) this value to improve performance
    my_regex_string = '|'.join(fr'\b({x})\b' for x in words)
    my_pattern = re.compile(my_regex_string)

    for match in my_pattern.finditer(string):
        value = match.group()
        match_dict[value] += 1

    return match_dict


# Create a new df with values from function
new_df = df['pages'].apply(count_words).apply(pd.Series)

   bagels  flavors  ice-cream  pizza  recipe
0       2        1          1      0       0
1       0        1          0      3       1


# Merge back to the main df
df[['id']].merge(new_df, left_index=True, right_index=True)

   id  bagels  flavors  ice-cream  pizza  recipe
0   1       2        1          1      0       0
1   2       0        1          0      3       1
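As the comment in count_words notes, the pattern can be compiled once instead of on every call, which matters when applying over many rows. A minimal variant of the same function (count_words_cached and MY_PATTERN are my own names):

# Compile once at module level rather than on every call
MY_PATTERN = re.compile('|'.join(fr'\b({x})\b' for x in my_words))

def count_words_cached(string: str, words: List[str] = my_words) -> Dict[str, int]:
    """Same as count_words, but reuses the precompiled pattern."""
    match_dict = {x: 0 for x in words}
    for match in MY_PATTERN.finditer(string):
        match_dict[match.group()] += 1
    return match_dict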

I'm going to mark @BENY's answer correct because of its elegance, but I also found a way to do this in Python, focused on a specific keyword. Assume df looks like my original table:

df['bagel_count'] = df["pages"].str.count('bagel')
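The same one-liner extends to the other requested columns. A sketch; note Series.str.count treats its argument as a regex, so 'bagel' also matches inside 'bagels':

df['pizza_count'] = df['pages'].str.count('pizza')

# Comma-delimited list, so total pages = commas + 1
df['total_pages_count'] = df['pages'].str.count(',') + 1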
