
N-Gram Analysis in Python

Here is what my sample data looks like:

[image: sample data]

I need to build 1-grams and 2-grams from the query column and calculate the sum and average of the impressions associated with each n-gram. So far I have figured out how to aggregate the impressions using the code below.

import pandas as pd

def n_grams(txt):
    # Collect every contiguous word sequence (n-gram) in txt
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)


counts = df['query'].apply(n_grams).join(df)
# Reshape to long form, then sum the impressions per n-gram string
result = (
    counts.drop("query", axis=1).set_index("impression").unstack()
    .rename("ngram").dropna().reset_index()
    .drop("level_0", axis=1).groupby("ngram")["impression"].sum()
)
# Name the index 'query' so it does not clash with the new 'ngram' (word count) column
result = result.rename_axis("query").reset_index()
result['ngram'] = result['query'].str.split().apply(len)
result = result.groupby(['ngram', 'query'])['impression'].sum()
result = result.reset_index()
result = result.sort_values(['ngram', 'impression'], ascending=[True, False])

The result looks like this:

[image: current result]

I need another column that shows the average impression associated with each query. For example, the word "nutrition" appears four times, so the average impression should be 100/4 = 25. I also want another column that shows how many times each query appears. The final result should look like this: [image: desired result]

This code will help you get the count of unigrams such as 'nutrition' from bigrams.

# Keep only the bigram rows ('2gram' is not a valid Python name, so use 'bigrams')
bigrams = result[result['ngram'] == 2]
bigrams = bigrams.reset_index(drop=True)
# Create an empty dictionary to store the count of each word in the bigrams
words = dict()
for i in range(len(bigrams)):
    query_words = bigrams.loc[i, 'query'].split()
    for item in query_words:
        if item not in words:
            words[item] = 1
        else:
            words[item] += 1
# Get the count of the word 'nutrition'
nut_ct = words['nutrition']
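
If the goal is the sum, count, and average per n-gram in one table, a possible alternative is to expand each query into its 1-grams and 2-grams with DataFrame.explode (pandas 0.25 or later) and aggregate with a single groupby. This is only a sketch: the sample df is made up for illustration (only the column names 'query' and 'impression' come from the question), and the helper one_two_grams and the output column names total, count, and average are hypothetical.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "query": ["nutrition facts", "nutrition", "dog nutrition", "nutrition label"],
    "impression": [10, 40, 20, 30],
})

def one_two_grams(txt):
    """Return the distinct 1-grams and 2-grams of a whitespace-separated string."""
    words = txt.split()
    grams = []
    for n in (1, 2):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            # Count each n-gram at most once per query
            if gram not in grams:
                grams.append(gram)
    return grams

# One row per (query, n-gram) pair, carrying the query's impression along
expanded = df.assign(gram=df["query"].apply(one_two_grams)).explode("gram")
expanded["ngram"] = expanded["gram"].str.split().str.len()

# Sum, number of queries, and mean impression per n-gram
summary = (
    expanded.groupby(["ngram", "gram"])["impression"]
    .agg(total="sum", count="count", average="mean")
    .reset_index()
    .sort_values(["ngram", "total"], ascending=[True, False])
)
print(summary)
```

With this made-up data, "nutrition" occurs in four queries whose impressions sum to 100, so its average is 25. An n-gram repeated inside a single query is counted once for that query; drop the duplicate check in one_two_grams if every occurrence should count.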
