N-Gram Analysis in Python

Question

Here is what my sample data looks like:

I need to build 1-grams and 2-grams from the query column and calculate the sum and average of the impressions associated with each n-gram. So far I have figured out how to aggregate the impressions using the code below.

import pandas as pd

def n_grams(txt):
    # collect every contiguous run of words in the query
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)


counts = df['query'].apply(n_grams).join(df)
# reshape to one row per (impression, n-gram) pair, then sum impressions per n-gram
result = (
    counts.drop("query", axis=1)
    .set_index("impression")
    .unstack()
    .rename("ngram")
    .dropna()
    .reset_index()
    .drop("level_0", axis=1)
    .groupby("ngram")["impression"]
    .sum()
)
result = result.to_frame()
result['query'] = result.index
# 'ngram' now holds the number of words in each n-gram
result['ngram'] = result['query'].str.split().apply(len)
result = result.groupby(['ngram', 'query'])['impression'].sum()
result = result.reset_index()
result = result.sort_values(['ngram', 'impression'], ascending=[True, False])

The result looks like this:

I need another column that shows the average impression for each query. For example, the word "nutrition" appears four times, so its average impression should be 100/4 = 25. I also want another column that shows how many times each query appears. The final result should look like this:

Answer 1

This code will help you get the count of unigrams such as 'nutrition' from bigrams.

# Python names cannot start with a digit, so the frame is called bigrams
bigrams = result[result['ngram'] == 2]
bigrams = bigrams.reset_index()
# create an empty dictionary to store the count of words in bigrams
words = dict()
for i in range(len(bigrams)):
    query_words = bigrams.loc[i, 'query'].split()
    for item in query_words:
        if item not in words:
            words[item] = 1
        else:
            words[item] += 1
# get the count of the word 'nutrition'
nut_ct = words['nutrition']
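
If what you want in the end is the sum, the number of occurrences, and the average in one table, something along these lines might be simpler. This is only a sketch under some assumptions: the sample DataFrame is made up (the real data is not shown in the question), one_two_grams and the column names total, occurrences, and average are names I picked for illustration, and an n-gram that appears twice in the same query is counted twice. It needs pandas 0.25 or newer for explode and named aggregation.

```python
import pandas as pd

def one_two_grams(text):
    """Return every 1-gram and 2-gram of a whitespace-separated query."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in (1, 2)
        for i in range(len(words) - n + 1)
    ]

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({
    "query": ["nutrition facts", "sports nutrition", "nutrition"],
    "impression": [10, 20, 30],
})

# One row per (query, n-gram) pair; each row keeps the query's impression.
exploded = df.assign(gram=df["query"].apply(one_two_grams)).explode("gram")

# Sum, occurrence count, and mean of impressions for each n-gram.
summary = (
    exploded.groupby("gram")["impression"]
    .agg(total="sum", occurrences="count", average="mean")
    .reset_index()
)

# Number of words in each n-gram, used for sorting as in the question.
summary["ngram"] = summary["gram"].str.split().str.len()
summary = summary.sort_values(["ngram", "total"], ascending=[True, False])
print(summary)
```

Here average is simply total divided by occurrences, which matches the 100/4 = 25 reasoning in the question.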
