Here is what my sample data looks like:
I need to build 1- and 2-grams from each query and calculate the sum and average of the impressions associated with them. So far I've figured out how to aggregate (sum) the impressions using the code below.
import pandas as pd

def n_grams(txt):
    # Collect every contiguous word n-gram of the query (all lengths).
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)
counts = df['query'].apply(n_grams).join(df)
result = (
    counts.drop("query", axis=1)
    .set_index("impression")
    .unstack()
    .rename("ngram")
    .dropna()
    .reset_index()
    .drop("level_0", axis=1)
    .groupby("ngram")["impression"]
    .sum()
)
# Move the n-gram text from the index into a "query" column so that
# "ngram" can hold the n-gram size without clashing with the index name.
result = result.rename_axis("query").reset_index()
result['ngram'] = result['query'].str.split().apply(len)
result = result.groupby(['ngram', 'query'])['impression'].sum()
result = result.reset_index()
result = result.sort_values(['ngram', 'impression'], ascending=[True, False])
The result looks like this:
Now I need another column showing the average impression associated with each query. For example, the word "nutrition" appears four times, so its average impression should be 100/4 = 25. I also want another column showing how many times each query appears. The final result should look like this:
This code will help you get the count of unigrams such as 'nutrition' from bigrams.
bigrams = result[result['ngram'] == 2]
bigrams = bigrams.reset_index()
# create an empty dictionary to store the count of words in bigrams
words = dict()
for i in range(0, len(bigrams)):
    query_words = bigrams.loc[i, 'query'].split()
    for item in query_words:
        if item not in words:
            words[item] = 1
        else:
            words[item] += 1
# to get the count of the word 'nutrition'
nut_ct = words['nutrition']
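If the goal is to get the sum, the count, and the average in one pass, one possible approach is to keep two running tallies per n-gram while walking over the original queries. The following is only a minimal sketch in plain Python, not the code from the question: the helper `ngram_stats`, the `max_n` parameter, and the sample rows are all made up for illustration, and it assumes each input row is a `(query, impression)` pair. The sample is chosen so that "nutrition" totals 100 impressions over four queries, as in the example from the question.

```python
from collections import defaultdict


def n_grams(text, max_n=2):
    """Return every 1..max_n word n-gram of text (hypothetical variant)."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def ngram_stats(rows, max_n=2):
    """Aggregate impressions per n-gram over (query, impression) pairs."""
    totals = defaultdict(int)  # n-gram -> summed impressions
    counts = defaultdict(int)  # n-gram -> number of occurrences
    for query, impression in rows:
        for gram in n_grams(query, max_n):
            totals[gram] += impression
            counts[gram] += 1
    stats = [
        {
            "ngram": len(gram.split()),
            "query": gram,
            "impression": totals[gram],
            "count": counts[gram],
            "avg_impression": totals[gram] / counts[gram],
        }
        for gram in totals
    ]
    # Order by n-gram size ascending, then impressions descending.
    stats.sort(key=lambda row: (row["ngram"], -row["impression"]))
    return stats


# Made-up sample data: "nutrition" occurs in four queries totalling 100.
sample = [
    ("nutrition", 10),
    ("nutrition facts", 20),
    ("sports nutrition", 30),
    ("nutrition label guide", 40),
]
stats = ngram_stats(sample)
```

With this sample, the first row is "nutrition" with an impression total of 100, a count of 4, and an average of 25.0. Note that a word repeated inside a single query is counted once per occurrence. If a DataFrame is needed, the list of dicts can be passed to `pd.DataFrame(stats)`.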