I have a sample dataset like this:
"Author", "Normal_Tokenized"
x , ["I","go","to","war","I",..]
y , ["me","you","and","us",..]
z , ["let","us","do","our","best",..]
I want a dataframe reporting the 10 most frequent words and the counts (frequencies) for each author:
"x_text", "x_count", "y_text", "y_count", "z_text", "z_count"
go , 1000 , come , 120 , let , 12
and so on ...
I tried the following snippet, but it only returns the values for the last author instead of for all authors.
The code is meant to return the 10 most common words each author used in their novel:
df_words = pd.concat([pd.DataFrame(
    data={'Author': [row['Author'] for _ in row['Normal_Tokenized']], 'Normal_Tokenized': row['Normal_Tokenized']})
    for idx, row in df.iterrows()], ignore_index=True)
df_words = df_words[~df_words['Normal_Tokenized'].isin(stop_words)]
def authorCommonWords(numWords):
    for author in authors:
        authorWords = df_words[df_words['Author'] == author].groupby('Normal_Tokenized').size().reset_index().rename(
            columns={0: 'Count'})
        authorWords.sort_values('Count', inplace=True)
        df = pd.DataFrame(authorWords[-numWords:])
        df.to_csv("common_word.csv", header=False, mode='a', encoding='utf-8',
                  index=False)
    return authorWords[-numWords:]
authorCommonWords(10)
There are about 130,000 samples for each author. The example gets the 10 most repeated words across those 130,000 samples. I want these 10 words in separate columns for each author.
np.unique(return_counts=True) seems to be what you're looking for.
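As a quick illustration of what that call returns, here is a minimal sketch on a made-up token list (the `tokens` values are only for demonstration). Note that `np.unique` returns the distinct values in sorted order, not in order of frequency, so the counts have to be sorted separately to get the most frequent words first:

```python
import numpy as np

# Made-up tokens, only for demonstration
tokens = ["war", "go", "war", "I", "war", "go"]

# Distinct values (sorted) and how often each one occurs
words, counts = np.unique(tokens, return_counts=True)

# Reorder by descending count; a stable sort keeps ties in sorted order
order = np.argsort(-counts, kind="stable")
top_words = words[order].tolist()
top_counts = counts[order].tolist()

print(top_words, top_counts)
```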
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I","go","to","war","I"],
                         ["me","you","and","us"],
                         ["let","us","do","our","best"]]
})
n_top = 6  # count top n
df_want = pd.DataFrame(index=range(n_top))
for au, ls in df.itertuples(index=False, name=None):
    words, freqs = np.unique(ls, return_counts=True)
    # np.unique returns words in sorted order, so reorder by descending count
    order = np.argsort(-freqs, kind="stable")
    words, freqs = words[order], freqs[order]
    len_words = len(words)
    if len_words >= n_top:
        df_want[f"{au}_text"] = words[:n_top]
        df_want[f"{au}_count"] = freqs[:n_top]
    else:  # too few distinct words: pad with "" and 0
        df_want[f"{au}_text"] = [words[i] if i < len_words else "" for i in range(n_top)]
        df_want[f"{au}_count"] = [freqs[i] if i < len_words else 0 for i in range(n_top)]
print(df_want)
  x_text  x_count y_text  y_count z_text  z_count
0      I        2    and        1   best        1
1     go        1     me        1     do        1
2     to        1     us        1    let        1
3    war        1    you        1    our        1
4               0               0     us        1
5               0               0               0
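For comparison, the same wide table can also be built with `collections.Counter` from the standard library, whose `most_common` method returns words ordered by frequency directly. This is only a sketch under a few assumptions: the helper name `top_words_wide` and its `stop_words` parameter are made up for illustration (the question filters stop words, but its list is not shown), and the sample data is the small frame from above.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I", "go", "to", "war", "I"],
                         ["me", "you", "and", "us"],
                         ["let", "us", "do", "our", "best"]],
})


def top_words_wide(df, n_top, stop_words=frozenset()):
    """Hypothetical helper: one <author>_text / <author>_count column pair per author."""
    columns = {}
    for author, tokens in zip(df["Author"], df["Normal_Tokenized"]):
        # Count tokens, skipping any stop words
        counts = Counter(t for t in tokens if t not in stop_words)
        # most_common returns (word, count) pairs, highest count first
        top = counts.most_common(n_top)
        # Pad with ("", 0) when the author has fewer than n_top distinct words
        top += [("", 0)] * (n_top - len(top))
        columns[f"{author}_text"] = [word for word, _ in top]
        columns[f"{author}_count"] = [count for _, count in top]
    return pd.DataFrame(columns)


df_top = top_words_wide(df, n_top=3)
print(df_top)
```

Words with equal counts come out in the order they were first seen rather than alphabetically, so ties may be ordered differently from the `np.unique` version.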