
Counting the top 10 most frequent words per row

I have a sample dataset like this:

"Author", "Normal_Tokenized"  
x       , ["I","go","to","war","I",..]  
y       , ["me","you","and","us",..]
z       , ["let","us","do","our","best",..]

I want a dataframe reporting the 10 most frequent words and the counts (frequencies) for each author:

"x_text", "x_count", "y_text", "y_count", "z_text", "z_count"  
go ,        1000   ,  come   ,  120     , let     , 12

and so on ...

I attempted the following snippet, but it only keeps the last author's values instead of all authors' values.

This code actually returns the 10 most common words the author has used in his novel:

# Put every token on its own row, paired with its author
df_words = pd.concat(
    [pd.DataFrame(data={'Author': [row['Author'] for _ in row['Normal_Tokenized']],
                        'Normal_Tokenized': row['Normal_Tokenized']})
     for idx, row in df.iterrows()],
    ignore_index=True)
df_words = df_words[~df_words['Normal_Tokenized'].isin(stop_words)]  # drop stop words

def authorCommonWords(numWords):
    for author in authors:
        authorWords = (df_words[df_words['Author'] == author]
                       .groupby('Normal_Tokenized').size()
                       .reset_index().rename(columns={0: 'Count'}))
        authorWords.sort_values('Count', inplace=True)
        df = pd.DataFrame(authorWords[-numWords:])  # overwritten on every loop iteration
    df.to_csv("common_word.csv", header=False, mode='a', encoding='utf-8',
              index=False)
    return authorWords[-numWords:]

authorCommonWords(10)

There are about 130,000 samples for each author. The snippet above gets the 10 most repeated words across those 130,000 samples. I want these 10 words in separate columns for each author.

np.unique(..., return_counts=True) seems to be what you're looking for.
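For example, on a small token list it returns the distinct values (in sorted order, not ordered by frequency) together with how often each one occurs:

import numpy as np

tokens = ["I", "go", "to", "war", "I"]
words, freqs = np.unique(tokens, return_counts=True)
print(words)  # ['I' 'go' 'to' 'war']  -- distinct tokens, sorted lexicographically
print(freqs)  # [2 1 1 1]              -- matching counts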

Data

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I","go","to","war","I"],
                         ["me","you","and","us"],
                         ["let","us","do","our","best"]]
})

Code

n_top = 6  # report the top n words per author

df_want = pd.DataFrame(index=range(n_top))
for au, ls in df.itertuples(index=False, name=None):
    words, freqs = np.unique(ls, return_counts=True)
    order = np.argsort(-freqs, kind="stable")  # most frequent first; ties keep alphabetical order
    words, freqs = words[order], freqs[order]
    len_words = len(words)
    if len_words >= n_top:
        df_want[f"{au}_text"] = words[:n_top]
        df_want[f"{au}_count"] = freqs[:n_top]
    else:  # too few distinct words: pad with empty strings and zero counts
        df_want[f"{au}_text"] = [words[i] if i < len_words else "" for i in range(n_top)]
        df_want[f"{au}_count"] = [freqs[i] if i < len_words else 0 for i in range(n_top)]

Result

print(df_want)
  x_text  x_count y_text  y_count z_text  z_count
0      I        2    and        1   best        1
1     go        1     me        1     do        1
2     to        1     us        1    let        1
3    war        1    you        1    our        1
4               0               0     us        1
5               0               0               0
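If you also want to drop stop words before counting, as your original snippet does, filter each token list before passing it to np.unique. A minimal sketch, assuming a hypothetical stop_words set (only the filtering line changes compared to the loop above):

stop_words = {"to", "and", "us"}  # placeholder set; substitute your own stop words

for au, ls in df.itertuples(index=False, name=None):
    kept = [w for w in ls if w not in stop_words]  # remove stop words for this author
    words, freqs = np.unique(kept, return_counts=True)
    print(au, dict(zip(words, freqs)))

And if you still need the result on disk as in your original code, df_want.to_csv("common_word.csv", index=False, encoding="utf-8") writes the wide table out.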
