
How to count the 500 most common words in a pandas DataFrame

I have a dataframe with 500 texts in a column called Text (one text per row) and I want to count the most common words across all texts.

So far I have tried the following (both methods are from Stack Overflow):

pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]

and

Counter(" ".join(df["Text"]).split()).most_common(100)

Both gave me the following error:

TypeError: sequence item 0: expected str instance, list found

I have also tried the Counter approach with

df.Text.apply(Counter)

which gave me the word counts for each text. I also altered it so that it returned the most common words in each text.

But I want the most common words overall.

Here is a sample of the dataframe (the text is already lowercased, stripped of punctuation, tokenized, and has stop words removed):

    Datum   File    File_type                                         Text                         length    len_cleaned_text
Datum                                                   
2000-01-27  2000-01-27  _04.txt     _04     [business, date, jan, heineken, starts, integr...       396         220

Edit: Code to reproduce it

for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]

    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        text = rfile.read()
        # drop characters that cannot be encoded as valid UTF-8
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    # count_the and length are computed elsewhere (not shown here)
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)

New cell:

import re
from nltk.tokenize import word_tokenize

df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')  # punctuation
d = re.compile(r'\d+')       # digits
# These operations act on the whole column, so they only need to run once (no row loop).
df['Text'] = df['Text'].str.replace('\n', ' ')
df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)


    Datum   File    File_type   Text    length  the
Datum                       
2000-01-27  2000-01-27  0864820040_000127_04.txt    _04     [business, date, jan, heineken, starts, integr...   396     0
2000-02-01  2000-02-01  0910068040_000201_04.txt    _04     [group, english, cns, date, feb, bat, acquisit...   305     0
2000-05-03  2000-05-03  1070448040_000503_04.txt    _04     [date, may, cobham, plc, cob, acquisitionsdisp...   701     0
2000-05-11  2000-05-11  0865985020_000511_04.txt    _04     [business, date, may, swedish, match, complete...   439     0
2000-11-28  2000-11-28  1067252020_001128_04.txt    _04     [date, nov, intec, telecom, sys, itl, doc, pla...   158     0
2000-12-18  2000-12-18  1963867040_001218_04.txt    _04     [associated, press, apw, date, dec, volvo, div...   367     0
2000-12-19  2000-12-19  1065767020_001219_04.txt    _04     [date, dec, spirent, plc, spt, acquisition, co...   414     0
2000-12-21  2000-12-21  1076829040_001221_04.txt    _04     [bloomberg, news, bn, date, dec, eni, ceo, cfo...   271     0
2001-02-06  2001-02-06  1084749020_010206_04.txt    _04     [date, feb, chemring, group, plc, chg, acquisi...   130     0
2001-02-15  2001-02-15  1063497040_010215_04.txt    _04     [date, feb, electrolux, ab, elxb, acquisition,...   420     0

And a description of the dataframe:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum               557 non-null datetime64[ns]
File                557 non-null object
File_type           557 non-null object
Text                557 non-null object
customers           557 non-null int64
grwoth              557 non-null int64
human               557 non-null int64
intagibles          557 non-null int64
length              557 non-null int64
synergies           557 non-null int64
technology          557 non-null int64
the                 557 non-null int64
len_cleaned_text    557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB

Thanks in advance

OK, I got it. Each entry in df['Text'] is a list of words rather than a string, which is why ' '.join raises the TypeError. So you can do this:

full_list = []  # list containing all words of all texts
for elmnt in df['Text']:  # loop over lists in df
    full_list += elmnt  # append elements of lists to full list

val_counts = pd.Series(full_list).value_counts()  # make temporary Series to count

This solution avoids list comprehensions, which keeps the code easy to read and understand. Furthermore, no additional modules such as re or collections are needed.
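For comparison, here is a minimal sketch of the same idea using Series.explode (available in pandas 0.25 and later), which puts one word per row before counting. The small frame is made-up sample data, not the asker's, and head(500) stands in for the top 500 asked for in the title.

```python
import pandas as pd

# Hypothetical sample data: each row holds a list of tokens, as in the question.
df = pd.DataFrame({
    "Text": [
        ["business", "date", "jan", "heineken"],
        ["date", "feb", "acquisition"],
        ["date", "acquisition", "plc"],
    ]
})

# explode() turns every list element into its own row; value_counts() then
# counts each word across all texts, sorted from most to least frequent.
word_counts = df["Text"].explode().value_counts()

# Keep at most the 500 most common words.
top_500 = word_counts.head(500)
print(top_500)
```

The result is a Series indexed by word, so top_500.index.tolist() gives just the words if the counts are not needed.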

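Here is my version: I convert the column values into a list, drop the NaN entries, flatten everything into a single list of words, and count them. Note that ast.literal_eval only works if each entry is the string representation of a list (for example after reading the data back from a CSV file); if the entries are already lists, iterate over item directly.

import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']  # drop NaN rows
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]

counter = collections.Counter(flat_list)
top_words = counter.most_common(100)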
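A self-contained sketch of this variant on made-up data, assuming the token lists are stored as strings (as happens after a CSV round trip). Here dropna() is used in place of the str(x) != 'nan' filter; that substitution is my own choice, not part of the answer above.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical data: the token lists are stored as *strings*, which is what
# happens when a column of lists is written to CSV and read back.
df = pd.DataFrame({"Text": ["['date', 'jan', 'plc']", "['date', 'feb']", None]})

# Drop missing rows, then parse each string back into a real Python list.
parsed = df["Text"].dropna().apply(ast.literal_eval)

# Flatten the lists and count all words.
counter = Counter(word for tokens in parsed for word in tokens)
top_words = counter.most_common(100)
print(top_words)
```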

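You can also do it with the apply and Counter.update methods:

import pandas as pd
from collections import Counter

counter = Counter()
df = pd.DataFrame({'Text': values})  # values: a list of token lists (see df['Text'] below)
_ = df['Text'].apply(lambda x: counter.update(x))  # update() returns None; only the side effect matters

counter.most_common(10)

Out: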

[('Amy', 3), ('was', 3), ('hated', 2),
 ('Kamal', 2), ('her', 2), ('and', 2), 
 ('she', 2), ('She', 2), ('sent', 2), ('text', 2)]

Where df['Text'] is:

0    [Amy, normally, hated, Monday, mornings, but, ...
1    [Kamal, was, in, her, art, class, and, she, li...
2    [She, was, waiting, outside, the, classroom, w...
3              [Hi, Amy, Your, mum, sent, me, a, text]
4                         [You, forgot, your, inhaler]
5    [Why, don’t, you, turn, your, phone, on, Amy, ...
6    [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
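Since apply is used here only for its side effect, a plain loop does the same job. A minimal sketch, where token_lists is a hypothetical stand-in for df['Text'].tolist() (the first two rows are borrowed from the sample above, the third is made up), and 500 is the number of words asked for in the title:

```python
from collections import Counter

# Hypothetical stand-in for df['Text'].tolist(): one list of tokens per text.
token_lists = [
    ["Hi", "Amy", "Your", "mum", "sent", "me", "a", "text"],
    ["You", "forgot", "your", "inhaler"],
    ["Amy", "sent", "a", "text"],
]

counter = Counter()
for tokens in token_lists:
    counter.update(tokens)  # adds this text's word counts to the running total

# most_common(n) returns at most n (word, count) pairs, most frequent first.
top_500 = counter.most_common(500)
print(top_500[:4])
```

Note that Counter is case-sensitive, so "Your" and "your" are counted separately unless the tokens are lowercased first.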
