
Text analysis: finding the most common word in a column using Python

I have created a dataframe with just a column with the subject line.

df = activities.filter(['Subject'],axis=1)
df.shape

This returned this dataframe:

    Subject
0   Call Out: Quadria Capital - May Lo, VP
1   Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2   Columbia Partners: WW Worked (Not Sure Will Ev...
3   Meeting, Sophie, CFO, CDC Investment
4   Prospecting

I then tried to analyse the text with this code:

import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords) 

rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)

The error message I get is: 'Series' object has no attribute 'Subject'

The error is being thrown because you have converted df to a Series in this line:

df = activities.filter(['Subject'],axis=1)

So when you say:

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

df is the Series and does not have the attribute Subject. Try replacing it with:

txt = df.str.lower().str.replace(r'\|', ' ')

Or alternatively, don't filter your DataFrame down to a single column beforehand, and then

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

should work.

[UPDATE]

What I said above is incorrect; as pointed out, filter does not return a Series, but rather a DataFrame with a single column.

Data:

Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting

# read in the data
df = pd.read_clipboard(sep=',')
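
read_clipboard depends on what happens to be on the clipboard, so for a reproducible run you can build the same sample DataFrame directly (subject lines copied, truncation and all, from the data above):

```python
import pandas as pd

# Same sample data as above, constructed directly instead of via the clipboard
df = pd.DataFrame({'Subject': [
    'Call Out: Quadria Capital - May Lo, VP',
    'Call Out: Revelstoke - Anthony Hayes (Sr Assoc...',
    'Columbia Partners: WW Worked (Not Sure Will Ev...',
    'Meeting, Sophie, CFO, CDC Investment',
    'Prospecting',
]})

print(df.shape)  # (5, 1)
```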


Updated code:

  • Convert all words to lowercase and remove all non-alphanumeric characters.
    • txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series and will be replaced.
  • words = nltk.tokenize.word_tokenize(txt) throws a TypeError because txt is a Series.
    • The code below tokenizes each row of the dataframe instead.
  • Tokenizing splits each string into a list of words. In this example, looking at df will show a tok column, where each row is a list.
import nltk
import pandas as pd

top_N = 50

# replace all non-alphanumeric characters
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)

# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)
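
If NLTK's tokenizer models aren't available, a plain whitespace split gives the same result here, because sub_rep contains only word characters and spaces after the \W replacement. A minimal sketch on one of the subject lines from the question:

```python
import pandas as pd

df = pd.DataFrame({'Subject': ['Call Out: Quadria Capital - May Lo, VP']})

# After replacing every non-word character with a space, str.split()
# on whitespace yields the same tokens as word_tokenize would here
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)
df['tok'] = df.sub_rep.str.split()

print(df.tok.iloc[0])  # ['call', 'out', 'quadria', 'capital', 'may', 'lo', 'vp']
```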


  • To analyze all the words in the column, the individual rows' lists are combined into a single list, called words.
# all tokenized words to a list
words = df.tok.tolist()  # this is a list of lists
words = [word for list_ in words for word in list_]

# frequency distribution
word_dist = nltk.FreqDist(words)

# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

# output the results
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
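
Note that rslt is built from word_dist, not words_except_stop_dist, which is why stopwords such as not and will still appear in the output below. To exclude them, pass the filtered distribution to most_common instead. A small sketch with stand-in data (FreqDist subclasses collections.Counter, so a Counter behaves the same way):

```python
import pandas as pd
from collections import Counter

words = ['call', 'out', 'not', 'will', 'call']   # stand-in token list
stopwords = {'not', 'will'}  # stand-in for nltk.corpus.stopwords.words('english')

# Filtered frequency distribution; Counter mirrors nltk.FreqDist here
words_except_stop_dist = Counter(w for w in words if w not in stopwords)

rslt = pd.DataFrame(words_except_stop_dist.most_common(50),
                    columns=['Word', 'Frequency'])
print(rslt)  # only 'call' and 'out' remain
```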

Output rslt:

        Word  Frequency
        call          2
         out          2
     quadria          1
     capital          1
         may          1
          lo          1
          vp          1
  revelstoke          1
     anthony          1
       hayes          1
          sr          1
       assoc          1
    columbia          1
    partners          1
          ww          1
      worked          1
         not          1
        sure          1
        will          1
          ev          1
     meeting          1
      sophie          1
         cfo          1
         cdc          1
  investment          1
 prospecting          1
