I have created a DataFrame with just a single column containing the subject line.
df = activities.filter(['Subject'],axis=1)
df.shape
This returned the following DataFrame:
Subject
0 Call Out: Quadria Capital - May Lo, VP
1 Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2 Columbia Partners: WW Worked (Not Sure Will Ev...
3 Meeting, Sophie, CFO, CDC Investment
4 Prospecting
I then tried to analyse the text with this code:
import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
The error message I get is: 'Series' object has no attribute 'Subject'
The error is being thrown because you have converted df
to a Series in this line:
df = activities.filter(['Subject'],axis=1)
So when you say:
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
df is a Series and does not have the attribute Subject. Try replacing with:
txt = df.str.lower().str.replace(r'\|', ' ')
Alternatively, don't filter your DataFrame down to a single column first; then
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
should work.
[UPDATE]
What I said above is incorrect; as was pointed out, filter does not return a Series, but rather a DataFrame with a single column.
Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting
# read in the data
df = pd.read_clipboard(sep=',')
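As a quick sanity check that filter with a list of column names keeps the DataFrame type, here is a minimal sketch with a made-up two-row frame standing in for the activities data:

```python
import pandas as pd

# hypothetical stand-in for the activities data
activities = pd.DataFrame({'Subject': ['Call Out', 'Prospecting'],
                           'Owner': ['a', 'b']})
df = activities.filter(['Subject'], axis=1)
print(type(df).__name__)  # DataFrame, not Series
print(df.shape)           # (2, 1)
```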
txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series, and words = nltk.tokenize.word_tokenize(txt) throws a TypeError because txt is a Series, not a str or list. Instead, tokenize each row with apply; in this example, looking at df will show a tok column, where each row is a list.
import nltk
import pandas as pd
import pandas as pd
top_N = 50
# replace all non-alphanumeric characters
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)
# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)
# all tokenized words to a list
words = df.tok.tolist() # this is a list of lists
words = [word for list_ in words for word in list_]
# frequency distribution
word_dist = nltk.FreqDist(words)
# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
# output the results
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
rslt:
    Word  Frequency
call 2
out 2
quadria 1
capital 1
may 1
lo 1
vp 1
revelstoke 1
anthony 1
hayes 1
sr 1
assoc 1
columbia 1
partners 1
ww 1
worked 1
not 1
sure 1
will 1
ev 1
meeting 1
sophie 1
cfo 1
cdc 1
investment 1
prospecting 1
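The flatten-then-count steps above can be sketched in isolation; nltk.FreqDist subclasses collections.Counter, so a plain Counter (with made-up token lists) shows the same mechanics without needing the NLTK data files:

```python
from collections import Counter

tok_lists = [['call', 'out'], ['call', 'in'], ['meeting']]  # like df.tok.tolist()
words = [w for list_ in tok_lists for w in list_]           # flatten the list of lists
word_dist = Counter(words)
print(word_dist.most_common(1))  # [('call', 2)]

# stopword removal rebuilds the counter from the filtered words
stopwords = {'in', 'out'}
filtered = Counter(w for w in words if w not in stopwords)
print(filtered.most_common(2))   # [('call', 2), ('meeting', 1)]
```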