How to count the 500 most common words in a pandas dataframe
I have a dataframe with 500 texts in a column called Text (one text per row), and I want to count the most common words across all texts.
So far I have tried (both methods from Stack Overflow):
pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]
and
Counter(" ".join(df["Text"]).split()).most_common(100)
Both gave me the following error:
TypeError: sequence item 0: expected str instance, list found
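I think this happens because `' '.join` expects an iterable of strings, while every row of Text is already a tokenized list. A minimal reproduction with made-up tokens:

```python
import pandas as pd

# Each row of 'Text' is a list of tokens, not a string,
# so str.join raises the TypeError above.
df = pd.DataFrame({'Text': [['business', 'date', 'jan'], ['group', 'english']]})
try:
    ' '.join(df['Text'])
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, list found
```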
I have also tried the Counter method on its own with
df.Text.apply(Counter)
which gave me the word count in each text; I also altered the Counter method so it returned the most common words in each text.
But I want the overall most common words.
Here is a sample of the dataframe (the text is already lowercased, cleaned of punctuation, tokenized, and stop words have been removed):
Datum File File_type Text length len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220
Edit: code to reproduce it:
for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]
    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        text = rfile.read()
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    # count_the and length are computed elsewhere
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)
New cell:
df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')
d = re.compile(r'\d+')
# the iterrows loop was redundant: these column-wise operations only need to run once
df['Text'] = df['Text'].str.replace('\n', ' ')
df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)
Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0
And a description of the dataframe:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB
Thanks in advance.
OK, I got it. Your
df['Text']
consists of lists of texts, so you can do this:
full_list = [] # list containing all words of all texts
for elmnt in df['Text']: # loop over lists in df
full_list += elmnt # append elements of lists to full list
val_counts = pd.Series(full_list).value_counts() # make temporary Series to count
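For example, with a small made-up frame in the shape of your Text column:

```python
import pandas as pd

# Hypothetical sample resembling the question's tokenized Text column
df = pd.DataFrame({'Text': [['heineken', 'starts', 'integration'],
                            ['bat', 'acquisition', 'starts']]})

full_list = []                  # list containing all words of all texts
for elmnt in df['Text']:        # loop over lists in df
    full_list += elmnt          # append elements of lists to full list

val_counts = pd.Series(full_list).value_counts()
print(val_counts.head(1))       # 'starts' occurs twice, everything else once
```

Slice `val_counts[:100]` (or `val_counts[:500]`) for the top words.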
This solution avoids using too many list comprehensions and thus keeps the code easy to read and understand. Furthermore, no additional modules like
re
or collections
are needed.
Here is my version: I convert the column values into a list, then build a flat list of words, clean it, and you have your counter:
import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(100)
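Note that `ast.literal_eval` is only needed if the lists were at some point serialized to strings (for example after a CSV round-trip). If, as in the question, the column holds real Python lists, you can flatten directly; a sketch with made-up data:

```python
import collections
import pandas as pd

# Hypothetical data: two token lists plus a NaN row
df = pd.DataFrame({'Text': [['date', 'may', 'cobham'],
                            ['date', 'nov', 'intec'],
                            float('nan')]})

your_text_list = df['Text'].tolist()
no_nan = [x for x in your_text_list if isinstance(x, list)]  # drop NaN rows
flat_list = [word for doc in no_nan for word in doc]          # flatten

top_words = collections.Counter(flat_list).most_common(100)
print(top_words[0])  # ('date', 2)
```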
You can do it via the
apply
and Counter.update
methods:
from collections import Counter
import pandas as pd

counter = Counter()
df = pd.DataFrame({'Text': values})  # values: a list of token lists
_ = df['Text'].apply(lambda x: counter.update(x))
counter.most_common(10)
Out:
[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]
Where
df['Text']
is:
0 [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
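A self-contained version of the same idea, with made-up token lists standing in for `values`:

```python
from collections import Counter
import pandas as pd

values = [['business', 'date', 'jan'],   # hypothetical token lists
          ['group', 'date', 'feb']]

counter = Counter()
df = pd.DataFrame({'Text': values})
_ = df['Text'].apply(counter.update)     # update the shared counter per row

print(counter.most_common(1))  # [('date', 2)]
```

Note that `counter` accumulates across calls, so re-running `apply` on the same column doubles the counts; start from a fresh `Counter()` each time.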