Count most frequent 100 words from sentences in Dataframe Pandas
How to count 500 most common words in pandas dataframe
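I have a DataFrame with 500 texts in a column named Text (one text per row), and I want to count the most common words across all of the texts.
So far I have tried two approaches, both taken from Stack Overflow: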
pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]
and
Counter(" ".join(df["Text"]).split()).most_common(100)
Both gave me the following error:
TypeError: sequence item 0: expected str instance, list found
I have also tried the simple Counter approach
df.Text.apply(Counter)
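which gave me the word counts within each text. I also adapted the Counter approach so that it returned the most common words in each text,
but what I want is the most common words overall.
Here is a sample of the DataFrame (the text has already been lowercased, stripped of punctuation, tokenized, and had stop words removed):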
Datum File File_type Text length len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220
Edit: the code to recreate it:
for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]
    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        format
        text = rfile.read()
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    # count_the and length are computed in code that is not shown here
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)
New cell:
import re
from nltk.tokenize import word_tokenize  # assuming NLTK's tokenizer

df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')  # punctuation
d = re.compile(r'\d+')  # digits
# The .str.replace calls act on the whole column, so the original iterrows() loop is not needed
df['Text'] = df['Text'].str.replace('\n', ' ')
df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)  # each cell becomes a list of tokens
Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0
And the description of the DataFrame:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB
Thanks in advance.
OK, I see. Your df['Text'] contains lists of words rather than strings, so you can do this:
full_list = []  # list containing all words of all texts
for elmnt in df['Text']:  # loop over lists in df
    full_list += elmnt  # append elements of lists to full list
val_counts = pd.Series(full_list).value_counts()  # make temporary Series to count
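This solution avoids heavy use of list comprehensions, which keeps the code easy to read and understand. It also needs no additional modules such as re or collections.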
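To make the cause of the TypeError concrete, here is a minimal sketch on a small made-up DataFrame (the name df_demo and its contents are assumptions for illustration, not the asker's data). It reproduces the error and then counts the words with Series.explode, which is available in pandas 0.25 and later and is offered here as an alternative to the loop above, not as the answerer's method:

```python
import pandas as pd

# Hypothetical sample data: each cell is already a list of tokens.
df_demo = pd.DataFrame({
    "Text": [
        ["business", "date", "jan", "heineken"],
        ["date", "feb", "acquisition"],
        ["date", "business", "acquisition", "date"],
    ]
})

# str.join expects strings, so joining a column of lists raises TypeError.
try:
    " ".join(df_demo["Text"])
    join_error = None
except TypeError as exc:
    join_error = str(exc)

# explode() puts every token on its own row; value_counts() then counts them,
# sorted from most to least frequent.
word_counts = df_demo["Text"].explode().value_counts()
top_words = word_counts.head(2)

print(join_error)
print(top_words)
```

Replacing head(2) with head(100) or head(500) would give the larger rankings asked about in the question.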
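Here is my version: I convert the column values to a list, build a flat list of words from it, drop the missing values, and then you have your counter: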
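import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']
# ast.literal_eval assumes each cell is a string such as "['a', 'b']"; it is not needed if the cells are already lists
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(100)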
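The ast.literal_eval step only applies when the token lists are stored as text, which typically happens after a DataFrame has been written to CSV and read back. A hedged sketch of that situation follows; df_str and the helper to_tokens are hypothetical names introduced for illustration, and the helper also passes through cells that are already lists so the same code works in both cases:

```python
import ast
import collections

import pandas as pd

# Hypothetical data: token lists stored as strings, plus one missing value,
# roughly as they might look after a CSV round trip.
df_str = pd.DataFrame({
    "Text": ["['date', 'jan', 'heineken']", "['date', 'feb']", float("nan")]
})


def to_tokens(cell):
    """Return a list of tokens for one cell (hypothetical helper)."""
    if isinstance(cell, list):
        return cell                    # already tokenized
    if isinstance(cell, str):
        return ast.literal_eval(cell)  # parse "['a', 'b']" into a list
    return []                          # NaN or anything else: no tokens


flat_list = [word for cell in df_str["Text"] for word in to_tokens(cell)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(1)
print(top_words)
```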
You can do it with the apply and Counter.update methods:
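from collections import Counter
import pandas as pd

counter = Counter()
df = pd.DataFrame({'Text': values})  # values: the list of token lists shown below
_ = df['Text'].apply(lambda x: counter.update(x))  # update() adds each row's tokens in place
counter.most_common(10)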
Out:
[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]
where df['Text'] is:
0 [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
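Since the values list is not shown in full, here is a self-contained sketch of the same Counter.update idea. The first two rows are copied from the output above and the third is made up for illustration; df_small and lower_counter are hypothetical names. A plain for loop is used instead of apply because update() works by side effect and returns None:

```python
from collections import Counter

import pandas as pd

# Two rows taken from the output above plus one invented row.
values = [
    ["Hi", "Amy", "Your", "mum", "sent", "me", "a", "text"],
    ["You", "forgot", "your", "inhaler"],
    ["Amy", "sent", "a", "text"],  # hypothetical row
]
df_small = pd.DataFrame({"Text": values})

# Accumulate counts across all rows in a single Counter.
counter = Counter()
for tokens in df_small["Text"]:
    counter.update(tokens)

# Optional: merge case variants such as "Your" and "your".
lower_counter = Counter()
for word, n in counter.items():
    lower_counter[word.lower()] += n

print(counter.most_common(4))
print(lower_counter["your"])
```

Lowercasing is unnecessary for the asker's data, which was already lowercased during cleaning, but it matters for mixed-case input like this example.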