Check the number of stopwords in a text column in pandas
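How can I count the number of stopwords in a text column of a pandas DataFrame? I have a huge dataset, so an efficient approach would be much appreciated.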
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(df)
                        text
0      stackoverflow is good
1  stackoverflow is not good
This is the output I want:
print(df)
                        text  number_of_stopwords
0      stackoverflow is good                    1
1  stackoverflow is not good                    2
I have tried something like the following, but it does not work:
df.str.split().apply(lambda x: len(x in stop_words))
Use a set intersection:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df['n'] = df['text'].str.split().apply(lambda x: len(set(x) & stop_words))
Or:
df['n'] = df['text'].apply(lambda x: len(set(x.split()) & stop_words))
print(df)
                        text  n
0      stackoverflow is good  1
1  stackoverflow is not good  2
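One thing to keep in mind with the intersection approach is that a set keeps each word only once, so it counts distinct stopwords per row rather than every occurrence. The sketch below illustrates the difference. It assumes a small hand-written `stop_words` set as a stand-in for the NLTK list (so it runs without downloading the corpus), and the extra row, the lowercasing step, and the column names `n_unique` and `n_total` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english'));
# the real NLTK list is much longer.
stop_words = {"is", "not", "it", "what"}

df = pd.DataFrame({
    "text": [
        "stackoverflow is good",
        "stackoverflow is not good",
        "it is what it is",  # contains repeated stopwords
    ]
})

# Lowercase first (an assumption: the NLTK list is lowercase), then tokenize on whitespace.
tokens = df["text"].str.lower().str.split()

# Distinct stopwords per row (set intersection).
df["n_unique"] = tokens.apply(lambda words: len(set(words) & stop_words))

# Every stopword occurrence per row (repeats are counted).
df["n_total"] = tokens.apply(lambda words: sum(w in stop_words for w in words))

print(df)
```

For the two rows in the question both counts agree; they only diverge when a stopword repeats within a row, as in the third row here.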
You can use a list comprehension:
df['number_of_stopwords'] = df.text.apply(lambda x: len([i for i in x.split() if i in stop_words]))
df['not_in_stopwords'] = df.text.apply(lambda x: len([i for i in x.split() if i not in stop_words]))
It also performs well:
df = pd.concat([df] * 1000001, ignore_index=True)  # DataFrame.append was removed in pandas 2.0; same row count as df.append([df]*1000000)
%timeit df.text.apply(lambda x: len([i for i in x.split() if i in stop_words]))
2.27 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df['text'].str.split().apply(lambda x: len(set(x) & stop_words))
3.29 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
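The `%timeit` lines above only work inside IPython or Jupyter. A rough way to repeat the comparison in a plain Python script is sketched below using the standard `timeit` module. The stand-in `stop_words` set, the helper names `count_with_comprehension` and `count_with_intersection`, and the smaller replication factor are all assumptions made for illustration, so the absolute numbers will not match the ones reported above.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"is", "not"}

base = pd.DataFrame({"text": ["stackoverflow is good", "stackoverflow is not good"]})
# Smaller than the benchmark above so the sketch finishes quickly.
df = pd.concat([base] * 10_000, ignore_index=True)


def count_with_comprehension(frame):
    # Counts every stopword occurrence in each row.
    return frame["text"].apply(lambda x: len([w for w in x.split() if w in stop_words]))


def count_with_intersection(frame):
    # Counts distinct stopwords in each row.
    return frame["text"].str.split().apply(lambda x: len(set(x) & stop_words))


for func in (count_with_comprehension, count_with_intersection):
    # Take the best of three single runs to reduce noise.
    seconds = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
    print(f"{func.__name__}: {seconds:.3f} s")
```

Timings depend on the pandas version, the hardware, and the text itself, so it is worth measuring on a sample of the real data before choosing one variant.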