How to remove punctuation and stop words using Python
I have a Python function that is supposed to remove punctuation and stop words, but when I print the result they are not removed.
Where is the bug in my function?
Code:
from collections import Counter

from nltk.corpus import stopwords
from string import punctuation

ppt = '''...!@#$%^&*(){}[]|._-`/?:;"'\,~12345678876543'''

def text_process(raw_text):
    '''
    Parameters:
    ===========
    raw_text: text as input

    Functions:
    ==========
    - remove all punctuation
    - remove all stop words
    - return a list of the cleaned text
    '''
    # check characters to see if they are in punctuation
    nopunc = [char for char in list(raw_text) if char not in ppt]
    # join the characters again to form the string
    nopunc = "".join(nopunc)
    # now just remove any stopwords
    return [word for word in nopunc.lower().split() if word.lower() not in stopwords.words("english")]

def_test_twtr_preds["tokens"] = def_test_twtr_preds["processed_TEXT"].apply(text_process)

# get most common words in dataset
all_words = []
for line in list(def_test_twtr_preds["processed_TEXT"]):
    words = line.split()
    for word in words:
        all_words.append(word.lower())
print("Most common words:\n{}".format(Counter(all_words).most_common(10)))
When I display the most common words in the dataset, the result is:
Most common words:
[('the', 281), ('and', 103), ('words', 81), ('…', 70), ('are', 61), ('word', 57), ('for', 55), ('you', 48), ('this', 40), ('.', 34)]
Note that list('your text') produces ['y', 'o', 'u', 'r', ' ', 't', 'e', 'x', 't'], not ['your', 'text'].
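A quick way to see the difference (plain Python, no NLTK needed):

```python
# list() splits a string into individual characters, including spaces
chars = list("your text")
print(chars)   # ['y', 'o', 'u', 'r', ' ', 't', 'e', 'x', 't']

# str.split() splits on whitespace, giving whole words
words = "your text".split()
print(words)   # ['your', 'text']
```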
You can remove the punctuation with nopunc = [w for w in raw_text.split() if w.isalpha()].
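For example, str.isalpha() is False for any token that contains a non-letter character, so this filter keeps only purely alphabetic words (the sample sentence here is just for illustration):

```python
raw = "I'm fine today"
# keep only tokens made up entirely of letters
nopunc = [w for w in raw.split() if w.isalpha()]
print(nopunc)  # ['fine', 'today'] -- "I'm" is dropped because of the apostrophe
```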
However, the code above will also drop the word I'm in I'm fine. So if you want to get ['I', 'm', 'fine'], you can use the following code instead:
import nltk

tokenizer = nltk.RegexpTokenizer(r"\w+")
nopunc = tokenizer.tokenize(raw_text)
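If NLTK is not installed, the same \w+ pattern can be applied with Python's built-in re module; this is a rough stand-in for the tokenizer above, not NLTK's exact implementation:

```python
import re

def tokenize(raw_text):
    # \w+ matches runs of letters, digits, and underscores,
    # so punctuation acts as a separator and is discarded
    return re.findall(r"\w+", raw_text)

print(tokenize("I'm fine"))  # ['I', 'm', 'fine']
```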