Stemming words with NLTK (Python)

Question

I'm new to text processing in Python and I'm trying to stem the words in a text document of about 5,000 lines.

I wrote the script below:

from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description ):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text() 
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text) 
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                       

    stops = set(stopwords.words("english"))                  
    # 4. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    # 5. stem words
    words = ([stemmer.stem(w) for w in words])

    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))   

clean_Description = Description_to_words(train["Description"][15])

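But when I test it, the resulting words are not stemmed. Can anyone help me find the problem? What am I doing wrong in the "Description_to_words" function?

Also, when I run the stem command separately, as below, it works.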

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>> 
>>> for w in words:
...     print(stemmer.stem(w))
... 
mobil
app
-
unabl
to
add
read

Answer 1

Here is each step of your function, fixed.

Remove HTML.

 Description_text = BeautifulSoup(raw_Description).get_text()

Remove non-letters, but don't remove whitespace just yet. You can also simplify your regex a bit.
```
letters_only = re.sub(r"[^\w\s]", " ", Description_text)
```
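To see what the change of pattern does, here is a minimal sketch comparing the question's `[^a-zA-Z]` with `[^\w\s]`. The sample string is made up for illustration and is not from the asker's data. Note that `\w` also matches digits and underscores, so the second pattern keeps a token such as `42` that the first one drops.

```python
import re

# Hypothetical sample text, used only to show the difference.
sample = "MOBILE APP - Unable to add reading #42"

# Pattern from the question: anything that is not an ASCII letter becomes a space.
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Pattern from the answer: only punctuation is replaced. \w keeps letters,
# digits and underscores; \s keeps the existing whitespace.
no_punct = re.sub(r"[^\w\s]", " ", sample)

print(letters_only.split())
print(no_punct.split())
```

If digits should be dropped as well, the original `[^a-zA-Z]` pattern is still the one to use.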
Convert to lowercase and split into individual words: I recommend using word_tokenize again here.
```
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
```

Remove stop words.

 stops = set(stopwords.words("english"))
 meaningful_words = [w for w in words if w not in stops]

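Stem the words. This is the other problem: stem meaningful_words, not words. The question's code stems words but then returns meaningful_words, so the stemmed result is never used.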
```
return ' '.join(stemmer.stem(w) for w in meaningful_words)
```

Solution 1
3, accepted, 2017-08-14 08:46:23
