Stemming words with NLTK (python)
I am new to Python text processing. I am trying to stem the words in a text document that has around 5000 rows. I have written the script below:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # Import the stop word list
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
def Description_to_words(raw_Description ):
# 1. Remove HTML
Description_text = BeautifulSoup(raw_Description).get_text()
# 2. Remove non-letters
letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
# 3. Convert to lower case, split into individual words
words = letters_only.lower().split()
stops = set(stopwords.words("english"))
# 4. Remove stop words
meaningful_words = [w for w in words if not w in stops]
# 5. stem words
words = ([stemmer.stem(w) for w in words])
# 6. Join the words back into one string separated by space,
# and return the result.
return( " ".join( meaningful_words ))
clean_Description = Description_to_words(train["Description"][15])
But when I test it, the words are not stemmed. Can anyone help me see what the issue is? What am I doing wrong in the `Description_to_words` function? When I run the stemmer separately, as below, it works:
from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>>
>>> for w in words:
... print(stemmer.stem(w))
...
mobil
app
-
unabl
to
add
read
Here's each step of your function, fixed.
Remove HTML.
Description_text = BeautifulSoup(raw_Description).get_text()
Remove non-letters, but don't remove whitespace just yet. You can also simplify your regex a bit.
letters_only = re.sub(r"[^\w\s]", " ", Description_text)
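For instance, here is a quick stdlib-only check (the sample string is just an illustration) showing that this pattern drops punctuation while leaving the spaces in place:

```python
import re

sample = "MOBILE APP - Unable to add reading!"
# Replace anything that is neither a word character nor whitespace
cleaned = re.sub(r"[^\w\s]", " ", sample)
print(cleaned)  # -> "MOBILE APP   Unable to add reading "
```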
Convert to lower case, split into individual words. I recommend using word_tokenize again here.
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
Remove stop words.
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if not w in stops]
Stem words. Here is another issue: stem meaningful_words, not words.
return ' '.join(stemmer.stem(w) for w in meaningful_words)
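Putting the steps together, a self-contained sketch of the corrected function could look like the one below. To keep it runnable without extra downloads, it substitutes a plain regex for BeautifulSoup's HTML stripping and a tiny illustrative stop-word set for the NLTK stop-word corpus; in your real code, keep BeautifulSoup and set(stopwords.words("english")).

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Tiny illustrative stop-word set; in practice use
# set(stopwords.words("english")) from nltk.corpus.
STOPS = {"to", "the", "a", "an", "is", "of"}

def description_to_words(raw_description):
    # 1. Strip HTML tags (a plain-regex stand-in for BeautifulSoup)
    text = re.sub(r"<[^>]+>", " ", raw_description)
    # 2. Remove punctuation but keep whitespace
    letters_only = re.sub(r"[^\w\s]", " ", text)
    # 3. Lower-case and split into words
    words = letters_only.lower().split()
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in STOPS]
    # 5. Stem the *filtered* words -- stemming `words` while returning
    #    `meaningful_words` was the bug in the original function
    return " ".join(stemmer.stem(w) for w in meaningful_words)

print(description_to_words("MOBILE APP - Unable to add reading"))
# -> "mobil app unabl add read"
```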