
How to solve missing words in nltk.corpus.words.words()?

I am trying to remove non-English words from a text. The problem is that many valid English words are missing from the NLTK words corpus.

My code:

import nltk
import pandas as pd

nltk.download('words')
words = set(nltk.corpus.words.words())

lst = ['I have equipped my house with a new [xxx] HP203X climatisation unit']
df = pd.DataFrame(lst, columns=['Sentences'])

# Keep only tokens whose lowercase form appears in the corpus
df['Sentences'] = df['Sentences'].apply(
    lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words)
)
df

Input: I have equipped my house with a new [xxx] HP203X climatisation unit
Result: I have my house with a new unit

Should have been: I have equipped my house with a new climatisation unit

I can't figure out how to extend nltk.corpus.words.words() so that words like equipped and climatisation are not removed from the sentences.
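The behaviour can be reproduced without downloading anything, using a small hand-made set in place of the full corpus (an assumption for illustration: the corpus contains the common words but not "equipped" or "climatisation"):

```python
# Small stand-in for nltk.corpus.words.words(); the real corpus is much
# larger but, as observed, also lacks "equipped" and "climatisation".
words = {"i", "have", "my", "house", "with", "a", "new", "unit"}

def keep_known(tokens, vocab):
    """Keep only tokens whose lowercase form is in vocab."""
    return [t for t in tokens if t.lower() in vocab]

tokens = "I have equipped my house with a new HP203X climatisation unit".split()
print(" ".join(keep_known(tokens, words)))
# "equipped", "HP203X" and "climatisation" are all dropped
```

Any token absent from the set is silently discarded, which is exactly why the filtered sentence loses those words.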

You can use

words.update(['climatisation', 'equipped'])

Here, words is a set, which is why .extend(word_list) did not work: .extend() is a list method, and the set equivalent for adding the elements of an iterable is .update().
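A minimal sketch of the difference, using a tiny example set rather than the NLTK corpus:

```python
# Sets add elements from an iterable with .update(), not .extend().
words = {"house", "unit"}
words.update(["climatisation", "equipped"])
print(sorted(words))  # ['climatisation', 'equipped', 'house', 'unit']

# Calling the list method .extend() on a set raises AttributeError.
try:
    words.extend(["new"])
except AttributeError:
    print("sets have no .extend() method")
```

After the update, the membership test w.lower() in words succeeds for the added words, so they survive the filtering step.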
