Why won't the stopwords be filtered in my program?

I mainly use the stopword list from NLTK, as the code below shows:

from nltk.corpus import stopwords
stopword_nltk=stopwords.words('french')
motoutil=['après', 'avant', 'avex', 'chez', '\ba\b', 'et', 'concernant', 'contre', 'dans', 'depuis', 'derrière', 'dès', 'devant', 'durant', 'en', 'entre', 'envers', 'hormis', 'hors', 'jusque', 'malgré', 'moyennant', 'nonobstant', 'outre', 'par', 'parmi pendant', 'pour', 'près', 'sans', 'sauf', 'selon', 'sous', 'suivant', 'sur', 'touchant', 'vers', 'via', 'tout','tous', 'toute', 'toutes', 'jusqu']
stopwords_list=stopword_nltk+motoutil

The problem is not caused by adding my own list to stopword_nltk: even if I remove motoutil, the program still doesn't do what I need.

And this is the part where I try to remove the stopwords:

for line in f_in.readlines():
    new_line=re.sub('\W',' ', line.lower())
    list_word=new_line.split()
    for element in list_word:
        if element in stopwords_list:
            cleaned_line=re.sub(element, ' ', new_line)
            f_out_trameur.write(cleaned_line)
            f_out_cleaned.write(cleaned_line)

It has two problems:

Firstly, not all of the listed stopwords are removed, 'et' for example.

Secondly, I also want to delete the words 'de' and 'ce', but not those two sequences of letters when they occur inside a word. For example, in the extract "madame monsieur le président de l'assemblée nationale", the 'de' preceding the word 'président' should be removed, but not the 'de' inside 'président'; with my current script, 'président' becomes 'prési nt'.
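To make the second problem concrete, here is a minimal reproduction of the substring replacement (using the extract quoted above):

import re

line = "madame monsieur le président de l'assemblée nationale"
# re.sub replaces 'de' wherever it occurs, even inside a word
print(re.sub('de', ' ', line))
# madame monsieur le prési nt   l'assemblée nationale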

Do I see it right that you are creating and writing the cleaned line inside the inner loop that iterates over the tokens produced by new_line.split()? And if nothing is found to clean, the line is not written at all?

This would result in multiple versions of lines that contain stopwords (one version per stopword removed), while lines which do not contain stopwords are simply skipped. It also explains your second problem: re.sub(element, ' ', new_line) replaces the stopword wherever it appears in the string, including inside words such as 'président'.
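A small trace makes this concrete (a sketch of your inner loop with print in place of the file writes; the sample line is made up):

import re

stopwords_list = ['et', 'de']
new_line = 'le chat et le chien de la maison'
for element in new_line.split():
    if element in stopwords_list:
        # one output line per matching token; each starts again from new_line
        print(re.sub(element, ' ', new_line))
# le chat   le chien de la maison
# le chat et le chien   la maison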

What I would suggest: since you already have the tokens (from split()), just use those to build the new line instead of substituting the stopwords out of the string.

This also allows you to convert the list of stopwords to a set, making the check element in stopwords_list much faster, since this list is usually large and membership tests on lists get slow for large amounts of words. This is almost always a good way to speed things up when using NLTK's stopwords.
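For illustration, a rough micro-benchmark of the membership test (a sketch with a made-up word list; the absolute numbers will vary on your machine):

import timeit

stopwords_list = ['mot%d' % i for i in range(1000)]
stopwords_set = set(stopwords_list)

# a list is scanned element by element; a set hashes the key
print(timeit.timeit("'mot999' in stopwords_list", globals=globals(), number=10000))
print(timeit.timeit("'mot999' in stopwords_set", globals=globals(), number=10000))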

I would also recommend using a list comprehension to avoid deeply nested loops and conditions and to make the code more readable, but this is just a personal preference.

import re
from nltk.corpus import stopwords

stopword_nltk = stopwords.words('french')
# additional prepositions and determiners to filter out;
# plain tokens only: the set is compared against words from split(), not regexes
motoutil = ['après', 'avant', 'avec', 'chez', 'a', 'et', 'concernant', 'contre',
            'dans', 'depuis', 'derrière', 'dès', 'devant', 'durant', 'en',
            'entre', 'envers', 'hormis', 'hors', 'jusque', 'malgré', 'moyennant',
            'nonobstant', 'outre', 'par', 'parmi', 'pendant', 'pour', 'près',
            'sans', 'sauf', 'selon', 'sous', 'suivant', 'sur', 'touchant',
            'vers', 'via', 'tout', 'tous', 'toute', 'toutes', 'jusqu']
stopwords_set = set(stopword_nltk + motoutil)

for line in f_in.readlines():
    new_line = re.sub(r'\W', ' ', line.lower())
    # keep only the tokens that are not stopwords, then rebuild the line
    list_word = [word for word in new_line.split() if word not in stopwords_set]
    cleaned_line = ' '.join(list_word)
    f_out_trameur.write(cleaned_line)
    f_out_cleaned.write(cleaned_line)

Note that write() does not add a newline character \n, so you may need to append one (f_out_trameur.write(cleaned_line + '\n') and f_out_cleaned.write(cleaned_line + '\n')), depending on how you want your output file to look.
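For completeness, here is a sketch of how the file handles used above might be opened and the whole loop put together (the file names input.txt, trameur.txt and cleaned.txt are placeholders, not from the question):

import re
from nltk.corpus import stopwords

stopwords_set = set(stopwords.words('french'))

with open('input.txt', encoding='utf-8') as f_in, \
     open('trameur.txt', 'w', encoding='utf-8') as f_out_trameur, \
     open('cleaned.txt', 'w', encoding='utf-8') as f_out_cleaned:
    for line in f_in:
        # lowercase, strip non-word characters, drop stopword tokens
        tokens = [w for w in re.sub(r'\W', ' ', line.lower()).split()
                  if w not in stopwords_set]
        cleaned_line = ' '.join(tokens)
        f_out_trameur.write(cleaned_line + '\n')
        f_out_cleaned.write(cleaned_line + '\n')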
