
Problem with Python/NLTK Stop Words and File Write

I am trying to write a list of NLTK stop words to a file.

So, I wrote this script:

import nltk
from nltk.corpus import stopwords
from string import punctuation

file_name = 'OUTPUT.CSV'
file = open(file_name, 'w+')  
_stopwords = set(stopwords.words('english')+list(punctuation)) 
i = 0
file.write(f'\n\nSTOP WORDS:+++\n\n')
for w in _stopwords:
    i=i+1
    out1 = f'{i:3}. {w}\n'
    out2 = f'{w}\n'
    out3 = f'{i:3}. {w}'
    file.write(out2)
    print(out3)

file.close()

The original program used file.write(w), but since I ran into problems, I started trying variations.

First I tried file.write(out1). That works, but the order of the stop words appears to be random.

What's interesting is that with file.write(out2), only some of the stop words get written: the count varies from run to run, is always short of 211, and the order again appears random. I see the same problem in both Visual Studio 2017 and Jupyter Notebook.

For example, the last run wrote 175 words ending with:

its
wouldn
shan 

Using file.write(out1), I get all 211 words, and the column ends like this:

209. more
210. have
211. ,

Has anyone run into a similar problem? Any idea what may be going on?

I'm new to Python/NLTK so I decided to ask.

You are getting the stop words in a random order because you are using a set.

_stopwords = set(stopwords.words('english')+list(punctuation)) 

A set is an unordered collection with no duplicate elements.

Unlike a list, where the elements are stored in order, the iteration order of a set is undefined. Set elements are generally not stored in the order in which they were added; this is what lets a membership test run faster than scanning through every element.

You can use this simple example to check this:

test = set('abcd')
for i in test: 
    print(i) 

The output order can differ between runs and machines. For example, I tried it on two different systems. On the first system I got:

a
d
b
c

and on the second system:

d
c
a
b
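If a repeatable order is all you need, one option is to sort the set before iterating. The following is a minimal sketch, not the asker's original script: sample_words is a hypothetical stand-in for stopwords.words('english') so that the example runs without the NLTK data installed.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); note the duplicate 'its'.
sample_words = ['its', 'wouldn', 'shan', 'more', 'have', 'its']

# The set removes duplicates but has no defined iteration order.
unique_items = set(sample_words + list(punctuation))

# sorted() returns a list, so the order is the same on every run and machine.
ordered_items = sorted(unique_items)

for number, item in enumerate(ordered_items, start=1):
    print(f'{number:3}. {item}')
```

With the real stop word list, the same sorted(...) call would replace the bare set in the for loop.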

If you need set-like behavior with a predictable order, there are ordered-set alternatives.
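One such alternative, sketched here under the assumption that you are on Python 3.7 or later, is to use the keys of a dict: they are unique and keep insertion order. Again, sample_words is a hypothetical stand-in for the NLTK list.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); 'its' appears twice.
sample_words = ['its', 'wouldn', 'shan', 'its', 'more']

# dict.fromkeys() keeps the first occurrence of each item, in insertion order.
unique_in_order = list(dict.fromkeys(sample_words + list(punctuation)))

print(unique_in_order[:4])  # ['its', 'wouldn', 'shan', 'more']
```

This keeps the words in the order NLTK returns them, which sorting would not.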


Besides, I've checked that all three variants, out1, out2, and out3, give 211 stop words.
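To run the same kind of check yourself, you could write the file inside a with block, so that it is flushed and closed before you inspect it, and then read it back as plain text and count the lines. This is only a sketch: the file name stopwords_check.txt and sample_words are hypothetical, and the real script would use the NLTK list instead.

```python
import os
import tempfile
from string import punctuation

# Hypothetical stand-in for stopwords.words('english').
sample_words = ['its', 'wouldn', 'shan', 'more', 'have']
items = sorted(set(sample_words + list(punctuation)))

# Hypothetical output path in the system temp directory.
file_name = os.path.join(tempfile.gettempdir(), 'stopwords_check.txt')

# The with block closes the file, which flushes everything to disk.
with open(file_name, 'w', encoding='utf-8') as out_file:
    for item in items:
        out_file.write(f'{item}\n')

# Read the file back as plain text and compare against what was written.
with open(file_name, encoding='utf-8') as in_file:
    lines_written = in_file.read().splitlines()

print(len(items), len(lines_written))
```

If the two counts match here but a spreadsheet or CSV viewer shows fewer rows, the difference would lie in how that viewer parses the file, which this sketch does not examine.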
