I am trying to write a list of NLTK stop words to a file, so I wrote this script:
import nltk
from nltk.corpus import stopwords
from string import punctuation
file_name = 'OUTPUT.CSV'
file = open(file_name, 'w+')
_stopwords = set(stopwords.words('english')+list(punctuation))
i = 0
file.write(f'\n\nSTOP WORDS:+++\n\n')
for w in _stopwords:
    i = i + 1
    out1 = f'{i:3}. {w}\n'
    out2 = f'{w}\n'
    out3 = f'{i:3}. {w}'
    file.write(out2)
    print(out3)
file.close()
The original program used file.write(w), but since I encountered problems, I started trying things.
So I tried file.write(out1). That works, but the order of the stop words appears to be random.
What's interesting is that if I use file.write(out2), only some of the stop words get written, in what appears to be random order. The count varies from run to run but is always short of 211. I experience the same problem in both Visual Studio 2017 and Jupyter Notebook.
For example, the last run wrote 175 words ending with:
its
wouldn
shan
Using file.write(out1), I get all 211 words, and the column ends like this:
209. more
210. have
211. ,
Has anyone run into a similar problem? Any idea what may be going on?
I'm new to Python/NLTK so I decided to ask.
The stop words come out in a random order because of the use of set:
_stopwords = set(stopwords.words('english')+list(punctuation))
A set is an unordered collection with no duplicate elements.
Unlike a list, where the elements are stored in order, the iteration order of a set is undefined. Set elements are usually not stored in order of insertion, which is what allows a membership check to be faster than scanning through all the elements.
You can use this simple example to check this:
test = set('abcd')
for i in test:
    print(i)
The output order can differ. For example, I tried it on two different systems and this is what I got. On the first system:
a
d
b
c
and on the second system:
d
c
a
b
If you need a stable order, there are alternatives that behave like ordered sets.
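As a minimal sketch of two such alternatives, assuming Python 3.7 or later: dict.fromkeys removes duplicates while keeping first-seen order, and sorted gives a reproducible alphabetical order. The words list here is a hypothetical stand-in for stopwords.words('english'), so the snippet runs without the NLTK data.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
words = ['i', 'me', 'my', 'me', 'its']

# dict keys are unique and keep insertion order (guaranteed since Python 3.7),
# so this removes duplicates without shuffling the items.
ordered_unique = list(dict.fromkeys(words + list(punctuation)))

# Alternatively, sort a set to get a reproducible (alphabetical) order.
sorted_unique = sorted(set(words + list(punctuation)))

print(ordered_unique[:4])
```

Either result can be iterated in place of _stopwords in the loop above, and the order will then be the same on every run.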
Besides, I've checked that all three of out1, out2, and out3 give 211 stop words.
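One way to run that kind of check yourself is to write the items and then read the file back as plain text, counting the lines. This is only a sketch: the words list is a hypothetical stand-in for the NLTK list, and the file goes to a temporary directory rather than your working directory.

```python
import os
import tempfile
from string import punctuation

# Hypothetical stand-in for the NLTK list, so the sketch runs without NLTK data.
words = ['its', 'wouldn', 'shan', 'more', 'have']
items = set(words + list(punctuation))

path = os.path.join(tempfile.mkdtemp(), 'OUTPUT.CSV')

# The with-block closes (and flushes) the file even if an error occurs.
with open(path, 'w', encoding='utf-8') as f:
    for w in items:
        f.write(f'{w}\n')

# Read the file back as plain text and count the lines.
with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()

print(len(lines), len(items))
```

If the two numbers match here but the file looks shorter elsewhere, the difference is more likely in how the .CSV file is being viewed than in what was written.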