Suppose I have the following text file:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
I want to add all unique words in this file to a list:
fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
        lst.sort()
print lst
But the output of the program is as follows:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and',
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief',
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what',
'window', 'with', 'yonder']
'and' and a few other words appear multiple times in the list. Which part of the loop should I change, so that I don't have any duplicate words? Thanks!
Here are the problems with your code, marked as comments; a corrected version follows:
fname = open("romeo.txt")    # better to open files in a `with` statement
lst = list()                 # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()     # not required, `split()` will do this anyway
    words = line.split(' ')  # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words    # this is the reason that you end up with duplicates... words is the list of all words for this line!
        lst.sort()           # don't sort in the for loop, just once afterwards.
print lst
So it almost works. However, you should be appending only the current word to the list, not all of the words that you got from the line with split(). If you simply changed the line:
lst = lst + words
to
lst.append(word)
it would work.
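To see the effect of that one line, here is a small sketch that runs both versions of the inner loop on in-memory lines instead of romeo.txt. The names buggy_unique and fixed_unique and the two-line sample are made up for illustration; they are not part of the question's code.

```python
def buggy_unique(lines):
    """The question's loop: adds the whole line's words at once."""
    lst = []
    for line in lines:
        words = line.split()
        for word in words:
            if word in lst:
                continue
            lst = lst + words  # adds every word on the line, repeats included
    return sorted(lst)


def fixed_unique(lines):
    """Same loop, but appends only the word that was just checked."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)
    return sorted(lst)


sample = ["It is the east and Juliet is the sun",
          "Arise fair sun and kill the envious moon"]

print(buggy_unique(sample).count("the"))  # prints 3: duplicates survive
print(fixed_unique(sample).count("the"))  # prints 1
```

In the buggy version, the first unseen word on a line pulls in the entire line, so any word that repeats within the line, or that was already in the list, gets added again.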
Here is a corrected version:
with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)
As others have suggested, a set is a good way to handle this. This is about as simple as it gets:
with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))
Using sorted(), you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, just do this:
with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
print(unique_words)
Reading the entire file into memory may not be viable for large files. You can use a generator to efficiently read the file without cluttering up the main code. This generator will read the file one line at a time and it will yield one word at a time. It will not read the entire file in one go, unless the file consists of one long line (which your sample data clearly doesn't):
def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
This is much easier in Python using sets:
with open("romeo.txt") as f:
    unique_words = set(f.read().split())
If you want to have a list, convert it afterwards:
unique_words = list(unique_words)
It might be nice to have them in alphabetical order:
unique_words.sort()
There are a few ways to achieve what you want.
1) Using lists:
fname = open("romeo.txt")
lst = list()
for word in fname.read().split():  # This will split by all whitespace, meaning that it will split by ' ' and '\n'
    if word not in lst:
        lst.append(word)
lst.sort()
print lst
2) Using sets:
fname = open("romeo.txt")
lst = list(set(fname.read().split()))
lst.sort()
print lst
A set simply ignores duplicates, so the check is unnecessary.
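A quick way to check this for yourself, as a sketch with a few hard-coded words rather than the file:

```python
words = "and the sun and the moon".split()

unique = set()
for word in words:
    unique.add(word)  # adding a word that is already present changes nothing

print(sorted(unique))  # ['and', 'moon', 'sun', 'the']
```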
If you want to get a set of unique words, you had better use a set, not a list, since a membership test such as word in lst scans the whole list and may be highly inefficient.
For counting words, you had better use a Counter object.
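As a rough sketch of the Counter suggestion (collections.Counter from the standard library), here is a version that uses a hard-coded string in place of romeo.txt so that it runs on its own. The variable names text and counts are just for illustration.

```python
from collections import Counter

text = ("It is the east and Juliet is the sun "
        "Arise fair sun and kill the envious moon")

counts = Counter(text.split())  # maps each word to how often it occurs

print(counts["the"])          # 3
print(sorted(counts))         # the unique words, sorted
print(counts.most_common(1))  # [('the', 3)]
```

Iterating over a Counter yields each distinct word once, so it gives you the unique words and their frequencies in a single pass.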
I would do:
with open('romeo.txt') as fname:
    text = fname.read()
    lst = list(set(text.split()))
    print lst
>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']
Use word instead of words (also simplified the loop):
fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print lst
Or, alternatively, use [word] with the + operator:
fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
lst.sort()
print lst
import string

# open the input for reading and the output for writing; both are closed automatically
with open("romeo.txt") as file, open('romeo_unique.txt', 'w') as uniquewords:
    lst = []
    for line in file:
        words = line.split()
        for word in words:    # loops through all words
            # strip punctuation and lowercase the word (str.maketrans requires Python 3)
            word = word.translate(str.maketrans('', '', string.punctuation)).lower()
            if word not in lst:
                lst.append(word)    # append only this unique word to the list
                uniquewords.write(word + '\n')    # write the unique word to the file
You need to change
lst = lst + words
to
lst.append(word)
If you want unique words, you need to add word and not words (which is all the words in the line) to the list.