Add unique words from a text file to a list in Python

Suppose I have the following text file:

 But soft what light through yonder window breaks It is the east and Juliet is the sun Arise fair sun and kill the envious moon Who is already sick and pale with grief

I want to add all unique words in this file to a list:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst

But the output of the program is as follows:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 
'window', 'with', 'yonder']

'and' and a few other words appear multiple times in the list. Which part of the loop should I change, so that I don't have any duplicate words? Thanks!

Here are the problems with your code, followed by a corrected version:

fname = open("romeo.txt")      # better to open files in a `with` statement
lst = list()                   # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()       # not required, `split()` will do this anyway
    words = line.split(' ')    # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words      # this is the reason that you end up with duplicates... words is the list of all words for this line!
    lst.sort()                 # don't sort in the for loop, just once afterwards.
print lst

So it almost works. However, you should append only the current word to the list, not all of the words that you got from the line with split(). If you simply change the line:

lst = lst + words

to

lst.append(word)

it will work.
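To see why the original line produces duplicates, here is a minimal sketch that runs both variants on two lines of the sample text held in memory. The function names buggy_unique and fixed_unique are made up for this illustration and are not part of the question's code.

```python
# Two lines from the sample text, kept in memory so no file is needed.
lines = [
    "It is the east and Juliet is the sun",
    "Arise fair sun and kill the envious moon",
]

def buggy_unique(lines):
    """Mirrors the question's loop: adds the whole line's words at once."""
    lst = []
    for line in lines:
        words = line.split()
        for word in words:
            if word in lst:
                continue
            lst = lst + words   # bug: extends lst with every word on the line
    return sorted(lst)

def fixed_unique(lines):
    """Appends only the current word when it has not been seen yet."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)
    return sorted(lst)

print(buggy_unique(lines).count("the"))   # 3
print(fixed_unique(lines).count("the"))   # 1
```

In the buggy variant the first new word on each line triggers the addition of the entire line, so words repeated within a line or across lines are all kept.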

Here is a corrected version:

with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)

As others have suggested, a set is a good way to handle this. This is about as simple as it gets:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

Using sorted() you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, just do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
    print(unique_words)

Reading the entire file into memory may not be viable for large files. You can use a generator to efficiently read the file without cluttering up the main code. This generator will read the file one line at a time and it will yield one word at a time. It will not read the entire file in one go, unless the file consists of one long line (which your sample data clearly doesn't):

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
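As a rough check that the generator really hands out words lazily, the sketch below feeds it an io.StringIO object in place of an open file. The StringIO stand-in and its two-line content are assumptions made only for this demo.

```python
import io

def get_words(f):
    """Yield words one at a time, reading the file-like object line by line."""
    for line in f:
        for word in line.split():
            yield word

# io.StringIO stands in for an open text file (an assumption for this demo).
fake_file = io.StringIO("Arise fair sun\nand kill the envious moon\n")

words = get_words(fake_file)
first = next(words)                     # pulls a single word from the generator
remaining_unique = sorted(set(words))   # consumes whatever is left

print(first)             # Arise
print(remaining_unique)  # ['and', 'envious', 'fair', 'kill', 'moon', 'sun', 'the']
```

Because set() accepts any iterable, the generator can be passed to it directly without building an intermediate list of all words.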

This is much easier in Python using sets:

with open("romeo.txt") as f:
    unique_words = set(f.read().split())

If you want to have a list, convert it afterwards:

unique_words = list(unique_words)

It might be nice to have them in alphabetical order:

unique_words.sort()
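One thing to be aware of: the default sort compares strings by code point, so capitalized words come before lowercase ones, as in the question's output. If a case-insensitive ordering is wanted, passing key=str.lower is one option. The small word set below is only an illustration.

```python
# A few words from the sample text, mixed case on purpose.
words = {"sun", "Arise", "and", "But"}

default_order = sorted(words)                 # uppercase letters sort first
folded_order = sorted(words, key=str.lower)   # compare words as if lowercase

print(default_order)  # ['Arise', 'But', 'and', 'sun']
print(folded_order)   # ['and', 'Arise', 'But', 'sun']
```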

There are a few ways to achieve what you want.
1) Using lists:

fname = open("romeo.txt")
lst = list()
for word in fname.read().split(): # This splits on all whitespace, meaning that it will split on ' ' and '\n'
    if word not in lst:
        lst.append(word)
lst.sort()
print lst

2) Using sets:

fname = open("romeo.txt")
lst = list(set(fname.read().split()))
lst.sort()
print lst

A set simply ignores duplicates, so the check is unnecessary.

If you want to get the unique words, you had better use a set, not a list, since the membership test `word in lst` scans the whole list each time and may be highly inefficient.
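If the order of first appearance matters, a set can still do the membership checks while a list keeps the order. This is only a sketch, and the helper name unique_in_order is invented for the example.

```python
def unique_in_order(words):
    """Return the words without duplicates, in order of first appearance."""
    seen = set()      # fast membership checks
    result = []       # preserves the original order
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

text = "It is the east and Juliet is the sun"
print(unique_in_order(text.split()))
# ['It', 'is', 'the', 'east', 'and', 'Juliet', 'sun']
```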

For counting words, you had better use a collections.Counter object.
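A short sketch of the Counter approach, using the sample text from the question as an in-memory string rather than reading romeo.txt:

```python
from collections import Counter

# The sample text from the question, held in memory for the demo.
text = (
    "But soft what light through yonder window breaks "
    "It is the east and Juliet is the sun "
    "Arise fair sun and kill the envious moon "
    "Who is already sick and pale with grief"
)

counts = Counter(text.split())   # maps each word to its number of occurrences

print(counts["and"])   # 3
print(counts["sun"])   # 2
print(len(counts))     # 26 distinct words
```

The keys of the Counter are the unique words, so sorted(counts) also gives the sorted list the question asks for.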

I would do:

with open('romeo.txt') as fname:
    text = fname.read()
    lst = list(set(text.split()))
    print lst


>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']

Use word instead of words (I also simplified the loop):

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
    lst.sort()
print lst

Or, alternatively, use [word] with the + operator:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
    lst.sort()
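print lst

To also strip punctuation, ignore case, and write the unique words to a file:

import string

# open both files in one `with` statement so each is closed automatically
with open("romeo.txt") as infile, open("romeo_unique.txt", "w") as uniquewords:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:  # loop through all words on this line
            # strip punctuation and lowercase the word before comparing
            word = word.translate(str.maketrans('', '', string.punctuation)).lower()
            if word and word not in lst:        # skip tokens that were only punctuation
                lst.append(word)                # append only this unique word to the list
                uniquewords.write(word + '\n')  # write the unique word to the file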

You need to change

lst = lst + words to lst.append(word)

If you want unique words, you need to add word, not words (which is all the words in the line), to the list.
