Add unique words from a text file to a list in Python

Suppose I have the following text file:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

I want to add all unique words in this file to a list:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst

But the output of the program is as follows:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 
'window', 'with', 'yonder']

'and' and a few other words appear multiple times in the list. Which part of the loop should I change so that I don't have any duplicate words? Thanks!

Here are the problems with your code; a corrected version follows:

fname = open("romeo.txt")      # better to open files in a `with` statement
lst = list()                   # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()       # not required, `split()` will do this anyway
    words = line.split(' ')    # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words      # this is the reason that you end up with duplicates... words is the list of all words for this line!
    lst.sort()                 # don't sort in the for loop, just once afterwards.
print lst

So it almost works; however, you should append only the current word to the list, not all of the words that you got from the line with split(). If you simply change the line:

lst = lst + words

to

lst.append(word)

it will work.
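To see why this one-line change matters, here is a minimal sketch that runs both variants on two lines of the sample text held in memory. The names buggy, fixed and lines are made up for illustration and are not part of the original code.

```python
# Hypothetical in-memory stand-in for two lines of romeo.txt
lines = ["It is the east and Juliet is the sun", "Arise fair sun"]

def buggy(lines):
    lst = []
    for line in lines:
        words = line.split()
        for word in words:
            if word in lst:
                continue
            lst = lst + words   # adds every word on the line, repeats included
    return lst

def fixed(lines):
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)  # adds only the current word
    return lst

print(buggy(lines))   # 'is', 'the' and 'sun' each appear more than once
print(fixed(lines))   # every word appears exactly once
```

In the buggy variant, the first unseen word on a line causes the whole line to be added, so any word repeated within that line, or already seen on an earlier line, ends up in the list again.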

Here is a corrected version:

with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)

As others have suggested, a set is a good way to handle this. This is about as simple as it gets:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

Using sorted(), you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, just do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
    print(unique_words)

Reading the entire file into memory may not be viable for large files. You can use a generator to read the file efficiently without cluttering up the main code. This generator reads the file one line at a time and yields one word at a time. It will not read the entire file in one go, unless the file consists of one long line (which your sample data clearly doesn't):

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
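To try the generator without a file on disk, one option is to feed it an in-memory text stream. This is only a sketch: the sample string is a shortened, made-up excerpt, and io.StringIO stands in for the real file object because it can also be iterated line by line.

```python
import io

def get_words(f):
    # Yield one word at a time, reading the stream line by line
    for line in f:
        for word in line.split():
            yield word

# Hypothetical two-line excerpt standing in for romeo.txt
sample = io.StringIO("But soft what light\nIt is the east and Juliet is the sun\n")
unique_words = sorted(set(get_words(sample)))
print(unique_words)
```

Note that sorted() orders capitalised words before lowercase ones, as in the output shown in the question.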

This is much easier in Python using sets:

with open("romeo.txt") as f:
    unique_words = set(f.read().split())

If you want to have a list, convert it afterwards:

unique_words = list(unique_words)

It might be nice to have them in alphabetical order:

unique_words.sort()

There are a few ways to achieve what you want.
1) Using lists:

with open("romeo.txt") as fname:
    lst = []
    for word in fname.read().split():  # split() with no argument splits on all whitespace, including ' ' and '\n'
        if word not in lst:
            lst.append(word)
lst.sort()
print(lst)

2) Using sets:

with open("romeo.txt") as fname:
    lst = list(set(fname.read().split()))
lst.sort()
print(lst)

A set simply ignores duplicates, so the membership check is unnecessary.

If you want to get a set of unique words, you had better use a set, not a list, since the check word in lst scans the whole list each time and may be highly inefficient.
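If the order of first appearance matters, one possible compromise is to keep a set for the membership test and a list for the order. This is a sketch only; the helper name unique_in_order is made up for illustration.

```python
def unique_in_order(words):
    seen = set()      # fast membership test
    result = []       # preserves order of first appearance
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(unique_in_order("the sun and the moon and".split()))
```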

For counting words, you had better use a Counter object (collections.Counter).
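A minimal sketch of the Counter approach, assuming the text has already been read into a string (the sample string below is a made-up one-line excerpt, not the whole file):

```python
from collections import Counter

# Hypothetical sample text; in practice this would come from infile.read()
text = "It is the east and Juliet is the sun"
counts = Counter(text.split())

print(counts["is"])      # how many times 'is' occurs
print(sorted(counts))    # the keys are the unique words
```

The keys of the Counter are the unique words, so the same object answers both "which words" and "how often".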

I would do:

with open('romeo.txt') as fname:
    text = fname.read()
    lst = list(set(text.split()))
    print(lst)


>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']

Use word instead of words (also simplified the loop):

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
    lst.sort()
print(lst)

Or alternatively, use [word] with the + operator:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
    lst.sort()
print(lst)

import string

with open("romeo.txt") as infile, open("romeo_unique.txt", "w") as uniquewords:
    lst = []
    for line in infile:
        for word in line.split():  # loops through all words
            # strip punctuation and lowercase the word before comparing
            word = word.translate(str.maketrans('', '', string.punctuation)).lower()
            if word not in lst:
                lst.append(word)                # append only this unique word to the list
                uniquewords.write(word + '\n')  # write the unique word to the file

You need to change

lst = lst + words

to

lst.append(word)

If you want unique words, you need to add word, and not words (which is all the words in the line), to the list.
