Add unique words from a text file to a list in Python

Suppose I have the following text file:

 But soft what light through yonder window breaks It is the east and Juliet is the sun Arise fair sun and kill the envious moon Who is already sick and pale with grief

I want to add all unique words from this file to a list:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst

But the program's output is as follows:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 
'window', 'with', 'yonder']

'and' and a few other words appear multiple times in the list. Which part of the loop should I change so that I don't get any duplicate words? Thanks!

Here are the problems with your code, annotated inline; a corrected version follows further down:

fname = open("romeo.txt")      # better to open files in a `with` statement
lst = list()                   # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()       # not required, `split()` will do this anyway
    words = line.split(' ')    # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words      # this is the reason that you end up with duplicates... words is the list of all words for this line!
    lst.sort()                 # don't sort in the for loop, just once afterwards.
print lst

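So it nearly works. However, you should append only the current word to the list, not all of the words you got from split()-ing the line. If you simply change the line:

lst = lst + words

to

lst.append(word)

it will work.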
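To see why that one-line change matters, here is a minimal sketch that runs both variants on two lines of the sample text held in memory. The function names `collect_buggy` and `collect_fixed` and the `sample_lines` list are made up for this illustration and are not part of the original code.

```python
def collect_buggy(lines):
    """Mirror of the question's loop: concatenates the whole line's word list."""
    lst = []
    for line in lines:
        words = line.split()
        for word in words:
            if word in lst:
                continue
            lst = lst + words  # adds every word on the line, repeats included
    return sorted(lst)


def collect_fixed(lines):
    """Same loop, but appends only the current word."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)
    return sorted(lst)


# Assumption: two lines of the sample text stand in for romeo.txt
sample_lines = [
    "It is the east and Juliet is the sun",
    "Arise fair sun and kill the envious moon",
]

buggy = collect_buggy(sample_lines)
fixed = collect_fixed(sample_lines)
print(buggy.count("the"), fixed.count("the"))
```

In the buggy variant, the first unseen word of each line drags the entire line into the list, so any word repeated within or across lines is stored more than once.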

Here is a corrected version:

with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)

As others have suggested, a set is a good way to handle this. It is as simple as:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

With sorted() you do not need to keep a reference to the list. If you do want to use the sorted list elsewhere, do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
    print(unique_words)

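For large files, reading the entire file into memory may not be feasible. You can use a generator to read the file efficiently without cluttering the main code. This generator reads the file one line at a time and yields one word at a time. It never reads the whole file at once unless the file consists of a single long line (which your sample data evidently does not):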

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
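Because the generator only iterates over its argument, it accepts any iterable of lines, not just a file opened from disk. As a rough sketch, the snippet below feeds it an `io.StringIO` object so it can be tried without romeo.txt; the `fake_file` name and its three lines of text are assumptions made for this example.

```python
import io


def get_words(f):
    """Yield one word at a time from any iterable of lines."""
    for line in f:
        for word in line.split():
            yield word


# Assumption: an in-memory stand-in for the opened romeo.txt
fake_file = io.StringIO("Arise fair sun\nand kill the envious moon\nthe sun\n")

unique_words = sorted(set(get_words(fake_file)))
print(unique_words)
```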

This is much easier with a set in Python:

with open("romeo.txt") as f:
     unique_words = set(f.read().split())

If you want a list, convert it afterwards:

unique_words = list(unique_words)

It might also be nice to have them in alphabetical order:

unique_words.sort()
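Note that sort() compares strings by code point, so capitalized words such as 'Arise' come before every lowercase word, as in the output shown in the question. If a case-insensitive order is preferred, one option is to pass a key function. A minimal sketch, with a small hand-written word set assumed in place of the file contents:

```python
# Assumption: a few words from the sample text, written out by hand
unique_words = {"sun", "Arise", "and", "It", "east", "But"}

default_order = sorted(unique_words)                 # uppercase letters sort first
ignoring_case = sorted(unique_words, key=str.lower)  # compare lowercased copies

print(default_order)
print(ignoring_case)
```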

There are a few ways to achieve what you want.
1) Using a list:

fname = open("romeo.txt")
lst = list()
for word in fname.read().split(): # This splits on all whitespace, meaning it will split on both ' ' and '\n'
    if word not in lst:
        lst.append(word)
lst.sort()
print lst

2) Using a set:

fname = open("romeo.txt")
lst = list(set(fname.read().split()))
lst.sort()
print lst

A set simply ignores duplicates, so no membership check is needed.

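If you want a collection of unique words, use a set rather than a list, because the in lst check scans the list and can become very inefficient as the list grows.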
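The cost comes from the list membership test walking the list on every check, whereas set membership is a hash lookup. If the first-seen order of the words matters, one common pattern keeps a set for the membership test alongside the list. A minimal sketch; the helper name `unique_in_order` and the sample `text` are made up for illustration:

```python
def unique_in_order(words):
    """Return the distinct words, keeping the order of first appearance."""
    seen = set()      # fast membership checks
    result = []       # preserves first-seen order
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Assumption: one line of the sample text stands in for the file contents
text = "It is the east and Juliet is the sun"
ordered = unique_in_order(text.split())
print(ordered)
```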

For counting words, you are better off using a Counter object.
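As a rough illustration of that suggestion, `collections.Counter` from the standard library can be built directly from the split words; its keys are the unique words and its values are the occurrence counts. The `text` string below is an assumed in-memory stand-in for the contents of romeo.txt.

```python
from collections import Counter

# Assumption: one line of the sample text stands in for the file contents
text = "It is the east and Juliet is the sun"

counts = Counter(text.split())   # maps each word to its number of occurrences
unique_words = sorted(counts)    # iterating a Counter yields its distinct keys

print(counts["is"])
print(unique_words)
```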

I would do it like this:

with open('romeo.txt') as fname:
    text = fname.read()
    lst = list(set(text.split()))
    print lst


>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']

Use word instead of words (this also simplifies the loop):

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
    lst.sort()
print lst

Or alternatively, use the + operator with [word]:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
    lst.sort()
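print lst

import string

# translation table that deletes every punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

# open both files in one `with` so the output file is closed (and flushed) too
with open("romeo.txt") as infile, open("romeo_unique.txt", "w") as uniquewords:
    lst = []
    for line in infile:
        for word in line.split():  # loop through all words on this line
            word = word.translate(strip_punct).lower()  # normalize: no punctuation, lowercase
            if word and word not in lst:  # skip tokens that were pure punctuation
                lst.append(word)                # append only this unique word to the list
                uniquewords.write(word + '\n')  # write the unique word to the file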

You need to change

lst = lst + words to lst.append(word)

If you want unique words, you need to add word to the list rather than words (which is all the words on the line).
