
Add unique words from a text file to a list in Python

Suppose I have the following text file:

 But soft what light through yonder window breaks It is the east and Juliet is the sun Arise fair sun and kill the envious moon Who is already sick and pale with grief

I want to add all the unique words in this file to a list:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst

But the program's output is as follows:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 
'window', 'with', 'yonder']

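"and" and several other words appear multiple times in the list. Which part of the loop should I change so that I don't get any duplicate words? Thanks!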

Here are the problems with your code, noted in the comments; a corrected version follows:

fname = open("romeo.txt")      # better to open files in a `with` statement
lst = list()                   # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()       # not required, `split()` will do this anyway
    words = line.split(' ')    # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words      # this is the reason that you end up with duplicates... words is the list of all words for this line!
    lst.sort()                 # don't sort in the for loop, just once afterwards.
print lst

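So it almost works. However, you should append only the current word to the list, not all of the words obtained from calling split() on the line. If you simply changed the line:

lst = lst + words

to:

lst.append(word)

it would work.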
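To see why `lst = lst + words` produces duplicates while `lst.append(word)` does not, here is a minimal sketch that runs both loops on a single in-memory line instead of a file. The sample sentence and the names `buggy` and `fixed` are chosen only for illustration.

```python
# One line of text, already split into words (illustrative sample)
words = "It is the east and Juliet is the sun".split()

# Buggy loop: the first unseen word causes the WHOLE word list
# for the line to be concatenated, duplicates included.
buggy = []
for word in words:
    if word in buggy:
        continue
    buggy = buggy + words

# Fixed loop: only the current word is appended.
fixed = []
for word in words:
    if word not in fixed:
        fixed.append(word)

print(buggy)  # still contains 'is' and 'the' twice
print(fixed)  # each word appears once, in first-seen order
```

After the first iteration of the buggy loop every later word is already in the list, so the `continue` branch is taken for the rest of the line and the duplicates stay in place.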

Here is a corrected version:

with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)

As others have suggested, a set is a good way to handle this. It is as simple as this:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

With sorted() you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
    print(unique_words)

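For large files, reading the whole file into memory may not be feasible. You can use a generator to read the file efficiently without cluttering the main code. This generator reads the file one line at a time and yields one word at a time. It will not read the whole file in one go unless the file consists of a single long line (which your sample data clearly does not):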

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
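The generator can also be tried without a file on disk. The sketch below repeats `get_words` and feeds it an `io.StringIO` object, which iterates line by line just like an open text file. The sample text is made up for illustration.

```python
import io

def get_words(f):
    # Yield one word at a time, reading the input one line at a time
    for line in f:
        for word in line.split():
            yield word

# In-memory stand-in for a text file (illustrative sample, not romeo.txt)
sample = io.StringIO("Arise fair sun\nand kill the envious moon\nfair moon\n")

unique_words = sorted(set(get_words(sample)))
print(unique_words)
```

Note that `sorted()` compares strings by code point, so capitalized words such as 'Arise' sort before all lowercase words.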

This is much easier in Python using a set:

with open("romeo.txt") as f:
    unique_words = set(f.read().split())

If you want a list, convert it afterwards:

unique_words = list(unique_words)

It might be nice to have them in alphabetical order:

unique_words.sort()

There are several ways to achieve what you want.
1) Using a list:

fname = open("romeo.txt")
lst = list()
for word in fname.read().split(): # This splits on all whitespace, meaning it splits on both ' ' and '\n'
    if word not in lst:
        lst.append(word)
lst.sort()
print(lst)

2) Using a set:

fname = open("romeo.txt")
lst = list(set(fname.read().split()))
lst.sort()
print(lst)

A set simply ignores duplicates, so no membership check is needed.

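If you want a collection of unique words, you are better off using a set rather than a list, because `in lst` can be very inefficient: it scans the whole list on every check.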
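If the first-seen order of the words matters, one possible compromise is to keep a list for the result and a set purely for the membership test. This is only a sketch; the helper name `unique_in_order` is hypothetical and not taken from any of the answers.

```python
def unique_in_order(words):
    """Return the words with duplicates removed, keeping first-seen order."""
    seen = set()   # membership checks on a set take constant time on average
    result = []    # the list preserves the order in which words first appear
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(unique_in_order("the sun and the moon and the sun".split()))
```

The list-only version performs a linear scan for every word, so the set-backed check tends to pay off as the number of distinct words grows.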

For counting words, you are better off using a Counter object.
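As a minimal sketch of that suggestion, `collections.Counter` from the standard library counts how often each word occurs, and its keys are the unique words. The sample sentence is chosen only for illustration.

```python
from collections import Counter

# Illustrative sample text rather than the contents of romeo.txt
text = "It is the east and Juliet is the sun"

counts = Counter(text.split())   # maps each word to its number of occurrences

print(counts["is"])              # how many times 'is' occurs
print(sorted(counts))            # the unique words, sorted
```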

I would do it like this:

with open('romeo.txt') as fname:
    text = fname.read()
    lst = list(set(text.split()))
    print(lst)


>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']

Use word instead of words (this also simplifies the loop):

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
    lst.sort()
print(lst)

Or, alternatively, use [word] with the + operator:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
    lst.sort()
print(lst)

import string

# Translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)

# Opening both files in one `with` statement ensures the output file is closed too
with open("romeo.txt") as infile, open('romeo_unique.txt', 'w') as uniquewords:
    lst = []
    for line in infile:
        for word in line.split():  # loop through all words on the line
            word = word.translate(table).lower()  # strip punctuation and lowercase
            if word not in lst:
                lst.append(word)                # append only this unique word to the list
                uniquewords.write(word + '\n')  # write the unique word to the file
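The normalization step in the code above can be tried on its own. This sketch wraps the same `str.maketrans` / `translate` / `lower` combination in a helper; the name `normalize` and the sample sentence are hypothetical, added only for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

def normalize(word):
    """Strip ASCII punctuation from a word and lowercase it."""
    return word.translate(table).lower()

sample = "The sun, the Sun! Juliet is the sun."
unique_words = sorted({normalize(w) for w in sample.split()})
print(unique_words)
```

One caveat: a token made up entirely of punctuation (for example "--") normalizes to an empty string, so it may be worth skipping empty results before storing them.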

You need to change

lst = lst + words to lst.append(word)

If you want unique words, you need to add word to the list rather than words (which is all of the words on the line).
