Add unique words from a text file to a list in python
Suppose I have the following text file:
But soft what light through yonder window breaks It is the east and Juliet is the sun Arise fair sun and kill the envious moon Who is already sick and pale with grief
I want to add all unique words in this file to a list:
fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst
but the output of the program is as follows:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and',
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief',
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what',
'window', 'with', 'yonder']
'and' and a few other words appear multiple times in the list. Which part of the loop should I change so that I don't end up with any duplicate words? Thanks!
Here are the problems with your code; a corrected version follows:
fname = open("romeo.txt")    # better to open files in a `with` statement
lst = list()                 # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()     # not required, `split()` will do this anyway
    words = line.split(' ')  # don't specify a delimiter, `line.split()` will split on all whitespace
    for word in words:
        if word in lst: continue
        lst = lst + words    # this is the reason that you end up with duplicates... `words` is the list of all words for this line!
    lst.sort()               # don't sort in the for loop, just once afterwards
print lst
So it almost works; however, you should be appending only the current word to the list, not all of the words that you got from the line with split(). If you simply changed the line:

lst = lst + words

to

lst.append(word)

it would work.
Here is a corrected version:
with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)  # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)
As others have suggested, a set is a good way to handle this. This is about as simple as it gets:
with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))
Using sorted() you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, just do this:
with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
print(unique_words)
Reading the entire file into memory may not be viable for large files. You can use a generator to read the file efficiently without cluttering up the main code. This generator reads the file one line at a time and yields one word at a time. It will not read the entire file in one go, unless the file consists of one long line (which your sample data clearly doesn't):
def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
This is much easier in Python using sets:
with open("romeo.txt") as f:
    unique_words = set(f.read().split())
If you want to have a list, convert it afterwards:
unique_words = list(unique_words)
It might be nice to have them in alphabetical order:
unique_words.sort()
There are a few ways to achieve what you want.
1) Using lists:
fname = open("romeo.txt")
lst = list()
for word in fname.read().split():  # this splits on all whitespace, meaning it will split on both ' ' and '\n'
    if word not in lst:
        lst.append(word)
lst.sort()
print lst
2) Using sets:
fname = open("romeo.txt")
lst = list(set(fname.read().split()))
lst.sort()
print lst
A set simply ignores the duplicates, so the membership check is unnecessary.
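As a small illustration (using a made-up sentence rather than the file), building a set from a list of words drops the duplicates without any explicit check:

```python
# A set keeps only one copy of each element, so duplicates in the
# input list are silently dropped.
words = "the sun and the moon and the stars".split()
unique = sorted(set(words))
print(unique)  # ['and', 'moon', 'stars', 'sun', 'the']
```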
If you want to get a set of unique words, you are better off using a set, not a list, since the `in lst` check may be highly inefficient.
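A sketch of why: `word in lst` scans the list element by element (O(n) per check), while membership on a set is a hash lookup (roughly O(1)). If you do need a list in first-seen order, a common pattern is to keep a helper set alongside it (the names here are illustrative):

```python
# `word in lst` is O(n) per check; `word in seen` on a set is ~O(1).
words = ["sun", "moon", "sun", "grief", "moon", "sun"]

seen = set()   # for fast membership tests
unique = []    # preserves first-seen order
for word in words:
    if word not in seen:
        seen.add(word)
        unique.append(word)
print(unique)  # ['sun', 'moon', 'grief']
```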
For counting words you are better off using a Counter object.
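For example, a minimal sketch with collections.Counter from the standard library, on a made-up sentence:

```python
from collections import Counter

text = "the sun and the moon and the stars"
counts = Counter(text.split())  # maps each word to its occurrence count
print(counts.most_common(2))    # [('the', 3), ('and', 2)]
```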
I would do:
with open('romeo.txt') as fname:
    text = fname.read()
lst = list(set(text.split()))
print lst
>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']
Use word instead of words (this also simplifies the loop):
fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print lst
or alternately use [word] with the + operator:
fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
lst.sort()
print lst
import string

with open("romeo.txt") as file:
    lst = []
    uniquewords = open('romeo_unique.txt', 'w')  # opens the output file
    for line in file:
        words = line.split()
        for word in words:  # loop through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).lower()  # strip punctuation, lowercase
            if word not in lst:
                lst.append(word)  # append only this unique word to the list
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
    uniquewords.close()  # close the output file so the writes are flushed
You need to change

lst = lst + words

to

lst.append(word)
If you want unique words you need to add word, and not words (which is all the words in the line), to the list.