Removing duplicates from a list of strings

I am trying to read a file, make a list of words, and then make a new list of words with the duplicates removed. I am not able to append the words to the new list. It says: 'NoneType' object has no attribute 'append'.

Here is the bit of code:

fh = open("gdgf.txt")
lst = list()

file = fh.read()
for line in fh:
    line = line.rstrip()

file = file.split()
for word in file:
    if word  in lst: 
        continue
    lst = lst.append(word)

print lst

In Python, append returns None. A set will help here to remove duplicates.

In [102]: mylist = ["aa","bb","cc","aa"]

In [103]: list(set(mylist))
Out[103]: ['aa', 'cc', 'bb']

Hope this helps

In your case

file = fh.read()

After this, fh is exhausted: the read position is at the end of the file, so iterating over fh yields nothing. You cannot use it again without rewinding; work with the variable file instead.
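To see the exhaustion in isolation, here is a minimal sketch. It assumes io.StringIO as a stand-in for the real file (so it runs without gdgf.txt), and the sample text is made up:

```python
import io

# io.StringIO stands in for open("gdgf.txt") here (an assumption for the demo);
# it supports read(), iteration and seek() like a text file object.
fh = io.StringIO("aa bb\ncc aa\n")

contents = fh.read()        # consumes everything up to the end of the file
leftover_lines = list(fh)   # nothing is left to iterate over

print(repr(contents))
print(leftover_lines)       # []

fh.seek(0)                  # rewinding makes the lines available again
rewound_lines = list(fh)
print(rewound_lines)
```

So either read the whole file once and work with the string, or iterate over the lines, but not both without a seek(0) in between.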

append adds an item in place, which means it does not return the list (it returns None). You should get rid of lst = when appending word:

if word in lst:
    continue
lst.append(word)
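A small sketch of what goes wrong, using made-up variable names (words, broken) rather than the ones from the question:

```python
# list.append mutates the list in place and returns None.
words = []
result = words.append("aa")

print(result)   # None
print(words)    # ['aa']

# Rebinding the name to the return value throws the list away:
broken = []
broken = broken.append("aa")   # broken is now None, not a list

try:
    broken.append("bb")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second append fails because broken no longer refers to a list, which is the error reported in the question.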

You are replacing your list with the return value of the append function, which is not a list. Simply do this instead:

lst.append(word)

list.append() appends in place and returns None, so you do not need to assign its return value back to the list. Just change the line lst = lst.append(word) to:

lst.append(word)

Another issue: you first call .read() on the file and then iterate over its lines. You do not need to do both; .read() has already consumed the file, so just remove the loop.


Also, an easy way to remove duplicates, if you do not care about the order of the elements, is to use a set.

Example -

>>> lst = [1,2,3,4,1,1,2,3]
>>> set(lst)
{1, 2, 3, 4}

So, in your case, you can initialize lst as lst = set() and then add elements with lst.add(word); you do not even need to check whether the element already exists. At the end, if you really want the result as a list, use list(lst) to convert it. (If you do this, consider renaming the variable so it is clear that it is a set, not a list.)
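A minimal sketch of that approach. The sample_text string is a made-up stand-in for the file contents, and unique_words is just an illustrative name:

```python
# sample_text is a hypothetical stand-in for fh.read()
sample_text = "aa bb cc aa\nbb dd"

unique_words = set()
for word in sample_text.split():
    unique_words.add(word)   # adding an element that is already present is a no-op

# sorted() returns a list; sorting also gives a predictable order,
# since a set by itself has no defined order.
word_list = sorted(unique_words)
print(word_list)   # ['aa', 'bb', 'cc', 'dd']
```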

append modifies the list it was called on and returns None. That is, you should replace the line:

lst=lst.append(word)

with simply

lst.append(word)

fh = open("gdgf.txt")
# read() returns the whole file as one string, so no line loop is needed
file = fh.read()
fh.close()

lst = []
for word in file.split():
    lst.append(word)
# converting to a set drops the duplicates
print(set(lst))

append() does not return anything, so don't assign it. lst.append() is enough.

Modified Code:

fh = open("gdgf.txt")
lst = []

# read() returns the whole file as one string; split() breaks it on whitespace
file = fh.read()
fh.close()
file = file.split()

for word in file:
    if word in lst:
        continue
    lst.append(word)

print(lst)

I suggest you use set(), because it is meant for unordered collections of unique elements.

fh = open("gdgf.txt")

file = fh.read()
fh.close()
file = file.split()

# build the set from the words that were read, then convert back to a list
lst = list(set(file))

print(lst)

You can simplify your code by reading and adding the words directly to a set. Sets do not allow duplicates, so you'll be left with just the unique words:

words = set()

with open('gdgf.txt') as f:
    for line in f:
        # split() breaks the line into words; strip() would only trim the ends
        for word in line.split():
            words.add(word)

print(words)

The problem with the logic above is that words that end in punctuation will be counted as separate words:

>>> s = "Hello? Hello should only be twice in the list"
>>> set(s.split())
set(['be', 'twice', 'list', 'should', 'Hello?', 'only', 'in', 'the', 'Hello'])

You can see you have Hello? and Hello .

You can enhance the code above by using a regular expression to extract words, which will take care of the punctuation:

>>> set(re.findall(r"(\w[\w']*\w|\w)", s))
set(['be', 'list', 'should', 'twice', 'only', 'in', 'the', 'Hello'])

Now your code is:

import re

with open('gdgf.txt') as f:
   words = set(re.findall(r"(\w[\w']*\w|\w)", f.read(), re.M))

print(words)

Even with the above, you'll have duplicates as Word and word will be counted twice. You can enhance it further if you want to store a single version of each word.
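One way to store a single version of each word is to lowercase every match before it goes into the set. A minimal sketch, assuming the made-up sample sentence below and the same regular expression as above (without the capturing group, which findall does not need here):

```python
import re

# sample is a hypothetical input; with a file you would use f.read() instead
sample = "Hello? hello should only be once in the list"

# lower() folds "Hello" and "hello" into the same entry
words = {match.lower() for match in re.findall(r"\w[\w']*\w|\w", sample)}

print(sorted(words))
```

Lowercasing discards the original capitalization, so this only fits if you do not need to distinguish, say, proper nouns from ordinary words.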

I think the solution to this problem can be more succinct:

import string

with open("gdgf.txt") as fh:
    word_set = set()
    for line in fh:
        line = line.split()
        for word in line:
            # For each character in string.punctuation, iterate and remove
            # from the word by replacing with '', an empty string
            for char in string.punctuation:
                word = word.replace(char, '')
            # Add the word to the set
            word_set.add(word)

word_list = list(word_set)
# Sort the list to be fastidious.
word_list.sort()
print(word_list)

One thing about counting words with split() is that you are splitting on whitespace, so this will make "words" out of things like "Hello!" and "Really?" The words will include punctuation, which is probably not what you want.

Your variable names could be a bit more descriptive, and your indentation seems a bit off, but I think that may be a matter of cutting and pasting into the post. I have tried to name the variables I used after the logical structure I am interacting with (file, line, word, char, and so on).

To see the contents of string.punctuation, you can launch IPython, import string, and then simply enter string.punctuation to see what it contains.

It is also unclear if you need to have a list, or if you just need a data structure that contains a unique list of words. A set or a list that has been properly created to avoid duplicates should do the trick. Following on with the question, I used a set to uniquely store elements, then converted that set to a list trivially, and later sorted this alphabetically.
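If the original order of the words matters, a set followed by a sort will not preserve it. One possible alternative, sketched here under the assumption of Python 3.7 or later (where dict keeps insertion order), is dict.fromkeys; the sample list is made up:

```python
# words is a hypothetical input list containing duplicates
words = ["aa", "bb", "cc", "aa", "bb"]

# dict keys are unique and, on Python 3.7+, keep first-seen order
unique_in_order = list(dict.fromkeys(words))

print(unique_in_order)   # ['aa', 'bb', 'cc']
```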

Hope this helps!

