
Need to remove duplicates from a list. set() isn't working, and neither is a for-loop method

I am reading data from an Excel sheet using xlrd. The data I want is in two columns (the "IDs" and "locations" columns). Each column contains thousands of entries, most of which are exact duplicates. I am simply trying to create two lists that contain the unique entries from each column. Here is most of my code, along with an example of what it returns when I print one of the lists:

rawIDs = data.col_slice(colx=0,
                 start_rowx=0,
                 end_rowx=None) #getting all of column 1 in a list
IDs = []

for ID in rawIDs:
    if ID not in IDs:
        IDs.append(ID) #trying to create new list without duplicates, but it fails

rawlocations = data.col_slice(colx=1,
                     start_rowx=0,
                     end_rowx=None) #getting all of column 2 in a list

locations = []

for location in rawlocations:
    if location not in locations:
        locations.append(location) #same as before, also fails

print set(IDs) #even set() doesn't remove duplicates, it just prints "rawIDs"

No matter what I do, it always prints the original list, with all the duplicates remaining.

It goes without saying, but I have already looked at many similar Stack Overflow posts, and their solutions don't work for me.

Edit: I was wrong about one detail. I realized that

print set(IDs)

actually returns

"set([item, item, item...])" as the output. So it basically wraps the "rawIDs" output in "set()". This doesn't make sense to me either, though...

Also, here is an example screenshot:

[Image: an example screenshot]

THE SOLUTION:

It seems that col_slice returns cell objects rather than plain strings. Each cell carries its own metadata (such as the cell type), so every item in the lists was actually distinct, even though the text might be the same.

Modifying the for loops so that they add the string form of each item, rather than the item itself, solved my problem and yielded new lists with no duplicates.
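To illustrate why neither set() nor the "not in" check helped, here is a minimal sketch. FakeCell is a hypothetical stand-in written for this example, not xlrd's actual cell class; the assumption is only that the real cell objects, like this one, do not define their own equality, so Python compares them by identity.

```python
class FakeCell(object):
    """Hypothetical stand-in for a spreadsheet cell object (not xlrd's class)."""
    def __init__(self, value):
        self.value = value

# Two cells with the same text are still two separate objects.
raw = [FakeCell("A1"), FakeCell("A1"), FakeCell("B2")]

# No __eq__ or __hash__ is defined, so equality falls back to object identity
# and the set keeps every cell.
by_object = set(raw)

# Comparing the plain values instead removes the duplicates.
by_value = set(cell.value for cell in raw)

print(len(by_object))    # 3
print(sorted(by_value))  # ['A1', 'B2']
```

Anything that turns each cell into a plain, comparable value before the membership test should behave like the second set here, which is what the str() call in the fixed loops does.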

rawIDs = data.col_slice(colx=0,
                     start_rowx=5000,
                     end_rowx=5050)

IDs = []

for ID in rawIDs:
    if str(ID) not in IDs:
        IDs.append(str(ID))

rawlocations = data.col_slice(colx=1,
                     start_rowx=0,
                     end_rowx=None)

locations = []

for location in rawlocations:
    if str(location) not in locations:
        locations.append(str(location))

print IDs #it prints a list with no duplicates!
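With thousands of entries, the "not in" test against a list rescans the list for every item. A possible alternative, sketched below, tracks what has been seen in a set while keeping the first-seen order. The helper name unique_in_order and its key parameter are made up for this example; they are not part of xlrd.

```python
def unique_in_order(items, key=lambda item: item):
    """Return the distinct key(item) values, in first-seen order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)          # the comparable value for this item
        if k not in seen:      # set lookup instead of scanning a list
            seen.add(k)
            result.append(k)
    return result

# Plain strings stand in for the column contents here.
ids = unique_in_order(["id7", "id3", "id7", "id3", "id9"])
print(ids)  # ['id7', 'id3', 'id9']
```

With xlrd cells, one might pass key=str to match the loops above, or key=lambda cell: cell.value to keep only the cell contents. As far as I know, str() on an xlrd cell also includes the cell type in the resulting text, so the value attribute (or reading the column with col_values instead of col_slice) may give cleaner results, but check that against your own workbook.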
