Right now I have a 2D list with three columns and numerous rows, where each column holds a different type of data. The first column is a user ID, the second column is a timestamp, and the third column is a URL. The list looks like this:
[[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304070, '2015:01:03', 'http:something2'],
[304070, '2015:01:03', 'http:something2'],
[304071, '2015:01:04', 'http:something2'],
[304071, '2015:01:05', 'http:something3'],
[304071, '2015:01:06', 'http:something3']]
As you can see, there are some duplicate URLs, regardless of userID and timestamp.
I need to extract those rows which contain unique URLs and put them into a new 2D list.
For example, the second, third, fourth, and fifth rows all have the same URL, regardless of user ID and timestamp. I only need the second row (the first one that appears) in my new 2D list. The first row has a unique URL, so I will also put it into my new list. The last two rows (sixth and seventh) share the same URL, and I only need the sixth row.
Therefore, my new list should look like this:
[[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304071, '2015:01:05', 'http:something3']]
I thought about using something like this:
for i in range(len(oldList)):
    if oldList[i][2] not in newList:
        newList.append(oldList[i])
but obviously this one does not work, because oldList[i][2] is a single element (a URL), while the not in newList test compares it against the entire rows of the 2D list, so it never matches. Code like this just creates an exact copy of oldList.
Alternatively, I could eliminate the rows with duplicate URLs in place, because a for loop plus an append on a 2D list with one million rows really would take a while.
A good way of going about this would be to use a set. Go through your list of lists one row at a time: if the row's URL is not already in the set, add it to the set and append the full row to your new list. If the URL is already in the set, skip the row and move on to the next one.
old_list = [[304070, "2015:01:01", 'http:something1'],
[304070, "2015:01:02", 'http:something2'],
[304070, "2015:01:03", 'http:something2'],
[304070, "2015:01:03", 'http:something2'],
[304071, "2015:01:04", 'http:something2'],
[304071, "2015:01:05", 'http:something3'],
[304071, "2015:01:06", 'http:something3']]
new_list = []
url_set = set()

for item in old_list:
    if item[2] not in url_set:   # first time this URL has been seen
        url_set.add(item[2])
        new_list.append(item)
>>> print(new_list)
[[304070, '2015:01:01', 'http:something1'], [304070, '2015:01:02', 'http:something2'], [304071, '2015:01:05', 'http:something3']]
>>> old_list = [[304070, "2015:01:01", 'http:something1'],
... [304070, "2015:01:02", 'http:something2'],
... [304070, "2015:01:03", 'http:something2'],
... [304070, "2015:01:03", 'http:something2'],
... [304071, "2015:01:04", 'http:something2'],
... [304071, "2015:01:05", 'http:something3'],
... [304071, "2015:01:06", 'http:something3']]
>>> temp_dict = {}
>>> for element in old_list:
... if element[2] not in temp_dict:
... temp_dict[element[2]] = [element[0], element[1], element[2]]
...
>>> temp_dict.values()
[[304070, '2015:01:01', 'http:something1'], [304070, '2015:01:02', 'http:something2'], [304071, '2015:01:05', 'http:something3']]
Note: I am assuming that the order of the URLs in the resulting list doesn't matter. In case it does matter, use collections.OrderedDict instead of a plain dict.
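For instance, a minimal sketch of the OrderedDict variant (on Python 3.7+ a plain dict also preserves insertion order, so OrderedDict is only needed on older versions):

```python
from collections import OrderedDict

old_list = [[304070, "2015:01:01", 'http:something1'],
            [304070, "2015:01:02", 'http:something2'],
            [304070, "2015:01:03", 'http:something2'],
            [304070, "2015:01:03", 'http:something2'],
            [304071, "2015:01:04", 'http:something2'],
            [304071, "2015:01:05", 'http:something3'],
            [304071, "2015:01:06", 'http:something3']]

# Keyed by URL, so only the first row for each URL is stored;
# OrderedDict remembers the order in which keys were inserted.
temp_dict = OrderedDict()
for element in old_list:
    if element[2] not in temp_dict:
        temp_dict[element[2]] = element

new_list = list(temp_dict.values())
# [[304070, '2015:01:01', 'http:something1'],
#  [304070, '2015:01:02', 'http:something2'],
#  [304071, '2015:01:05', 'http:something3']]
```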
You need to create a function which searches the list for an item with the given URL.
def hasUrl(rows, url):
    for item in rows:
        if item[2] == url:   # the URL is the third column
            return True
    return False
Then your new list creation algorithm should look like this.
newList = []
for i in range(len(oldList)):
    if not hasUrl(newList, oldList[i][2]):  # check if URL is already in the new list
        newList.append(oldList[i])
Also, there is no need to create a range: Python's for loop iterates over the values directly, so you can just write
for item in oldList:
    if not hasUrl(newList, item[2]):  # skip rows whose URL is already present
        newList.append(item)
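Putting the pieces together, a complete runnable sketch of this approach might look like:

```python
def hasUrl(rows, url):
    # Linear scan: True if any row's third column matches the URL.
    for item in rows:
        if item[2] == url:
            return True
    return False

oldList = [[304070, '2015:01:01', 'http:something1'],
           [304070, '2015:01:02', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304071, '2015:01:04', 'http:something2'],
           [304071, '2015:01:05', 'http:something3'],
           [304071, '2015:01:06', 'http:something3']]

newList = []
for item in oldList:
    if not hasUrl(newList, item[2]):
        newList.append(item)

print(newList)
```

Note that each hasUrl call rescans newList, so on very large inputs a set of seen URLs is faster than this linear search.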
my_list = [[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304070, '2015:01:03', 'http:something2'],
[304070, '2015:01:03', 'http:something2'],
[304071, '2015:01:04', 'http:something2'],
[304071, '2015:01:05', 'http:something3'],
[304071, '2015:01:06', 'http:something3']]
Pull out all of the URLs from the original list. Create a set from this list to get the unique URL values. Use a list comprehension to iterate through this set, calling index on the full URL list (urls) to locate the first occurrence of each URL. Lastly, use another list comprehension together with enumerate to select the rows at those index positions.
urls = [row[2] for row in my_list]
urls_unique = set(urls)
idx = [urls.index(url) for url in urls_unique]
my_shorter_list = [row for n, row in enumerate(my_list) if n in idx]
>>> my_shorter_list
[[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304071, '2015:01:05', 'http:something3']]
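One caveat: n in idx is a linear search over a list, so for a million rows it is worth building idx as a set instead, which makes each membership test O(1). A sketch of that tweak:

```python
my_list = [[304070, '2015:01:01', 'http:something1'],
           [304070, '2015:01:02', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304071, '2015:01:04', 'http:something2'],
           [304071, '2015:01:05', 'http:something3'],
           [304071, '2015:01:06', 'http:something3']]

urls = [row[2] for row in my_list]
# Index of the first occurrence of each unique URL, stored in a
# set so the membership test below is constant time per row.
idx = {urls.index(url) for url in set(urls)}
my_shorter_list = [row for n, row in enumerate(my_list) if n in idx]
```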