extract unique elements from a 2D python list and put them into a new 2D list
Right now I have a 2D list with three columns and numerous rows; each column contains a distinct kind of value. The first column is a UserID, the second column is a timestamp, and the third column is a URL. The list looks like this:
[[304070, '2015:01:01', 'http:something1'],
 [304070, '2015:01:02', 'http:something2'],
 [304070, '2015:01:03', 'http:something2'],
 [304070, '2015:01:03', 'http:something2'],
 [304071, '2015:01:04', 'http:something2'],
 [304071, '2015:01:05', 'http:something3'],
 [304071, '2015:01:06', 'http:something3']]
As you can see, some URLs are duplicated, regardless of UserID and timestamp. I need to extract the rows that contain unique URLs and put them into a new 2D list. For example, the second, third, fourth, and fifth rows all have the same URL regardless of UserID and timestamp; I only need the second row (the first one in which that URL appears) in my new 2D list. Likewise, the first row has a unique URL, so it also goes into my new list. The last two rows (sixth and seventh) have the same URL, and I only need the sixth row.
Therefore, my new list should look like this:
[[304070, '2015:01:01', 'http:something1'],
 [304070, '2015:01:02', 'http:something2'],
 [304071, '2015:01:05', 'http:something3']]
I thought about using something like this:
for i in range(len(oldList)):
    if oldList[i][2] not in newList:
        newList.append(oldList[i])
but obviously this does not work, because oldList[i][2] is a single element, while not in newList checks it against the entire 2D list, i.e., compares it to every whole row. Code like this will just create an exact copy of oldList.
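To see why the membership test fails, here is a minimal demonstration: `not in newList` compares the URL string against whole rows, never against the third column:

```python
newList = [[304070, '2015:01:02', 'http:something2']]

# The string is compared against each whole row, so this is always False:
print('http:something2' in newList)                          # False

# To test against the URL column, compare row[2] explicitly:
print(any(row[2] == 'http:something2' for row in newList))   # True
```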
Alternatively, I could just eliminate the rows having duplicate URLs in place, because using a for loop plus append on a 2D list with one million rows really would take a while.
A good way of going about this would be to use a set. Go through your list of lists one at a time, adding the URL to the set if it's not already there, and adding the full row containing that URL to your new list. If a URL is already in the set, discard the current row and move on to the next one.
old_list = [[304070, "2015:01:01", 'http:something1'],
            [304070, "2015:01:02", 'http:something2'],
            [304070, "2015:01:03", 'http:something2'],
            [304070, "2015:01:03", 'http:something2'],
            [304071, "2015:01:04", 'http:something2'],
            [304071, "2015:01:05", 'http:something3'],
            [304071, "2015:01:06", 'http:something3']]
new_list = []
url_set = set()
for item in old_list:
    if item[2] not in url_set:
        url_set.add(item[2])
        new_list.append(item)
>>> print(new_list)
[[304070, '2015:01:01', 'http:something1'], [304070, '2015:01:02', 'http:something2'], [304071, '2015:01:05', 'http:something3']]
>>> old_list = [[304070, "2015:01:01", 'http:something1'],
... [304070, "2015:01:02", 'http:something2'],
... [304070, "2015:01:03", 'http:something2'],
... [304070, "2015:01:03", 'http:something2'],
... [304071, "2015:01:04", 'http:something2'],
... [304071, "2015:01:05", 'http:something3'],
... [304071, "2015:01:06", 'http:something3']]
>>> temp_dict = {}
>>> for element in old_list:
...     if element[2] not in temp_dict:
...         temp_dict[element[2]] = [element[0], element[1], element[2]]
...
>>> temp_dict.values()
[[304070, '2015:01:01', 'http:something1'], [304070, '2015:01:02', 'http:something2'], [304071, '2015:01:05', 'http:something3']]
Note: I am assuming that the order of different URLs in the list doesn't matter. In case it does matter, use OrderedDict instead of the default dict.
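For example, a sketch of the order-preserving variant with OrderedDict, keying by URL and keeping only the first row seen (dict.setdefault only stores a value when the key is absent):

```python
from collections import OrderedDict

old_list = [[304070, "2015:01:01", 'http:something1'],
            [304070, "2015:01:02", 'http:something2'],
            [304070, "2015:01:03", 'http:something2'],
            [304070, "2015:01:03", 'http:something2'],
            [304071, "2015:01:04", 'http:something2'],
            [304071, "2015:01:05", 'http:something3'],
            [304071, "2015:01:06", 'http:something3']]

# Key by URL; setdefault keeps only the first row seen for each URL,
# and OrderedDict preserves the order in which URLs first appeared.
by_url = OrderedDict()
for row in old_list:
    by_url.setdefault(row[2], row)

new_list = list(by_url.values())
print(new_list)
# [[304070, '2015:01:01', 'http:something1'],
#  [304070, '2015:01:02', 'http:something2'],
#  [304071, '2015:01:05', 'http:something3']]
```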
You need to create a function which searches a list for an item with the given URL.
def hasUrl(rows, url):
    for item in rows:
        if item[2] == url:  # the URL is the third column
            return True
    return False
Then your new-list creation algorithm should look like this.
for i in range(len(oldList)):
    if not hasUrl(newList, oldList[i][2]):  # check if url is already in newList
        newList.append(oldList[i])
Also, there is no need to create a range. Python's for loop iterates over the values directly, so you can write just:
for item in oldList:
    if not hasUrl(newList, item[2]):  # check if url is not already in newList
        newList.append(item)
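Putting it together as a runnable sketch (comparing the URL column, item[2]). Note that hasUrl rescans newList for every row, so this approach is O(n²) and will be slow on a million-row list; the set-based answer above avoids that:

```python
def hasUrl(rows, url):
    # Linear scan of the collected rows for a matching URL (third column).
    for item in rows:
        if item[2] == url:
            return True
    return False

oldList = [[304070, '2015:01:01', 'http:something1'],
           [304070, '2015:01:02', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304071, '2015:01:05', 'http:something3']]

newList = []
for item in oldList:
    if not hasUrl(newList, item[2]):
        newList.append(item)

print(newList)
# [[304070, '2015:01:01', 'http:something1'],
#  [304070, '2015:01:02', 'http:something2'],
#  [304071, '2015:01:05', 'http:something3']]
```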
my_list = [[304070, '2015:01:01', 'http:something1'],
           [304070, '2015:01:02', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304071, '2015:01:04', 'http:something2'],
           [304071, '2015:01:05', 'http:something3'],
           [304071, '2015:01:06', 'http:something3']]
Pull out all of the URLs from the original list. Create a set from this list to generate the unique URL values. Use a list comprehension to iterate through this set and call index on the generated URL list (urls) to locate the first occurrence of each URL. Lastly, use another list comprehension together with enumerate to select the rows whose index values match.
urls = [row[2] for row in my_list]
urls_unique = set(urls)
idx = [urls.index(url) for url in urls_unique]
my_shorter_list = [row for n, row in enumerate(my_list) if n in idx]
>>> my_shorter_list
[[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304071, '2015:01:05', 'http:something3']]
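As a side note, `urls.index(url)` itself scans the whole list, so the approach above is O(n²) on a million rows. A sketch of an equivalent single-pass version keyed by URL (plain dicts preserve insertion order in Python 3.7+):

```python
my_list = [[304070, '2015:01:01', 'http:something1'],
           [304070, '2015:01:02', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304071, '2015:01:04', 'http:something2'],
           [304071, '2015:01:05', 'http:something3'],
           [304071, '2015:01:06', 'http:something3']]

# setdefault stores a row only the first time its URL is seen,
# so each URL keeps its earliest row, in first-seen order.
first_by_url = {}
for row in my_list:
    first_by_url.setdefault(row[2], row)

my_shorter_list = list(first_by_url.values())
print(my_shorter_list)
# [[304070, '2015:01:01', 'http:something1'],
#  [304070, '2015:01:02', 'http:something2'],
#  [304071, '2015:01:05', 'http:something3']]
```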