
Keep all elements in one list from another

I have two large lists, train and keep, with the latter containing unique elements, e.g.:

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

Is there a way to create a new list that has all the elements of train that are in keep, using sets? The end result should be:

train_keep = [1, 3, 4, 3, 1]

Currently I'm using itertools.filterfalse from "how to keep elements of a list based on another list", but it is very slow as the lists are large.

Convert the list keep into a set, since that is what will be checked frequently. Iterate over train, since you want to keep order and repeats. That rules out turning train itself into a set, which would lose both. Even if it were an option, it wouldn't help, since the iteration would have to happen anyway:

keeps = set(keep)
train_keep = [k for k in train if k in keeps]

A lazier, and probably slower, version would be something like:

train_keep = filter(lambda x: x in keeps, train)
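In Python 3, filter returns a lazy iterator, so it has to be materialized before the result can be compared with the list comprehension. A minimal sketch on the question's sample data; note that keeps.__contains__ is swapped in for the lambda here (my substitution, not part of the answer), which avoids one Python-level call per element:

```python
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

keeps = set(keep)

# Eager version: builds the whole list at once.
eager = [k for k in train if k in keeps]

# Lazy version: nothing is filtered until the iterator is consumed.
lazy = filter(keeps.__contains__, train)
lazy_result = list(lazy)

print(eager)        # [1, 3, 4, 3, 1]
print(lazy_result)  # [1, 3, 4, 3, 1]
```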

Neither of these options will give you a large speedup. You'd probably be better off using numpy or pandas, or some other library that implements the loops in C and stores numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:

import numpy as np

train = np.array([...])
keep = np.array([...])
train_keep = train[np.isin(train, keep)]

This is likely an O(M * N) algorithm (with M = train.size and N = keep.size) rather than the O(M) of set lookup, but if checking N elements in keep is faster than a nominally O(1) lookup, you win.
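To make the boolean mask concrete, here is a small sketch of the np.isin approach with the question's sample data filled in (the sample values are the only thing assumed here):

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# One boolean per element of train: True where the value occurs in keep.
mask = np.isin(train, keep)
train_keep = train[mask]

print(mask.astype(int))  # [1 0 1 1 0 0 0 0 1 0 1]
print(train_keep)        # [1 3 4 3 1]
```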

You can get something closer to O(M log(N)) using sorted lookup:

train = np.array([...])
keep = np.array([...])
keep.sort()

ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
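A step-by-step trace of the sorted lookup on the question's sample data may help show why the clamping line is needed. This is only a sketch; the intermediate name matched is mine:

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])
keep.sort()

# Index at which each train value would be inserted into the sorted keep.
ind = np.searchsorted(keep, train, side='left')
print(ind)   # [0 1 1 2 3 3 3 3 1 1 0]

# Values larger than keep.max() get index keep.size, which is out of bounds,
# so clamp those to the last valid index before indexing into keep.
ind[ind == keep.size] -= 1
print(ind)   # [0 1 1 2 2 2 2 2 1 1 0]

# A train value is kept only if the candidate found in keep is an exact match.
matched = keep[ind] == train
train_keep = train[matched]
print(train_keep)  # [1 3 4 3 1]
```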

A better alternative might be to append np.inf or a maximum out-of-bounds integer to the sorted keep array, so that you don't need the extra step that distinguishes missing elements from edge elements at all. Something like max(train.max() + 1, keep.max()) would do:

train = np.array([...])
keep = np.array([..., 99999])  # 99999 stands in for the sentinel value
keep.sort()

ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]
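Here is a sketch of the sentinel variant with the sentinel computed rather than hard-coded. The names sentinel and keep_padded are mine, and the sketch assumes integer data where train.max() + 1 does not overflow the dtype:

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# Sentinel: strictly greater than every value in train and no smaller than
# keep.max(), so appending it to the sorted array keeps the array sorted.
sentinel = max(train.max() + 1, keep.max())
keep_padded = np.append(np.sort(keep), sentinel)

# Every train value is below the sentinel, so no index can fall out of bounds.
ind = np.searchsorted(keep_padded, train, side='left')
train_keep = train[keep_padded[ind] == train]

print(keep_padded)  # [1 3 4 6]
print(train_keep)   # [1 3 4 3 1]
```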

For random inputs with train.size = 10000 and keep.size = 10, the numpy method is ~10x faster on my laptop.
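A comparison along these lines can be reproduced with timeit. The sketch below is not the benchmark behind the figure above: the value range (0 to 99), the seed, and the repeat count are assumptions of mine, and the timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed, arbitrary choice
train_arr = rng.integers(0, 100, size=10_000)         # assumed value range 0..99
keep_arr = rng.choice(100, size=10, replace=False)    # 10 unique values

train_list = train_arr.tolist()
keep_set = set(keep_arr.tolist())

def with_set():
    # Pure-Python set lookup over a list.
    return [x for x in train_list if x in keep_set]

def with_isin():
    # Vectorized membership test over an array.
    return train_arr[np.isin(train_arr, keep_arr)]

# Both approaches must select the same elements in the same order.
assert with_set() == with_isin().tolist()

t_set = timeit.timeit(with_set, number=100)
t_isin = timeit.timeit(with_isin, number=100)
print(f"set lookup: {t_set:.4f}s, np.isin: {t_isin:.4f}s")
```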

>>> keep_set = set(keep)
>>> [val for val in train if val in keep_set]
[1, 3, 4, 3, 1]

Note that if keep is small, there might not be any performance advantage to converting it to a set (benchmark to make sure).
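One way to run that benchmark is with timeit. In this sketch the sample train list is repeated 1000 times to get a measurable workload; that factor and the repeat count are arbitrary choices of mine, and no particular outcome is implied:

```python
import timeit

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1] * 1000
keep = [1, 3, 4]
keep_set = set(keep)

def filter_with_list():
    # Membership test scans the short list each time.
    return [val for val in train if val in keep]

def filter_with_set():
    # Membership test is a hash lookup.
    return [val for val in train if val in keep_set]

t_list = timeit.timeit(filter_with_list, number=100)
t_set = timeit.timeit(filter_with_set, number=100)
print(f"list membership: {t_list:.4f}s, set membership: {t_set:.4f}s")
```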

This is an option:

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

keep_set = set(keep)
res = [item for item in train if item in keep_set]
# [1, 3, 4, 3, 1]

I use keep_set in order to speed up the look-up a bit.

The logic is the same, but give it a try; maybe a generator is faster for your case:

def keep_if_in(to_keep, ary):
  for element in ary:
    if element in to_keep:
      yield element

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
train_keep = keep_if_in(set(keep), train)

Finally, convert to a list when required, or iterate over the generator directly:

print(list(train_keep))

#  alternatively, uncomment this and comment out the line above,
#  it's because a generator can be consumed once
#  for e in train_keep:
#    print(e)
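The single-use behaviour mentioned in the comments can be seen directly by consuming the generator twice. A small sketch (first_pass and second_pass are illustrative names):

```python
def keep_if_in(to_keep, ary):
    for element in ary:
        if element in to_keep:
            yield element

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

train_keep = keep_if_in(set(keep), train)

first_pass = list(train_keep)    # consumes the generator
second_pass = list(train_keep)   # generator is already exhausted

print(first_pass)   # [1, 3, 4, 3, 1]
print(second_pass)  # []
```

If the result is needed more than once, store the list from the first pass or call keep_if_in again.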

This is a slight expansion of Mad Physicist's clever technique, to cover a situation where the lists contain strings and one of them is a dataframe column. (I was trying to find a list of items in a dataframe, including all duplicates, but the obvious answer, mylist.isin(df['col']), removed the duplicates.) I adapted his answer to deal with the problem of possible truncation of character data by NumPy.

import numpy as np
import pandas as pd

#Sample dataframe with strings
d = {'train': ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510l','ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510d02','ABC_S8#Q09#2#510c#8y','ABC_S8#Q09#2#510a#6'], 'col2': [1,2,3,4,5,6]}
df = pd.DataFrame(data=d)

keep_list = ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510b13','ABC_S8#Q09#2#510c#8y']

#Make sure the NumPy datatype accommodates the longest string in either list
maxlen = max(len(max(keep_list, key=len)), len(max(df['train'], key=len)))
strtype = '<U' + str(maxlen)

#Convert lists to Numpy arrays
keep = np.array(keep_list,dtype = strtype)
train = np.array(df['train'],dtype = strtype)

#Algorithm
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = df[keep[ind] == df['train']] #reference the original dataframe

I found this to be much faster than other solutions I tried.
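The truncation concern can be illustrated in isolation: when a fixed-width unicode dtype is too narrow, NumPy silently cuts the strings to fit. A minimal sketch, where the width 10 is chosen arbitrarily to force the problem:

```python
import numpy as np

values = ['ABC_S8#Q09#2#510a#6', 'ABC_S8#Q09#2#510l']

# A dtype narrower than the data truncates each string without warning.
too_narrow = np.array(values, dtype='<U10')

# Sizing the dtype from the longest string keeps every value intact.
maxlen = max(len(v) for v in values)
wide_enough = np.array(values, dtype=f'<U{maxlen}')

print(too_narrow[0])   # ABC_S8#Q09
print(wide_enough[0])  # ABC_S8#Q09#2#510a#6
```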

