Keep all elements in one list from another
I have two large lists `train` and `keep`, with the latter containing unique elements, e.g.
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
Is there a way to create a new list that has all the elements of `train` that are in `keep`, using sets? The end result should be:
train_keep = [1, 3, 4, 3, 1]
Currently I'm using `itertools.filterfalse` from "how to keep elements of a list based on another list", but it is very slow as the lists are large...
Convert the list `keep` into a `set`, since it will be checked frequently. Iterate over `train`, since you want to keep order and repeats; that makes a `set` not an option for `train`. Even if it were, it wouldn't help, since the iteration would have to happen anyway:
keeps = set(keep)
train_keep = [k for k in train if k in keeps]
A lazier, and probably slower, version would be something like:
train_keep = filter(lambda x: x in keeps, train)
Neither of these options will give you a large speedup. You'd probably be better off using numpy, pandas, or some other library that implements the loops in C and stores numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:
import numpy as np

train = np.array([...])
keep = np.array([...])
train_keep = train[np.isin(train, keep)]
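Filling in the example data from the question, the `np.isin` approach looks like this (a small self-contained sketch):

```python
import numpy as np

# Example data from the question
train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# Boolean mask: True wherever an element of train appears in keep
mask = np.isin(train, keep)
train_keep = train[mask]
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```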
This is likely an O(M * N) algorithm rather than an O(M) set lookup, but if checking the N elements in `keep` is faster than a nominally O(1) set lookup, you win.
You can get something closer to O(M log(N)) using a sorted lookup:
train = np.array([...])
keep = np.array([...])
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
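To make the clamping step concrete, here is the same sorted lookup run on the example data from the question, with each step commented (a sketch; the small arrays are just for illustration):

```python
import numpy as np

# Example data from the question
train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

keep.sort()
# Insertion point of each train element into the sorted keep array
ind = np.searchsorted(keep, train, side='left')
# Elements larger than keep.max() get index keep.size; clamp them
# so they can safely be used to index into keep below
ind[ind == keep.size] -= 1
# Keep only the positions where the looked-up value actually matches
train_keep = train[keep[ind] == train]
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```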
A better alternative might be to append `np.inf` or a maximum out-of-bounds integer to the sorted `keep` array, so you don't have to distinguish missing elements from edge elements at all. Something like `max(train.max(), keep.max()) + 1` would do:
train = np.array([...])
keep = np.array([..., 99999])  # 99999: out-of-bounds sentinel
keep.sort()
ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]
For random inputs with `train.size = 10000` and `keep.size = 10`, the numpy method is ~10x faster on my laptop.
>>> keep_set = set(keep)
>>> [val for val in train if val in keep_set]
[1, 3, 4, 3, 1]
Note that if `keep` is small, there might not be any performance advantage to converting it to a `set` (benchmark to make sure).
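For instance, with only a handful of values in `keep`, a plain list scan can be competitive with a set lookup; the only way to know for your data is to time both (the sizes below are arbitrary, chosen just for illustration):

```python
import timeit

train = list(range(10)) * 1_000   # 10,000 elements
keep_list = [1, 3, 4]             # tiny membership collection
keep_set = set(keep_list)

# Same filtering, different membership containers
t_list = timeit.timeit(lambda: [x for x in train if x in keep_list], number=50)
t_set = timeit.timeit(lambda: [x for x in train if x in keep_set], number=50)
print(f'list: {t_list:.4f}s  set: {t_set:.4f}s')
```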
This is an option:
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
keep_set = set(keep)
res = [item for item in train if item in keep_set]
# [1, 3, 4, 3, 1]
I use `keep_set` in order to speed up the look-up a bit.
The logic is the same, but give it a try; maybe a generator is faster in your case:
def keep_if_in(to_keep, ary):
    for element in ary:
        if element in to_keep:
            yield element
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
train_keep = keep_if_in(set(keep), train)
Finally, convert to a list when required, or iterate the generator directly:
print(list(train_keep))
# Alternatively, uncomment this and comment out the line above;
# note that a generator can only be consumed once:
# for e in train_keep:
#     print(e)
This is a slight expansion of Mad Physicist's clever technique, to cover a situation where the lists contain characters and one of them is a dataframe column (I was trying to find a list of items in a dataframe, including all duplicates, but the obvious answer, `mylist.isin(df['col'])`, removed the duplicates). I adapted his answer to deal with the problem of possible truncation of character data by Numpy.
import numpy as np
import pandas as pd

#Sample dataframe with strings
d = {'train': ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510l','ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510d02','ABC_S8#Q09#2#510c#8y','ABC_S8#Q09#2#510a#6'], 'col2': [1,2,3,4,5,6]}
df = pd.DataFrame(data=d)
keep_list = ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510b13','ABC_S8#Q09#2#510c#8y']
#Make sure the Numpy datatype accommodates the longest string in either list
maxlen = max(len(max(keep_list, key = len)),len(max(df['train'], key = len)))
strtype = '<U'+ str(maxlen)
#Convert lists to Numpy arrays
keep = np.array(keep_list,dtype = strtype)
train = np.array(df['train'],dtype = strtype)
#Algorithm
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = df[keep[ind] == df['train']] #reference the original dataframe
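With the sample data above, the filtered frame keeps the duplicate matches (rows 0, 2, 4 and 5). As a self-contained check, repeating the sample data:

```python
import numpy as np
import pandas as pd

# Sample data repeated from the snippet above
d = {'train': ['ABC_S8#Q09#2#510a#6', 'ABC_S8#Q09#2#510l',
               'ABC_S8#Q09#2#510a#6', 'ABC_S8#Q09#2#510d02',
               'ABC_S8#Q09#2#510c#8y', 'ABC_S8#Q09#2#510a#6'],
     'col2': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)
keep_list = ['ABC_S8#Q09#2#510a#6', 'ABC_S8#Q09#2#510b13', 'ABC_S8#Q09#2#510c#8y']

# Width the string dtype needs so nothing gets truncated
maxlen = max(len(max(keep_list, key=len)), len(max(df['train'], key=len)))
strtype = '<U' + str(maxlen)
keep = np.array(keep_list, dtype=strtype)
train = np.array(df['train'], dtype=strtype)

keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = df[keep[ind] == df['train']]
print(train_keep['col2'].tolist())  # [1, 3, 5, 6]
```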
I found this to be much faster than the other solutions I tried.