Find symmetric pairs quickly in numpy
from itertools import product
import numpy as np
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of every symmetric pair in this data frame?
For example, '(0, 1)' and '(1, 0)' form a symmetric pair and count as equal; the latter should be removed.
The algorithm must be fast, so NumPy is recommended. Converting to Python objects is not allowed.
You can sort the values within each row, then groupby:
a= np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
Option 2: If you have a lot of distinct (c1, c2) pairs, groupby can be slow. In that case, we can assign the row-sorted values as new columns and filter with drop_duplicates:
a= np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:,0], two=a[:,1]) # one and two can be changed
.drop_duplicates(['one','two']) # taken from above
.reindex(df.columns, axis=1)
)
One way is to use np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
    c1  c2
0    0   0
1    0   1
20   2   0
3    0   3
40   4   0
50   5   0
6    0   6
70   7   0
8    0   8
9    0   9
11   1   1
21   2   1
13   1   3
41   4   1
51   5   1
16   1   6
71   7   1
...
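The output above comes back ordered by the unique sorted pair rather than by the original row position. If the original row order matters, one possible adjustment is to sort the returned indices before indexing. This is a minimal sketch on a small made-up frame (the data and the name `kept` are illustrative assumptions, not part of the answer above):

```python
import numpy as np
import pandas as pd

# Small illustrative frame: rows 0/1 and rows 3/4 are symmetric pairs.
df = pd.DataFrame({"c1": [2, 0, 1, 0, 1], "c2": [0, 2, 1, 1, 0]})

# Sort each row so that (2, 0) and (0, 2) become the same key (0, 2).
a = np.sort(df.to_numpy(), axis=1)

# ix holds the index of the first occurrence of each unique key,
# listed in the sorted order of the keys.
_, ix = np.unique(a, axis=0, return_index=True)

# Sorting ix restores the original row order of the kept rows.
kept = df.iloc[np.sort(ix)]
print(kept)
```

Because np.unique reports the first occurrence of each key, the later member of every symmetric pair is the one that gets dropped.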
Using frozenset:
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
I would do:
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
Using pandas and numpy (np.triu):
s=pd.crosstab(df.c1,df.c2)
s=s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()  # np.bool was removed in NumPy 1.24; the builtin bool works everywhere
Here's a NumPy-based one for integers:
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a,axis=1)
    idx = np.ravel_multi_index(b.T,(b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True,p[:-1]!=p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
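For reference, here is a sketch of what the index-preserving variant might look like when written out in full. The function name `remove_symm_pairs_keep_index` and the sample data are illustrative assumptions, and the sketch assumes a non-empty frame of non-negative integers, since np.ravel_multi_index rejects negative indices:

```python
import numpy as np
import pandas as pd

def remove_symm_pairs_keep_index(df):
    # Hypothetical name; assumes two columns of non-negative integers.
    a = df.to_numpy()
    # Sort within each row so both orientations of a pair look identical.
    b = np.sort(a, axis=1)
    # Collapse each sorted pair into a single integer key.
    dims = tuple(b.max(0) + 1)
    idx = np.ravel_multi_index((b[:, 0], b[:, 1]), dims)
    # A stable sort keeps the earliest row first within each group of equal keys.
    sidx = idx.argsort(kind="mergesort")
    p = idx[sidx]
    # True marks the first element of each run of equal keys.
    m = np.r_[True, p[:-1] != p[1:]]
    # Restore original row order and keep the original index labels.
    return df.iloc[np.sort(sidx[m])]

df = pd.DataFrame({"c1": [0, 1, 2, 1, 3], "c2": [1, 0, 2, 3, 1]})
out = remove_symm_pairs_keep_index(df)
print(out.index.tolist())  # [0, 2, 3]
```

The stable sort is what makes the first occurrence win, so the later member of each symmetric pair is removed.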
For generic numbers (ints/floats, etc.), we will use a view-based one:
# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step that computes idx with idx = view1D(b) in remove_symm_pairs.
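Putting the two pieces together, the substitution might look roughly like the sketch below. The name `remove_symm_pairs_generic` and the float sample data are illustrative assumptions. The void view compares rows by their raw bytes, so values that are numerically equal but stored differently (for example 0.0 and -0.0) would be treated as distinct:

```python
import numpy as np
import pandas as pd

def view1D(a):
    # View each row of a 2D array as one opaque (void) element.
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def remove_symm_pairs_generic(df):
    # Hypothetical name; same steps as the integer version, with a byte-wise row key.
    a = df.to_numpy()
    b = np.sort(a, axis=1)
    idx = view1D(b)
    # Stable sort so the earliest row of each group of equal keys comes first.
    sidx = idx.argsort(kind="mergesort")
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]
    return df.iloc[np.sort(sidx[m])]

df = pd.DataFrame({"c1": [0.5, 1.5, 2.0, 1.5], "c2": [1.5, 0.5, 2.0, 0.5]})
out = remove_symm_pairs_generic(df)
print(out)
```

The byte order of the keys is not numeric order, but that does not matter here because the sort is only used to bring equal keys next to each other.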
If this needs to be fast, and if your variables are integers, then the following trick may help: let v, w be the columns of your array; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back with [v, w] = [(x+y), (x-y)]/2 (this returns each pair with the larger value first, since x+y is twice the maximum and x-y is twice the minimum).
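The sum/difference trick can be sketched as follows. The function name `drop_symmetric_sum_diff` and the sample data are illustrative assumptions, and np.unique with axis=0 stands in for the "sort lexicographically and remove duplicates" step. Rather than only mapping back, the sketch uses the first-occurrence indices to select the original rows, and shows the back-mapping separately:

```python
import numpy as np
import pandas as pd

def drop_symmetric_sum_diff(df):
    # Hypothetical name; assumes two integer columns.
    a = df.to_numpy()
    v, w = a[:, 0], a[:, 1]
    # (x, y) is identical for (v, w) and (w, v), and unique per unordered pair.
    x = v + w
    y = np.abs(v - w)
    key = np.stack([x, y], axis=1)
    # np.unique sorts the keys lexicographically and drops duplicates;
    # return_index gives the first occurrence of each key.
    uniq, first = np.unique(key, axis=0, return_index=True)
    # Map the unique keys back to pairs: (x+y)/2 is the max, (x-y)/2 the min.
    restored = np.stack([(uniq[:, 0] + uniq[:, 1]) // 2,
                         (uniq[:, 0] - uniq[:, 1]) // 2], axis=1)
    return df.iloc[np.sort(first)], restored

df = pd.DataFrame({"c1": [0, 1, 2, 1, 3], "c2": [1, 0, 2, 3, 1]})
out, restored = drop_symmetric_sum_diff(df)
print(out.index.tolist())  # [0, 2, 3]
```

Note that x+y and x-y always have the same parity, so the integer division by 2 is exact.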