
Find symmetric pairs quickly in numpy

from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
# 
# [90 rows x 2 columns]

How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?

An example of a symmetric pair: '(0, 1)' is considered equal to '(1, 0)'. The latter should be removed.

The algorithm must be fast, so numpy is recommended. Converting to Python objects is not allowed.

You can sort the values within each row, then groupby:

import numpy as np

# sort within each row so that (1, 0) and (0, 1) become the same key
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()

Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:

a = np.sort(df.to_numpy(), axis=1)

(df.assign(one=a[:,0], two=a[:,1])   # helper columns; the names 'one' and 'two' are arbitrary
   .drop_duplicates(['one','two'])   # keep the first row of each sorted pair
   .reindex(df.columns, axis=1)      # drop the helper columns again
)
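As a rough check that both options keep the same rows, here is a minimal sketch on a small made-up frame (`toy`, `opt1` and `opt2` are illustrative names, not part of the answer above). The `c1`/`c2` columns are selected explicitly from the groupby result, on the assumption that some pandas versions may add the grouping keys as extra columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: (1, 0) mirrors (0, 1) and (2, 1) mirrors (1, 2).
toy = pd.DataFrame({"c1": [0, 1, 1, 2, 3], "c2": [1, 0, 2, 1, 3]})
a = np.sort(toy.to_numpy(), axis=1)

# Option 1: group on the row-sorted values, keep the first row of each group.
opt1 = toy.groupby([a[:, 0], a[:, 1]], as_index=False, sort=False).first()
opt1 = opt1[["c1", "c2"]]

# Option 2: add the row-sorted values as helper columns and drop duplicates.
opt2 = (toy.assign(one=a[:, 0], two=a[:, 1])
           .drop_duplicates(["one", "two"])
           .reindex(toy.columns, axis=1))

print(opt1.to_numpy().tolist())
print(opt2.to_numpy().tolist())
print(opt2.index.tolist())
```

Option 2 keeps the original index labels of the surviving rows, which can matter if the result is joined back to other data.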

One way is to use np.unique with return_index=True and use the result to index the dataframe:

a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)

print(df.iloc[ix, :])

    c1  c2
0    0   0
1    0   1
20   2   0
3    0   3
40   4   0
50   5   0
6    0   6
70   7   0
8    0   8
9    0   9
11   1   1
21   2   1
13   1   3
41   4   1
51   5   1
16   1   6
71   7   1
...
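Note that the rows above come back in the order of the sorted unique pairs, not in the original row order. If the original order matters, one possible tweak (a sketch on made-up data; `toy` and `kept` are illustrative names) is to sort the returned indices before indexing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: (0, 2) at position 3 mirrors (2, 0) at position 0,
# and (1, 0) at position 2 mirrors (0, 1) at position 1.
toy = pd.DataFrame({"c1": [2, 0, 1, 0, 3], "c2": [0, 1, 0, 2, 3]})

a = np.sort(toy.to_numpy(), axis=1)
# ix holds the first occurrence of each unique sorted row,
# ordered by the unique rows themselves.
_, ix = np.unique(a, axis=0, return_index=True)

# Sorting ix restores the original row order.
kept = toy.iloc[np.sort(ix)]
print(ix.tolist())
print(kept)
```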

frozenset

mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()

df[~mask]

I would do:

df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
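To sanity-check this one-liner, the sketch below rebuilds a frame like the one in the question and splits the expression into steps. The `random_state=0` seed and the names `canon`, `mask`, `out` and `canon_out` are assumptions added here for reproducibility and readability.

```python
from itertools import product

import numpy as np
import pandas as pd

# Rebuild a frame like the question's; the seed is an added assumption.
df = pd.DataFrame(list(product(range(10), range(10))), columns=["c1", "c2"])
df = df.sample(90, random_state=0).sort_values(["c1", "c2"]).reset_index(drop=True)

# Row-wise sort gives every symmetric pair the same canonical form.
canon = np.sort(df.to_numpy(), axis=1)

# duplicated() flags every canonical row that already appeared earlier.
mask = pd.DataFrame(canon).duplicated().to_numpy()
out = df[~mask]

# After filtering, each canonical pair should appear exactly once.
canon_out = np.sort(out.to_numpy(), axis=1)
print(len(df), len(out), len(np.unique(canon_out, axis=0)))
```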

Using pandas crosstab and numpy triu:

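s = pd.crosstab(df.c1, df.c2)
# builtin bool replaces np.bool (removed in NumPy 1.24); `&` binds tighter than `==`,
# so the original condition already parsed as (triu & s) == 0, made explicit here
s = s.mask((np.triu(np.ones(s.shape)).astype(bool) & s) == 0).stack().reset_index()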

Here's a NumPy-based one for integers:

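import numpy as np
import pandas as pd

def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)                         # canonical form: (1, 0) -> (0, 1)
    idx = np.ravel_multi_index(b.T, b.max(0) + 1)  # one integer key per row (needs non-negative ints)
    sidx = idx.argsort(kind='mergesort')           # stable sort keeps first occurrences first
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]               # True at the start of each run of equal keys
    a_out = a[np.sort(sidx[m])]                    # first occurrences, in original row order
    df_out = pd.DataFrame(a_out)
    return df_out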

If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])] instead.
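A minimal sketch of that index-preserving variant, assuming non-negative integer columns. The function name `remove_symm_pairs_keep_index` and the toy data are made up for illustration, and the arguments to `np.ravel_multi_index` are converted to tuples here only to match its documented signature.

```python
import numpy as np
import pandas as pd

def remove_symm_pairs_keep_index(df):
    # Hypothetical name; same steps as remove_symm_pairs, but returns rows of df.
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    # One integer key per row; assumes non-negative integers.
    idx = np.ravel_multi_index(tuple(b.T), tuple(b.max(axis=0) + 1))
    sidx = idx.argsort(kind="mergesort")   # stable: earliest row first within ties
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]       # start of each run of equal keys
    return df.iloc[np.sort(sidx[m])]

# Toy data with a non-default index to show that labels survive.
toy = pd.DataFrame({"c1": [0, 1, 1, 2, 3], "c2": [1, 0, 2, 1, 3]},
                   index=list("abcde"))
result = remove_symm_pairs_keep_index(toy)
print(result)
```

Unlike the version that rebuilds a frame from the raw array, this also keeps the column names.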

For generic numbers (ints, floats, etc.), we can use a view-based one:

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

and simply replace the step that computes idx with idx = view1D(b) in remove_symm_pairs.
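Put together, the substitution might look like the sketch below. The function name `remove_symm_pairs_generic` and the float toy data are assumptions for illustration. It relies on NumPy being able to sort and compare the raw-bytes (void) view, and it compares rows bytewise, so values such as 0.0 and -0.0 would be treated as different.

```python
import numpy as np
import pandas as pd

def view1D(a):
    # View each row of a 2D array as a single opaque (void) element.
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def remove_symm_pairs_generic(df):
    # Hypothetical name; remove_symm_pairs with idx taken from view1D.
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = view1D(b)                        # one bytes-like key per row
    sidx = idx.argsort(kind="mergesort")   # stable sort on the raw bytes
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]       # start of each run of equal keys
    return df.iloc[np.sort(sidx[m])]

# Made-up float data: (1.5, 0.5) mirrors (0.5, 1.5).
toy = pd.DataFrame({"c1": [0.5, 1.5, 2.0, 1.5], "c2": [1.5, 0.5, 2.0, 2.0]})
result = remove_symm_pairs_generic(toy)
print(result)
```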

If this needs to be fast, and if your variables are integers, then the following trick may help: let v, w be the two columns of your data; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back with [v, w] = [(x+y), (x-y)]/2.
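One way to read this recipe is sketched below. The function name `dedupe_via_sum_diff` and the toy data are made up, and `np.unique(..., axis=0)` is used here as a stand-in for the "sort lexicographically, remove duplicates" step. Because x+y is twice the larger value and x-y is twice the smaller one, the mapping back returns each pair as (larger, smaller), sorted by (x, y), rather than the first occurrence from the original frame.

```python
import numpy as np
import pandas as pd

def dedupe_via_sum_diff(df):
    # Hypothetical helper illustrating the sum / absolute-difference trick.
    v = df["c1"].to_numpy()
    w = df["c2"].to_numpy()
    x = v + w            # symmetric in v and w
    y = np.abs(v - w)    # symmetric in v and w
    # Sorts rows lexicographically and removes duplicate (x, y) rows.
    uniq = np.unique(np.column_stack([x, y]), axis=0)
    x_u, y_u = uniq[:, 0], uniq[:, 1]
    # x + y == 2 * max(v, w) and x - y == 2 * min(v, w), so // 2 is exact.
    return pd.DataFrame({"c1": (x_u + y_u) // 2, "c2": (x_u - y_u) // 2})

toy = pd.DataFrame({"c1": [0, 1, 1, 2, 3], "c2": [1, 0, 2, 1, 3]})
result = dedupe_via_sum_diff(toy)
print(result)
```

If the original orientation or row order of the first occurrence is needed, one of the index-based approaches above is probably a better fit.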

