
Determine if matrix A is a subset of matrix B

For a matrix such as

A = [...
    12 34 67;
    90 78 15;
    10 71 24];

how could we efficiently determine whether it is a subset of a larger matrix?

B = [...
    12 34 67;                        % found
    89 67 45;
    90 78 15;                        % found  
    10 71 24;                        % found, so A is subset of B. 
    54 34 11];

Here are the conditions:

  • all numbers are integers
  • the matrices are large, i.e., row# > 100000; column# may vary from 1 to 10 (same for A and B).

Edit: It seems that ismember works just fine for this question when it is called only a few times. My initial impression was due to previous experiences where ismember was invoked many times inside a nested loop, resulting in the worst performance.

clear all; clc
n = 200000;
k = 10;
B = randi(n,n,k);
f = randperm(n);
A = B(f(1:1000),:);
tic
assert(sum(ismember(A,B,'rows')) == size(A,1));
toc
tic
assert(all(any(all(bsxfun(@eq,B,permute(A,[3,2,1])),2),1))); %user2999345
toc

which results in:

Elapsed time is 1.088552 seconds.
Elapsed time is 12.154969 seconds.

Here are more benchmarks:

clear all; clc
n = 20000;
f = randperm(n);
k = 10;
t1 = 0;
t2 = 0;
t3 = 0;
for i=1:7
    B = randi(n,n,k);
    A = B(f(1:n/10),:);
    %A(100,2) = 0;                                      % to make A not submat of B
    tic
    b = sum(ismember(A,B,'rows')) == size(A,1);
    t1 = t1+toc;
    assert(b);
    tic
    b = ismember_mex(A,sortrows(B));
    t2 = t2+toc;
    assert(b);
    tic
    b = issubmat(A,B);
    t3 = t3+toc;
    assert(b);
end

                             George's       skm's
                ismember | ismember_mex | issubmat
n=20000,k=10      0.6326      0.1064      11.6899
n=1000,k=100      0.2652      0.0155       0.0577
n=1000,k=1000     1.1705      0.1582       0.2202
n=1000,k=10000   13.2470      2.0033       2.6367
*issubmat eats RAM when n or k is over 10000!
*issubmat(A,B), A is being checked as submat of B. 

For small matrices, ismember should probably be enough. Usage: ismember(B,A,'rows')

ans =
   1
   0
   1
   1
   0

I put this answer here to emphasize the need for solutions with higher performance. I will accept this answer only if there is no better solution.

Using ismember, if a row of A appears twice in B while another one is missing, the result might wrongly indicate that A is a member of B. The following solution is suitable if the rows of A and B don't need to be in the same order. However, I haven't tested its performance for large matrices.

A = [...
34 12 67;
90 78 15;
10 71 24];
B = [...
34 12 67;                        % found
89 67 45;
90 78 15;                        % found  
10 71 24;                        % found, so A is subset of B. 
54 34 11];
A = permute(A,[3 2 1]);
rowIdx = all(bsxfun(@eq,B,A),2);
colIdx = any(rowIdx,1);
isAMemberB = all(colIdx);

You have said the number of columns is <= 10. In addition, if the matrix elements are all integers representable as bytes, you could encode each row into two 64-bit integers. Since each 64-bit integer holds 8 byte-sized elements, that would reduce the work per row from up to 10 element comparisons to 2 integer comparisons.
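The packing idea can be sketched in C as follows. This is only an illustration under stated assumptions: the row is stored contiguously (row-major, unlike MATLAB's column-major layout), every entry is an integer in 0..255, and the names packed_row, pack_row and packed_equal are hypothetical helpers, not part of any answer here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: two 64-bit words hold up to 16 byte-sized columns. */
typedef struct {
    uint64_t hi;   /* columns 0..7  */
    uint64_t lo;   /* columns 8..15 */
} packed_row;

/* Pack one contiguous row whose entries are integers in 0..255.
   ncol must be at most 16; unused byte positions stay zero. */
packed_row pack_row(const double *row, size_t ncol) {
    packed_row p = {0, 0};
    size_t j;
    assert(ncol <= 16);
    for (j = 0; j < ncol; ++j) {
        uint64_t byte;
        assert(row[j] >= 0 && row[j] <= 255);   /* assumption: byte range */
        byte = (uint64_t) row[j];
        if (j < 8) {
            p.hi |= byte << (8 * (7 - j));       /* column 0 is the top byte */
        } else {
            p.lo |= byte << (8 * (15 - j));
        }
    }
    return p;
}

/* For rows of the same width, equality needs two integer comparisons. */
int packed_equal(packed_row a, packed_row b) {
    return a.hi == b.hi && a.lo == b.lo;
}
```

Because column 0 lands in the most significant byte, comparing (hi, lo) pairs also preserves the lexicographic order of the rows, so the packed values could be sorted and searched instead of the full rows.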

For the general case, the following may not be much better for thin matrices, but it scales very well as the matrices get fat, thanks to the level 3 (BLAS matrix-matrix) multiplication:

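function yes = is_submat(A,B)
   % Intended to return true when every row of B appears in A.
   % Note the argument order: A is the larger matrix here.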
   ma = size(A, 1);
   mb = size(B, 1);
   n = size(B, 2);

   yes = false;
   if ma >= mb
      a = A(:,1);
      b = B(:,1);

      D = (0 == bsxfun(@minus, a, b'));
      q = any(D, 2);

      yes = all(any(D,1));
      if yes && (n > 1)
         A = A(q, :);

         C = B*A';

         za = sum(A.*A, 2);
         zb = sum(B.*B, 2);
         Z = sqrt(zb)*sqrt(za');

         [~, ix] = max(C./Z, [], 2);

         A = A(ix,:);
         yes = all(A(:) == B(:));
      end
   end
end

In the above, I use the fact that the dot product is maximized when two unit vectors are equal.
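For reference, this fact is the Cauchy-Schwarz inequality applied to normalized rows. With a a row of A and b a row of B (both nonzero), each entry of C./Z in the function is the left-hand side below:

$$
\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \le 1,
\qquad \text{with equality if and only if } a = \lambda b \text{ for some } \lambda > 0 .
$$

Note that a row that is a positive multiple of another (for example [1 2] and [2 4]) also reaches the maximum, which is presumably why the function finishes with an exact element-wise comparison rather than trusting the maximum alone.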

For fat matrices (say 5000+ columns) with large numbers of unique elements, the performance beats ismember quite handily; otherwise, it is slower than ismember. For thin matrices, ismember is faster by an order of magnitude.

Best case test for this function:

A = randi(50000, [10000, 10000]);
B = A(2:3:end, :);
B = B(randperm(size(B,1)),:);
fprintf('%s: %u\n', 'Number of columns', size(A,2));
fprintf('%s: %u\n', 'Element spread', 50000);
tic; is_submat(A,B); toc;
tic; all(ismember(B,A,'rows')); toc;
fprintf('________\n\n');

is_submat_test; is_submat_test;

Number of columns: 10000

Element spread: 50000

Elapsed time is 10.713310 seconds (is_submat).

Elapsed time is 17.446682 seconds (ismember).

So I have to admit, all round, ismember seems to be much better.

Edits: Edited to correct a bug when there is only one column; fixing this also results in more efficient code. Also, the previous version did not distinguish between positive and negative numbers. Added timing tests.

It seems that ismember is hard to beat, at least using MATLAB code. I created a C implementation which can be built with the MEX compiler.

#include "mex.h"

#if MX_API_VER < 0x07030000
typedef int mwIndex;
typedef int mwSize;
#endif /* MX_API_VER */

#include <math.h>
#include <stdlib.h>
#include <string.h>

int ismember(const double *y, const double *x, int yrow, int xrow, int ncol);

void mexFunction(int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])
{
    mwSize xcol, ycol, xrow, yrow;

    /* output data */
    mxLogical* result;

    /* arguments */
    const mxArray* y;
    const mxArray* x;

    if (nrhs != 2)
    {
        mexErrMsgTxt("2 inputs required.");
    }

    y = prhs[0];
    x = prhs[1];
    ycol = mxGetN(y);
    yrow = mxGetM(y);
    xcol = mxGetN(x);
    xrow = mxGetM(x);

    /* Both inputs must be of type double. */
    if (!mxIsDouble(y) || !mxIsDouble(x))
    {
        mexErrMsgTxt("Input must be of type 'double'.");
    }
    if (xcol != ycol)
    {
        mexErrMsgTxt("Inputs must have the same number of columns");
    }

    /* The output is a logical scalar, so write it through mxGetLogicals */
    plhs[0] = mxCreateLogicalMatrix(1, 1);
    result = mxGetLogicals(plhs[0]);
    *result = (mxLogical) ismember(mxGetPr(y), mxGetPr(x), (int) yrow, (int) xrow, (int) ycol);
}

int ismemberinner(const double *y, int idx, const double *x, int yrow, int xrow, int ncol) {
    int from, to, i;
    from = 0;
    to = xrow-1;

    for(i = 0; i < ncol; ++i) {
        // Perform binary search
        double yi = *(y + i * yrow + idx);
        const double *curx = x + i * xrow;
        int l = from;
        int u = to;
        while(l <= u) {
            int mididx = l + (u-l)/2;
            if(yi < curx[mididx]) {
                u = mididx-1;
            }
            else if(yi > curx[mididx]) {
                l = mididx+1;
            }
            else {
                // This can be further optimized by performing additional binary searches
                for(from = mididx; from > l && curx[from-1] == yi; --from);
                for(to = mididx; to < u && curx[to+1] == yi; ++to);
                break;
            }
        }
        if(l > u) {
            return 0;
        }
    }
    return 1;
}

int ismember(const double *y, const double *x, int yrow, int xrow, int ncol) {
    int i;
    for(i = 0; i < yrow; ++i) {
        if(!ismemberinner(y, i, x, yrow, xrow, ncol)) {
            return 0;
        }
    }
    return 1;
}

Compile it using:

mex -O ismember_mex.c

It can be called as follows:

ismember_mex(y, sortrows(x))

First of all, it assumes that both matrices have the same number of columns. It works by first sorting the rows of the larger matrix (x in this case, the second argument to the function). Then, a type of binary search is employed to identify whether the rows of the smaller matrix (y hereafter) are contained in x. This is done for each row of y separately (see the ismember C function). For a given row of y, it starts from the first entry and uses binary search to find the range of row indices (tracked by the from and to variables) whose first-column value in x matches. This is repeated for the remaining entries within that range, unless some value is not found, in which case it terminates and returns 0.
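For comparison, here is a minimal sketch of a simpler variant of the same idea. It is not the per-column range narrowing described above: it runs one binary search per row of y, comparing whole rows lexicographically. It assumes column-major storage (MATLAB's layout) and that x has already been sorted with sortrows; compare_rows and rows_subset_sorted are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Compare row yi of y with row xi of x, column by column.
   Both matrices are column-major. Returns <0, 0 or >0. */
static int compare_rows(const double *y, size_t nrowy, size_t yi,
                        const double *x, size_t nrowx, size_t xi,
                        size_t ncol) {
    size_t j;
    for (j = 0; j < ncol; ++j) {
        double a = y[j * nrowy + yi];
        double b = x[j * nrowx + xi];
        if (a < b) return -1;
        if (a > b) return 1;
    }
    return 0;
}

/* Returns 1 if every row of y occurs in x, where the rows of x are
   sorted lexicographically (as sortrows does), and 0 otherwise. */
int rows_subset_sorted(const double *y, size_t nrowy,
                       const double *x, size_t nrowx, size_t ncol) {
    size_t i;
    assert(ncol > 0);   /* assumption: at least one column */
    for (i = 0; i < nrowy; ++i) {
        size_t lo = 0, hi = nrowx;   /* half-open range [lo, hi) */
        int found = 0;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int c = compare_rows(y, nrowy, i, x, nrowx, mid, ncol);
            if (c == 0) { found = 1; break; }
            if (c < 0) hi = mid; else lo = mid + 1;
        }
        if (!found) return 0;        /* early exit on the first miss */
    }
    return 1;
}
```

The half-open range [lo, hi) keeps every index non-negative, so unsigned index types are safe here. Like the MEX function, it stops at the first row of y that is missing from x.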

I tried implementing this idea in MATLAB, but it didn't work that well. Regarding performance, I found that: (a) in case there are mismatches, it is usually much faster than ismember; (b) in case the range of values in x and y is large, it is again faster than ismember; and (c) in case everything matches and the number of possible values in x and y is small (e.g. less than 1000), ismember may be faster in some situations. Finally, I want to point out that some parts of the C implementation may be further optimized.

EDIT 1

I fixed the warnings and further improved the function.

#include "mex.h"
#include <math.h>
#include <stdlib.h>
#include <string.h>

int ismember(const double *y, const double *x, unsigned int nrowy, unsigned int nrowx, unsigned int ncol);

void mexFunction(int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])
{
    unsigned int xcol, ycol, nrowx, nrowy;

    /* arguments */
    const mxArray* y;
    const mxArray* x;

    if (nrhs != 2)
    {
        mexErrMsgTxt("2 inputs required.");
    }

    y = prhs[0];
    x = prhs[1];
    ycol = (unsigned int) mxGetN(y);
    nrowy = (unsigned int) mxGetM(y);
    xcol = (unsigned int) mxGetN(x);
    nrowx = (unsigned int) mxGetM(x);

    /* Both inputs must be of type double. */
    if (!mxIsDouble(y) || !mxIsDouble(x))
    {
        mexErrMsgTxt("Input must be of type 'double'.");
    }
    if (xcol != ycol)
    {
        mexErrMsgTxt("Inputs must have the same number of columns");
    }

    plhs[0] = mxCreateLogicalScalar(ismember(mxGetPr(y), mxGetPr(x), nrowy, nrowx, ycol));
}

int ismemberinner(const double *y, const double *x, unsigned int nrowy, unsigned int nrowx, unsigned int ncol) {
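    unsigned int from = 0, to, i;

    if(nrowx == 0) {
        return 0;   // nothing can be found in an empty x; also avoids nrowx-1 wrapping
    }
    to = nrowx-1;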

    for(i = 0; i < ncol; ++i) {
        // Perform binary search
        const double yi = *(y + i * nrowy);
        const double *curx = x + i * nrowx;
        unsigned int l = from;
        unsigned int u = to;
        while(l <= u) {
            const unsigned int mididx = l + (u-l)/2;
            const double midx = curx[mididx];
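            if(yi < midx) {
                // Nothing left below mididx: stop here, since mididx-1 would
                // wrap around for unsigned indices when mididx is 0
                if(mididx == l) {
                    return 0;
                }
                u = mididx-1;
            }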
            else if(yi > midx) {
                l = mididx+1;
            }
            else {
                {
                    // Binary search to identify smallest index of x that equals yi
                    // Equivalent to for(from = mididx; from > l && curx[from-1] == yi; --from)
                    unsigned int limit = mididx;
                    while(curx[from] != yi) {
                        const unsigned int mididx = from + (limit-from)/2;
                        if(curx[mididx] < yi) {
                            from = mididx+1;
                        }
                        else {
                            limit = mididx-1;
                        }
                    }
                }
                {
                    // Binary search to identify largest index of x that equals yi
                    // Equivalent to for(to = mididx; to < u && curx[to+1] == yi; ++to);
                    unsigned int limit = mididx;
                    while(curx[to] != yi) {
                        const unsigned int mididx = limit + (to-limit)/2;
                        if(curx[mididx] > yi) {
                            to = mididx-1;
                        }
                        else {
                            limit = mididx+1;
                        }
                    }
                }
                break;
            }
        }
        if(l > u) {
            return 0;
        }
    }
    return 1;
}

int ismember(const double *y, const double *x, unsigned int nrowy, unsigned int nrowx, unsigned int ncol) {
    unsigned int i;
    for(i = 0; i < nrowy; ++i) {
        if(!ismemberinner(y + i, x, nrowy, nrowx, ncol)) {
            return 0;
        }
    }
    return 1;
}

Using this version, I wasn't able to identify any case where ismember is faster. Also, I noticed that one reason ismember is hard to beat is that it uses all cores of the machine! Of course, the function I provided can be optimized to do this too, but that requires much more effort.

Finally, before using my implementation, I would advise you to test it extensively. I did some testing and it seems to work, but I suggest you do some additional testing of your own.

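Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.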

 