
Best way to find a subset of a set from a group of sets

First, sorry for the ambiguous title.

Assume I have the following group of sets:

Group 1

s1 = ( x1, y1 )
s2 = ( x2 )

Group 2

m1 = ( x1, y1, y2 )
m2 = ( x1 )
m3 = ( x1 , x2 )

For each of the sets in Group 1 - call the set s , I need to find the sets in Group 2 - call it m - such that m is a subset of s .

So, for my example, the answer would be:

s1 -> m2
s2 -> nothing

For now, I'm storing the values in std::set, but I can change that if needed. Also, the sets can get big, so the algorithm needs to be efficient. For now I have a brute-force approach, which I'm not entirely satisfied with.

Any suggestions?

The first step would be to sort Group 1 by cardinality (i.e. size). Then the algorithm is something along the lines of:

// group1 and group2 are containers of std::set; group1 is sorted by size
for (const auto& M : group2) {
  for (const auto& S : group1) {
    if (S.size() < M.size()) continue;  // replace with binary search
    if (std::includes(S.begin(), S.end(), M.begin(), M.end())) {
      /* M is a subset of S */
    }
  }
}

This should have time complexity ~O(MSR), where M is the number of sets in "Group 2", S is the number of sets in "Group 1", and R is the size of the largest set in "Group 1".
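A minimal compilable sketch of the loop above, with the binary search filled in. The function name find_subset_pairs, the Set alias, and the choice of std::string elements are assumptions made for illustration; they are not from the original answer. Instead of reordering Group 1 itself, the sketch sorts a vector of indices so the reported positions still refer to the caller's order.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Set = std::set<std::string>;  // element type is an assumption

// Returns (index into group2, index into group1) pairs such that
// group2[first] is a subset of group1[second].
std::vector<std::pair<std::size_t, std::size_t>>
find_subset_pairs(const std::vector<Set>& group1,
                  const std::vector<Set>& group2) {
    // Indices of group1, ordered by ascending set size.
    std::vector<std::size_t> order(group1.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) {
                  return group1[a].size() < group1[b].size();
              });

    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (std::size_t i = 0; i < group2.size(); ++i) {
        const Set& m = group2[i];
        // Binary search for the first set with at least m.size() elements;
        // smaller sets cannot contain m.
        auto first = std::lower_bound(
            order.begin(), order.end(), m.size(),
            [&](std::size_t idx, std::size_t n) {
                return group1[idx].size() < n;
            });
        for (auto it = first; it != order.end(); ++it) {
            const Set& s = group1[*it];
            if (std::includes(s.begin(), s.end(), m.begin(), m.end()))
                result.emplace_back(i, *it);
        }
    }
    return result;
}
```

With the sets from the question, the only pair this should report is m2 inside s1.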

Edit: It just occurred to me that it might be more efficient to call S.find() for each element of M rather than calling std::includes() (which iterates sequentially), but I think that would only be true if M.size() is much smaller than S.size(): std::includes() costs O(|M|+|S|) per pair of sets, versus O(|M| log |S|) for repeated find().
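For comparison, here is a sketch of the find()-based check described in the edit. The name is_subset_by_find is hypothetical. Each lookup is a logarithmic tree search in s, so this variant is likely to pay off only when m is much smaller than s.

```cpp
#include <algorithm>
#include <set>

// True if every element of m is also in s.
// Performs one O(log |s|) lookup per element of m.
template <typename T>
bool is_subset_by_find(const std::set<T>& m, const std::set<T>& s) {
    if (m.size() > s.size()) return false;  // cannot fit, skip the lookups
    return std::all_of(m.begin(), m.end(), [&](const T& x) {
        return s.find(x) != s.end();
    });
}
```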

You are not specific about how brute-force your approach is. As long as you are using the set algorithms in the std:: namespace, they are likely to be about as efficient as they can be. For example, you can test whether the result of set_intersection( s1.begin(), s1.end(), m1.begin(), m1.end(), out ), where out is an output iterator that collects the common elements, is equal to m1.

You could be more efficient than this, as you do not want a copy of the matching elements, just to know they all appear. This could be done by copying the code of set_intersection but changing the implementation to simply count the number of matching elements rather than copying them out. Then if the count is the same as the size of m then you have a match.
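A sketch of that counting variant, assuming both containers are std::set with the default ordering. The names count_common and is_subset_by_count are made up for this example. The merge-style walk is the same one set_intersection performs, minus the copy.

```cpp
#include <cstddef>
#include <set>

// Counts the elements present in both sets by walking them in sorted
// order, the way std::set_intersection does, without copying anything.
template <typename T>
std::size_t count_common(const std::set<T>& a, const std::set<T>& b) {
    std::size_t count = 0;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i < *j) {
            ++i;
        } else if (*j < *i) {
            ++j;
        } else {  // equal: element is in both sets
            ++count;
            ++i;
            ++j;
        }
    }
    return count;
}

// m is a subset of s exactly when all of m's elements are shared.
template <typename T>
bool is_subset_by_count(const std::set<T>& m, const std::set<T>& s) {
    return count_common(s, m) == m.size();
}
```

Note that std::includes performs a similar walk and can stop as soon as one element of m is missing, so it may be the simpler choice when only a yes/no answer is needed.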

As for containers, I often prefer a sorted deque over a set for large collections. The memory is much less distributed over the heap which helps with caching. It also avoids the overhead of the underlying tree. This is especially beneficial when the containers are created once, but are searched multiple times.

Are your sets frequently modified or are they read-only/mostly?

  • If frequently modified, std::set is a fine balance between modification and sort performance.
  • If read-only or read-mostly, you can use a sorted std::vector . Sorting is expensive, but is actually cheaper than constructing a whole tree in the std::set , so performance is better if you do it rarely enough.

Once you have made the sorted containers (be it "auto-sorted" std::set or manually-sorted std::vector ), you can test for subset using std::includes . BTW, if you need to find proper subsets, you can compare element counts afterwards.
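As a rough illustration of the read-mostly case, the sketch below builds a sorted, duplicate-free std::vector once and then tests it with std::includes. The helper names make_sorted_unique, is_subset, and is_proper_subset are assumptions for this example.

```cpp
#include <algorithm>
#include <vector>

// Sorts the values and drops duplicates so the vector behaves like a set.
template <typename T>
std::vector<T> make_sorted_unique(std::vector<T> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Both vectors must already be sorted (see make_sorted_unique).
template <typename T>
bool is_subset(const std::vector<T>& m, const std::vector<T>& s) {
    return std::includes(s.begin(), s.end(), m.begin(), m.end());
}

// Proper subset: a subset that has strictly fewer elements.
template <typename T>
bool is_proper_subset(const std::vector<T>& m, const std::vector<T>& s) {
    return m.size() < s.size() && is_subset(m, s);
}
```

The same std::includes call works unchanged on std::set iterators, so switching containers later only affects how the data is built.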

You can try something like this. Steps:

  • create an array that contains all objects in both groups
  • convert every s and m into a bit array, where array(i)=1 if the set contains object(i), 0 otherwise
  • m(k) is a subset of s(j) if m(k) AND s(j) = m(k)
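The steps above can be sketched as follows. This assumes string elements and packs the bits into 64-bit words; the names Bits, build_index, to_bits, and is_subset are hypothetical, and every set passed to to_bits must have been included when the index was built.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using Bits = std::vector<std::uint64_t>;  // packed bit array

// Step 1: give every distinct object in any set its own bit position.
std::map<std::string, std::size_t>
build_index(const std::vector<std::set<std::string>>& all_sets) {
    std::map<std::string, std::size_t> index;
    for (const auto& s : all_sets)
        for (const auto& x : s)
            index.emplace(x, index.size());  // no-op if x is already known
    return index;
}

// Step 2: set bit i when the set contains object i.
Bits to_bits(const std::set<std::string>& s,
             const std::map<std::string, std::size_t>& index) {
    Bits bits((index.size() + 63) / 64, 0);
    for (const auto& x : s) {
        const std::size_t pos = index.at(x);  // throws if x was not indexed
        bits[pos / 64] |= std::uint64_t{1} << (pos % 64);
    }
    return bits;
}

// Step 3: m is a subset of s if (m AND s) == m, checked word by word.
// Both bit arrays must come from the same index, so they have equal length.
bool is_subset(const Bits& m, const Bits& s) {
    for (std::size_t w = 0; w < m.size(); ++w)
        if ((m[w] & s[w]) != m[w]) return false;
    return true;
}
```

Each comparison then costs one AND per 64 objects, at the price of memory proportional to the total number of distinct objects for every set, which may be wasteful when the sets are sparse.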
