
Finding set of elements common in set of arrays

Suppose there are several arrays:

A. [1,2,3,4,5,6,7,8,9,10]
B. [2,4,6,8,10]
C. [1,4,7,10]
D. [1,3,5,7,9]
.
.

I need to find every set of elements (1, 2, 3, 4, 5, ...) that is common to at least two arrays (A, B, C, ...) and display the results in the following manner:

(2,4,6,8,10) -> (A,B)
(1,4,7,10) -> (A,C)
(1,3,5,7,9) -> (A,D)
(4,10) -> (A,B,C)
(1,7) -> (A,C,D)

The actual inputs are files containing strings. There could be thousands of files, and each file could contain more than a hundred key strings.

I have tried the following approach: first I generated sets of elements by comparing all possible pairs of arrays. Then I tried to generate further sets using this rule: the intersection of two element sets is common to the union of their array sets. Like this:

(2,4,6,8,10) -> (A,B)
(1,4,7,10) -> (A,C)

From the above we can get:

    intersect((2,4,6,8,10),(1,4,7,10)) -> union((A,B),(A,C))
or, (4,10) -> (A,B,C)
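
The combination rule above can be written as a small sketch. The `combine` helper and the tuple layout (element set, array set) are illustrative assumptions, not part of the original attempt:

```python
# Element sets found for two pairs of arrays (values taken from the question).
common_ab = ({2, 4, 6, 8, 10}, {"A", "B"})
common_ac = ({1, 4, 7, 10}, {"A", "C"})

def combine(left, right):
    """Intersect the element sets and union the array sets (hypothetical helper)."""
    elements = left[0] & right[0]
    arrays = left[1] | right[1]
    return elements, arrays

elements, arrays = combine(common_ab, common_ac)
print(sorted(elements), "->", sorted(arrays))  # [4, 10] -> ['A', 'B', 'C']
```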

Is there any other approach I can try to improve time and memory complexity, given thousands of input files containing hundreds of elements each?

I would use the following approach.

  1. Scan the entire data to obtain a set of the elements which occur in the data.
  2. Maintain a counter for each element; scan the data again and increase the counter for each element if it occurs.
  3. Discard all elements which occur fewer than 2 times.
  4. Generate all possible subsets of the remaining elements. For each subset, scan the data and output each array identifier if any element of the set occurs.
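
A minimal sketch of these steps, assuming the arrays fit in memory as a dict. The extra array `E` is a made-up addition so that step 3 has something to discard, and `arrays_containing` is a hypothetical helper that reads step 4 as "every element of the subset must occur in the array". Steps 1 and 2 are merged into a single pass here. Enumerating all subsets in step 4 grows exponentially with the number of remaining elements, so the sketch only checks one candidate subset at a time.

```python
from collections import Counter

arrays = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [2, 4, 6, 8, 10],
    "C": [1, 4, 7, 10],
    "D": [1, 3, 5, 7, 9],
    "E": [11],  # hypothetical array whose element occurs nowhere else
}

# Steps 1-2: count, for each element, how many arrays contain it.
# set(values) makes sure a duplicate inside one array is counted once.
counts = Counter()
for values in arrays.values():
    counts.update(set(values))

# Step 3: keep only the elements that occur in at least two arrays.
shared = {element for element, n in counts.items() if n >= 2}

def arrays_containing(subset):
    """Step 4 for a single subset: names of arrays holding every element."""
    wanted = set(subset)
    return sorted(name for name, values in arrays.items() if wanted <= set(values))

print(sorted(shared))              # 11 has been discarded
print(arrays_containing({4, 10}))  # ['A', 'B', 'C']
```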

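Use a hash map (or a map, if you need to worry about collisions). Pseudocode below:

# hash_map maps each word to the list of files it appears in
for file in file_list:
    for word in file:
        hash_map[word].append(file)

# pick_uniques removes duplicate file entries for a word
for wordkey in hash_map:
    print(pick_uniques(hash_map[wordkey]))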

This approach has complexity O(total number of words), ignoring the length of each word.
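
A runnable sketch of the same idea, assuming each file has already been read into a list of words. The `build_index` name and the in-memory `files` dict are illustrative assumptions, and using a set per word stands in for `pick_uniques`:

```python
from collections import defaultdict

def build_index(files):
    """Map each word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, words in files.items():
        for word in words:
            index[word].add(name)  # a set drops repeated file names
    return index

# Sample data from the question; real input would be words read from files.
files = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [2, 4, 6, 8, 10],
    "C": [1, 4, 7, 10],
    "D": [1, 3, 5, 7, 9],
}

index = build_index(files)
for word in sorted(index):
    if len(index[word]) >= 2:  # only words shared by at least two files
        print(word, "->", sorted(index[word]))
```

Each word is touched once per occurrence, which matches the linear cost stated above.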

EDIT: Since you also want to combine wordkeys that share the same pick_uniques(hash_map[wordkey]), you can apply the same hash-map method again, this time inverting the keys: use each set of files as the key and collect the words that map to it.
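
A sketch of the inverted grouping, under the same assumption that the files are available as an in-memory dict. The `group_by_files` name is hypothetical; `frozenset` is used because dictionary keys must be hashable:

```python
from collections import defaultdict

def group_by_files(files):
    """Group words by the exact set of files they appear in."""
    # First pass: word -> set of file names.
    word_to_files = defaultdict(set)
    for name, words in files.items():
        for word in words:
            word_to_files[word].add(name)

    # Second pass: invert, using the (frozen) file set as the key.
    files_to_words = defaultdict(set)
    for word, names in word_to_files.items():
        if len(names) >= 2:  # skip words found in a single file
            files_to_words[frozenset(names)].add(word)
    return files_to_words

files = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [2, 4, 6, 8, 10],
    "C": [1, 4, 7, 10],
    "D": [1, 3, 5, 7, 9],
}

groups = group_by_files(files)
for names, words in groups.items():
    print(sorted(words), "=>", sorted(names))
```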

This Java class:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Store {
    // Maps each integer to the set of array keys (e.g. "A", "B") that contain it.
    Map<Integer, Set<String>> int2keyset = new HashMap<>();
    // Every key set produced so far.
    Set<Set<String>> setOfKeyset = new HashSet<>();

    // Registers one array: key is its name, integers are its elements.
    public void enter(String key, Integer[] integers) {
        for (Integer val : integers) {
            Set<String> keySet = int2keyset.get(val);
            // Build a fresh set instead of mutating the stored one,
            // which may also be held in setOfKeyset.
            Set<String> newKeySet = new HashSet<>();
            if (keySet != null) {
                newKeySet.addAll(keySet);
            }
            newKeySet.add(key);
            setOfKeyset.remove(newKeySet);
            setOfKeyset.add(newKeySet);
            int2keyset.put(val, newKeySet);
        }
    }

    // Groups the integers by their key set and prints one line per group.
    public void dump() {
        Map<Set<String>, Set<Integer>> keySet2intSet = new HashMap<>();
        for (Map.Entry<Integer, Set<String>> entry : int2keyset.entrySet()) {
            Integer intval = entry.getKey();
            Set<String> keySet = entry.getValue();
            Set<Integer> intSet = keySet2intSet.get(keySet);
            if (intSet == null) {
                intSet = new HashSet<>();
            }
            intSet.add(intval);
            keySet2intSet.put(keySet, intSet);
        }
        for (Map.Entry<Set<String>, Set<Integer>> entry : keySet2intSet.entrySet()) {
            System.out.println(entry.getValue() + " => " + entry.getKey());
        }
    }
}

when fed with the lines given in the question produces:

[2, 6, 8] => [A, B]
[3, 5, 9] => [A, D]
[4, 10] => [A, B, C]
[1, 7] => [A, C, D]

Although this is not identical to the expected output, it contains all the information needed to produce it and is much more compact: each element appears exactly once, under the exact set of arrays that contain it. If a large number of input lines is expected, it is worth keeping the stored information as compact as possible, and I have tried to follow that guideline.
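
To illustrate how the expected output can be recovered from the compact form, here is a sketch in Python. The `common_to` helper and the dict layout are assumptions made for this example: the elements common to a given set of arrays are the union of every group whose array set includes all of them.

```python
# Compact result as printed above: array set -> elements found in exactly those arrays.
compact = {
    frozenset({"A", "B"}): {2, 6, 8},
    frozenset({"A", "D"}): {3, 5, 9},
    frozenset({"A", "B", "C"}): {4, 10},
    frozenset({"A", "C", "D"}): {1, 7},
}

def common_to(arrays, groups):
    """Collect the elements of every group whose array set covers `arrays`."""
    wanted = frozenset(arrays)
    result = set()
    for names, elements in groups.items():
        if wanted <= names:  # this group's arrays include all wanted arrays
            result |= elements
    return result

print(sorted(common_to({"A", "B"}, compact)))       # [2, 4, 6, 8, 10]
print(sorted(common_to({"A", "B", "C"}, compact)))  # [4, 10]
```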
