
Most efficient way of finding common subexpressions in a list of strings in Java

I have a list of strings that represent package directories. I want to iterate over the list to find the longest leading part that all of the packages share, extract that substring, and subtract it from each string in the original list. That leaves the specific packages, so I can create the appropriate directories.

I was thinking of storing the original list in a static HashSet, then using the retainAll method and storing the result in a new String.

Would something like this be the most performant option, or is there a better way to do it?

Many thanks

This works for me, explanation in comments

import java.util.Arrays;
import java.util.Comparator;

// returns the length of the longest common prefix of all strings in the given array
public static int longestCommonPrefix(String[] strings) {
    // Null or no contents, return 0
    if (strings == null || strings.length == 0) {
        return 0;
        // only 1 element? return its length
    } else if (strings.length == 1 && strings[0] != null) {
        return strings[0].length();
        // more than 1
    } else {
        // copy the array and sort it on the lengths of the strings,
        // shortest one first.
        // this will raise a NullPointerException if an array element is null 
        String[] copy = Arrays.copyOf(strings, strings.length);
        Arrays.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                // ascending by length, so copy[0] is the shortest string
                return Integer.compare(o1.length(), o2.length());
            }
        });
        int result = 0; // init result
        // iterate through every letter of the shortest string
        for (int i = 0; i < copy[0].length(); i++) { 
            // compare the corresponding char of all other strings
            char currentChar = copy[0].charAt(i);
            for (int j = 1; j < strings.length; j++) {
                if (currentChar != copy[j].charAt(i)) { // mismatch
                    return result;
                }
            }
            // all match
            result++;
        }
        // done iterating through shortest string, all matched.
        return result;
    }
}

If changing the original array does not bother you, you can omit the line String[] copy = Arrays.copyOf(strings, strings.length); and sort your strings array directly (use strings wherever the code refers to copy).

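To retrieve the text instead of the length, change the return type to String, return copy[0].substring(0, result); at the mismatch inside the loop, and return copy[0]; at the end of the method. The two early returns change accordingly: an empty string for a null or empty array, and strings[0] for a single element.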
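A minimal sketch of such a text-returning variant follows. The class name PrefixText and the method name longestCommonPrefixText are made up for illustration, and it finds the shortest string with a linear scan instead of sorting, which is a swapped-in simplification rather than the code from this answer. Like the original, it assumes no element is null.

```java
final class PrefixText {
    // Returns the longest common prefix of all strings, or "" if there is none.
    // Hypothetical helper; assumes no array element is null.
    static String longestCommonPrefixText(String[] strings) {
        if (strings == null || strings.length == 0) {
            return "";
        }
        // The common prefix can never be longer than the shortest string.
        String shortest = strings[0];
        for (String s : strings) {
            if (s.length() < shortest.length()) {
                shortest = s;
            }
        }
        for (int i = 0; i < shortest.length(); i++) {
            char current = shortest.charAt(i);
            for (String s : strings) {
                if (s.charAt(i) != current) {
                    // i characters matched, so the prefix is [0, i)
                    return shortest.substring(0, i);
                }
            }
        }
        // every character of the shortest string matched
        return shortest;
    }
}
```

For {"aa.bb.cc.serverside", "aa.bb.cc.client", "aa.bb.dd"} this returns "aa.bb.". Note that a plain character prefix can end in the middle of a package name, so it may need trimming back to the last dot.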

If you are just looking for the single longest common package, I would do the following:

Take the first element of the list as the reference package, then iterate through the rest of the list. For each element, check whether it starts with the reference package. If it does, move on to the next element. If not, trim the reference package by one segment (for example, aa.bb.cc.serverside becomes aa.bb.cc ) and check the current element again. Repeat until the element matches or the reference package is empty, then continue down the list of packages.

This gives you the longest package prefix common to all elements. Then loop through the list again and remove that prefix from every element.

EDIT: Slight modification: keep the . at the end of the reference package so that only complete package names match (otherwise aa.bb.c would also match aa.bb.cc ).
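The approach above could be sketched roughly as follows. The names CommonPackage, commonPackage, dropLastSegment and stripCommonPackage are invented for this example, and it assumes the packages are dot-separated with no null entries. The trailing dot is kept on both the reference and each candidate, as the edit suggests.

```java
import java.util.ArrayList;
import java.util.List;

final class CommonPackage {
    // Returns the longest common package including its trailing '.',
    // e.g. "aa.bb.", or "" if the packages share no leading segment.
    static String commonPackage(List<String> packages) {
        if (packages.isEmpty()) {
            return "";
        }
        String reference = packages.get(0) + ".";
        for (String pkg : packages) {
            String candidate = pkg + ".";
            // trim one segment at a time until the candidate matches
            while (!reference.isEmpty() && !candidate.startsWith(reference)) {
                reference = dropLastSegment(reference);
            }
        }
        return reference;
    }

    // "aa.bb.cc." -> "aa.bb.", "aa." -> ""
    static String dropLastSegment(String reference) {
        int cut = reference.lastIndexOf('.', reference.length() - 2);
        return reference.substring(0, cut + 1);
    }

    // Removes the common package from every element.
    static List<String> stripCommonPackage(List<String> packages) {
        String common = commonPackage(packages);
        List<String> result = new ArrayList<>();
        for (String pkg : packages) {
            // an element equal to the common package itself becomes ""
            result.add(pkg.length() >= common.length()
                    ? pkg.substring(common.length())
                    : "");
        }
        return result;
    }
}
```

For aa.bb.cc.serverside, aa.bb.cc.client and aa.bb.ccc.util, commonPackage returns "aa.bb." and stripCommonPackage returns cc.serverside, cc.client and ccc.util. Without the trailing dot, aa.bb.cc would wrongly count as a prefix of aa.bb.ccc.util.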

Just sort them. The common prefixes will appear first.
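One way to read this suggestion: after a lexicographic sort, the first and last elements differ the most, so their common prefix is the common prefix of the whole list. A minimal sketch under that reading, with the invented names SortedPrefix and commonPrefixBySorting, assuming no null entries:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SortedPrefix {
    // Returns the longest common character prefix of all strings.
    static String commonPrefixBySorting(List<String> strings) {
        if (strings.isEmpty()) {
            return "";
        }
        // sort a copy so the caller's list is left untouched
        List<String> sorted = new ArrayList<>(strings);
        Collections.sort(sorted);
        String first = sorted.get(0);
        String last = sorted.get(sorted.size() - 1);
        // only the two extremes need to be compared
        int i = 0;
        int max = Math.min(first.length(), last.length());
        while (i < max && first.charAt(i) == last.charAt(i)) {
            i++;
        }
        return first.substring(0, i);
    }
}
```

Sorting costs O(n log n) string comparisons, so for a large list a single pass over the elements may be cheaper. The sort is mainly convenient if you want the list ordered anyway.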
