
Java: how to compare two strings in order to obtain the portions where they differ?

I would like to learn a way to obtain the portions where two strings differ.

Suppose I have these two strings:

String s1 = "x4.printString(\"Bianca.()\").y1();";
String s2 = "sb.printString(\"Bianca.()\").length();";

I would like this output: ["x4", "y1", "sb", "length"], returned by a method that receives s1 and s2 as arguments.

I have looked for something like this in other posts, but I have only found links to StringUtils.difference(String first, String second).

But this method returns the remainder of the second string, starting from the index where it begins to differ from the first one.
I really don't know where to start, and any advice would be much appreciated.
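To illustrate why that behaviour does not help here, the following is a minimal plain-Java sketch of what the question describes. The helper name differenceFrom is hypothetical and this is not the Commons Lang source; it only mimics the described result.

```java
class DifferenceSketch {

    // Hypothetical helper: returns the tail of `second`, starting at the
    // first index where it stops matching `first`.
    static String differenceFrom(String first, String second) {
        int limit = Math.min(first.length(), second.length());
        int i = 0;
        while (i < limit && first.charAt(i) == second.charAt(i)) {
            i++;
        }
        return second.substring(i);
    }

    public static void main(String[] args) {
        String s1 = "x4.printString(\"Bianca.()\").y1();";
        String s2 = "sb.printString(\"Bianca.()\").length();";
        // The two strings already differ at index 0, so the whole of s2 is returned.
        System.out.println(differenceFrom(s1, s2));
    }
}
```

Because only the first point of divergence is considered, the common middle part is never separated from the differing ends.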

UPDATE: Following @aUserHimself's advice, I managed to obtain the common subsequences of the two strings, but they come out joined into a single String.

This is my code now:

private static int[][] lcs(String s, String t) {
    int m, n;
    m = s.length();
    n = t.length();
    int[][] table = new int[m+1][n+1];
    for (int i=0; i < m+1; i++)
        for (int j=0; j<n+1; j++)
            table[i][j] = 0;
    for (int i = 1; i < m+1; i++)
        for (int j = 1; j < n+1; j++)
            if (s.charAt(i-1) == t.charAt(j-1))
                table[i][j] = table[i-1][j-1] + 1;
            else
                table[i][j] = Math.max(table[i][j-1], table[i-1][j]);
    return table;
}

private static List<String> backTrackAll(int[][]table, String s, String t, int m, int n){
    List<String> result = new ArrayList<>();
    if (m == 0 || n == 0) {
        result.add("");
        return result;
    }
    else
        if (s.charAt(m-1) == t.charAt(n-1)) {
            for (String sub : backTrackAll(table, s, t, m - 1, n - 1))
                result.add(sub + s.charAt(m - 1));
            return result;
        }
        else {
            if (table[m][n - 1] >= table[m - 1][n])
                result.addAll(backTrackAll(table, s, t, m, n - 1));
            else
                result.addAll(backTrackAll(table, s, t, m - 1, n));
            return result;
        }
}

private List<String> getAllSubsequences(String s, String t){
    return backTrackAll(lcs(s, t), s, t, s.length(), t.length());
}



Calling getAllSubsequences on these two strings:

String s1 = "while (x1 < 5)";
String s2 = "while (j < 5)";

I receive ["while ( < 5)"], not ["while (", " < 5)"] as I would like. I don't understand what I am doing wrong.

Find the longest common part of the two strings (a contiguous substring, since indexOf cannot locate a scattered subsequence). After that, you can use indexOf to get the index of this common string in both strings and fetch the uncommon parts from each.

Example:

CICROSOFK
WOCROSFGT

The common part is:

CROS

The differing parts run from 0 to the index of CROS, and from index + "CROS".length() to str.length().
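The steps above could be sketched roughly as follows. The class and method names are made up for illustration, and the common part is assumed to be a contiguous substring found with a simple dynamic-programming table.

```java
class CommonSubstringSplit {

    // Longest common contiguous substring of a and b, O(a.length * b.length).
    static String longestCommonSubstring(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        int endInA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    // Splits s around the first occurrence of common: {before, after}.
    // Assumes common actually occurs in s.
    static String[] partsAround(String s, String common) {
        int index = s.indexOf(common);
        return new String[] {
            s.substring(0, index),
            s.substring(index + common.length())
        };
    }

    public static void main(String[] args) {
        String first = "CICROSOFK";
        String second = "WOCROSFGT";
        String common = longestCommonSubstring(first, second);
        System.out.println(common);                                        // CROS
        System.out.println(String.join(",", partsAround(first, common)));  // CI,OFK
        System.out.println(String.join(",", partsAround(second, common))); // WO,FGT
    }
}
```

This handles a single common block only; strings with several common blocks would need the step repeated on the remaining parts.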

I already flagged a duplicate question above whose answer uses the Longest Common Subsequence for two strings.

So you can apply it recursively and, on each new recursion, put a placeholder where this LCS was found so that you can mark the parts that differ. In the end, when no more common sequences exist, you will have to split each String by the placeholder to get the required parts.

UPDATE 1: Thinking about it more, this recursive part might not lead to an optimal solution (in terms of total execution time), since you will iterate over the Strings multiple times. But there might be a way to retrieve all sequences in one iteration by reusing (a reduced version of) the memoization table; check this implementation and this more detailed one.
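One way the single-pass idea might look, as a sketch only: build the usual LCS table once, walk it back from the bottom-right corner, and start a new group every time a non-matching step interrupts the matches. The class and method names are hypothetical, and this is not taken from the linked implementations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CommonRunsSketch {

    // Returns the common parts of s and t as separate runs, left to right.
    static List<String> commonRuns(String s, String t) {
        int m = s.length();
        int n = t.length();
        int[][] table = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i][j - 1], table[i - 1][j]);
                }
            }
        }

        List<String> runs = new ArrayList<>();
        StringBuilder run = new StringBuilder(); // current run, built backwards
        int i = m;
        int j = n;
        while (i > 0 && j > 0) {
            if (s.charAt(i - 1) == t.charAt(j - 1)) {
                run.append(s.charAt(i - 1));
                i--;
                j--;
            } else {
                // A mismatch ends the current run of matching characters.
                if (run.length() > 0) {
                    runs.add(run.reverse().toString());
                    run.setLength(0);
                }
                if (table[i][j - 1] >= table[i - 1][j]) {
                    j--;
                } else {
                    i--;
                }
            }
        }
        if (run.length() > 0) {
            runs.add(run.reverse().toString());
        }
        Collections.reverse(runs); // the walk went right to left
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(commonRuns("while (x1 < 5)", "while (j < 5)"));
    }
}
```

For the two while strings from the question this yields the two separate pieces "while (" and " < 5)" instead of one concatenated string.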

UPDATE 2: I have managed to implement the recursive version (not optimal), based on this code:

public class LongestCommonSequence {

    private final char[] firstStr;
    private final char[] secondStr;
    private int[][] LCS;
    private String[][] solution;
    private int max = -1, maxI = -1, maxJ = -1;
    private static final Character SEPARATOR = '|';

    public LongestCommonSequence(char[] firstStr, char[] secondStr) {
        this.firstStr = firstStr;
        this.secondStr = secondStr;
        LCS = new int[firstStr.length + 1][secondStr.length + 1];
        solution = new String[firstStr.length + 1][secondStr.length + 1];
    }

    public String find() {

        for (int i = 0; i <= secondStr.length; i++) {
            LCS[0][i] = 0;
            if(i > 0) {
                solution[0][i] = "   " + secondStr[i - 1];
            }
        }

        for (int i = 0; i <= firstStr.length; i++) {
            LCS[i][0] = 0;
            if(i > 0) {
                solution[i][0] = "   " + firstStr[i - 1];
            }
        }

        solution[0][0] = "NONE";

        for (int i = 1; i <= firstStr.length; i++) {
            for (int j = 1; j <= secondStr.length; j++) {
                if (firstStr[i - 1] == secondStr[j - 1] && firstStr[i - 1] != SEPARATOR) {
                    LCS[i][j] = LCS[i - 1][j - 1] + 1;
                    solution[i][j] = "diag";
                } else {
                    LCS[i][j] = 0;
                    solution[i][j] = "none";
                }
                if(LCS[i][j] > max) {
                    max = LCS[i][j];
                    maxI = i;
                    maxJ = j;
                }
            }
        }

        System.out.println("Path values:");
        for (int i = 0; i <= firstStr.length; i++) {
            for (int j = 0; j <= secondStr.length; j++) {
                System.out.print(" " + LCS[i][j]);
            }
            System.out.println();
        }

        System.out.println();
        System.out.println("Path recovery:");
        for (int i = 0; i <= firstStr.length; i++) {
            for (int j = 0; j <= secondStr.length; j++) {
                System.out.print(" " + solution[i][j]);
            }
            System.out.println();
        }
        System.out.println();
        System.out.println("max:" + max + " maxI:" + maxI + " maxJ:" + maxJ);

        return printSolution(maxI, maxJ);
    }

    public String printSolution(int i, int j) {
        String answer = "";
        while(i - 1 >= 0 && j - 1 >= 0 && LCS[i][j] != 0) {
            answer = firstStr[i - 1] + answer;
            i--;
            j--;
        }
        System.out.println("Current max solution: " + answer);
        return answer;
    }

    public static void main(String[] args) {
        String firstStr = "x4.printString(\"Bianca.()\").y1();";
        String secondStr = "sb.printString(\"Bianca.()\").length();";
        String maxSubstr;
        LongestCommonSequence lcs;
        do {
            lcs = new LongestCommonSequence(firstStr.toCharArray(), secondStr.toCharArray());
            maxSubstr = lcs.find();
            if(maxSubstr.length() != 0) {
                firstStr = firstStr.replace(maxSubstr, "" + LongestCommonSequence.SEPARATOR);
                secondStr = secondStr.replace(maxSubstr, "" + LongestCommonSequence.SEPARATOR);
            }
        }
        while(maxSubstr.length() != 0);

        System.out.println();
        System.out.println("first:" + firstStr + " second: " + secondStr);

        System.out.println("First array: ");
        String[] firstArray = firstStr.split("\\" + SEPARATOR);
        String[] secondArray = secondStr.split("\\" + SEPARATOR);
        for(String s: firstArray) {
            System.out.println(s);
        }
        System.out.println();
        System.out.println("Second array: ");
        for(String s: secondArray) {
            System.out.println(s);
        }
    }
}

My code might not be the most compact, but I've written it this way for clarity:

// Needs: import java.util.ArrayList; import java.util.Collections; import java.util.List;
public static void main(String[] args) {

    String s1 = "x4.printString(\"Bianca.()\").y1();";
    String s2 = "sb.printString(\"Bianca.()\").length();";

    List<String> result = new ArrayList<>();
    result.addAll(getDifferences(s1, s2));
    result.addAll(getDifferences(s2, s1));

    System.out.println(result);
}

public static List<String> getDifferences(String s1, String s2){
    if(s1 == null){
        return Collections.singletonList(s2);
    }
    if(s2 == null){
        return Collections.singletonList(s1);
    }
    int minimalLength = Math.min(s1.length(),s2.length());
    List<String> result = new ArrayList<>();
    StringBuilder buffer = new StringBuilder(); // keep the consecutive differences
    for(int i = 0; i<minimalLength; i++ ){
        char c = s1.charAt(i);
        if(c == s2.charAt(i)){
            if( buffer.length() > 0){
                result.add(buffer.toString());
                buffer = new StringBuilder();
            }
        } else {
            buffer.append(c);
        }
    }
    if(s1.length() > minimalLength){
        buffer.append(s1.substring(minimalLength)); // add the rest
    }
    if(buffer.length() > 0){
        result.add(buffer.toString()); //flush buffer
    }
    return result;
}

However, note that this also returns the non-word characters, as you didn't specify that you wanted to remove them (but they don't appear in your expected output).
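If the non-word characters are unwanted, one possible post-processing step is to split each fragment on the regular expression \W+ and keep the non-empty tokens. This is only a sketch; the class and method names are hypothetical, and the input list is assumed to be what getDifferences produces for the two strings in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordFilter {

    // Keeps only the word-character tokens ([A-Za-z0-9_]) of each fragment.
    static List<String> wordsOnly(List<String> fragments) {
        List<String> words = new ArrayList<>();
        for (String fragment : fragments) {
            for (String token : fragment.split("\\W+")) {
                if (!token.isEmpty()) { // a leading separator yields an empty token
                    words.add(token);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumed input: fragments that still carry punctuation.
        List<String> raw = Arrays.asList("x4", "y1();", "sb", "length();");
        System.out.println(wordsOnly(raw)); // [x4, y1, sb, length]
    }
}
```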

This is the solution I found, thanks to this link posted by @aUserHimself.

private static int[][] lcs(String s, String t) {
    int m, n;
    m = s.length();
    n = t.length();
    int[][] table = new int[m+1][n+1];
    for (int i=0; i < m+1; i++)
        for (int j=0; j<n+1; j++)
            table[i][j] = 0;
    for (int i = 1; i < m+1; i++)
        for (int j = 1; j < n+1; j++)
            if (s.charAt(i-1) == t.charAt(j-1))
                table[i][j] = table[i-1][j-1] + 1;
            else
                table[i][j] = Math.max(table[i][j-1], table[i-1][j]);
    return table;
}

private static List<List<String>> getDiffs(int[][] table, String s, String t, int i, int j,
                                           int indexS, int indexT, List<List<String>> diffs){
    List<String> sList, tList;
    sList = diffs.get(0);
    tList = diffs.get(1);
    if (i > 0 && j > 0 && (s.charAt(i-1) == t.charAt(j-1)))
        return getDiffs(table, s, t, i-1, j-1, indexS, indexT, diffs);
    else if (i > 0 || j > 0) {
        if (i > 0 && (j == 0 || table[i][j-1] < table[i-1][j])){
            if (i == indexS)
                sList.set(sList.size()-1, String.valueOf(s.charAt(i-1)) + sList.get(sList.size() - 1));
            else
                sList.add(String.valueOf(s.charAt(i-1)));
            diffs.set(0, sList);
            return getDiffs(table, s, t, i-1, j, i-1, indexT, diffs);
        }
        else if (j > 0 && (i == 0 || table[i][j-1] >= table[i-1][j])){
            if (j == indexT)
                tList.set(tList.size() - 1, String.valueOf(t.charAt(j-1)) + tList.get(tList.size()-1));
            else
                tList.add(String.valueOf(t.charAt(j-1)));
            diffs.set(1, tList);
            return getDiffs(table, s, t, i, j-1, indexS, j-1, diffs);
        }
    }
    return diffs;
}

private static List<List<String>> getAllDiffs(String s, String t){
    List<List<String>> diffs = new ArrayList<List<String>>();
    List<String> l1, l2;
    l1 = new ArrayList<>();
    l2 = new ArrayList<>();
    diffs.add(l1);
    diffs.add(l2);
    return getDiffs(lcs(s, t), s, t, s.length(), t.length(), 0,  0, diffs);
}

I am posting it because it might be useful to someone.
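For readers who prefer a loop to recursion, the same idea (walk the LCS table backwards and group consecutive unmatched characters) could be sketched as below. This is an illustrative variant under my own naming, not the code above; it also reverses the lists at the end so the fragments come out in left-to-right order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LcsDiffSketch {

    static int[][] lcsTable(String s, String t) {
        int[][] table = new int[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i][j - 1], table[i - 1][j]);
                }
            }
        }
        return table;
    }

    // Returns two lists: fragments found only in s, then fragments found only in t.
    static List<List<String>> diffs(String s, String t) {
        int[][] table = lcsTable(s, t);
        List<String> sOnly = new ArrayList<>();
        List<String> tOnly = new ArrayList<>();
        StringBuilder sRun = new StringBuilder(); // built backwards
        StringBuilder tRun = new StringBuilder(); // built backwards
        int i = s.length();
        int j = t.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && s.charAt(i - 1) == t.charAt(j - 1)) {
                // A common character closes any open difference runs.
                flush(sRun, sOnly);
                flush(tRun, tOnly);
                i--;
                j--;
            } else if (i > 0 && (j == 0 || table[i][j - 1] < table[i - 1][j])) {
                sRun.append(s.charAt(i - 1));
                i--;
            } else {
                tRun.append(t.charAt(j - 1));
                j--;
            }
        }
        flush(sRun, sOnly);
        flush(tRun, tOnly);
        Collections.reverse(sOnly);
        Collections.reverse(tOnly);
        List<List<String>> result = new ArrayList<>();
        result.add(sOnly);
        result.add(tOnly);
        return result;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.reverse().toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        String s1 = "x4.printString(\"Bianca.()\").y1();";
        String s2 = "sb.printString(\"Bianca.()\").length();";
        System.out.println(diffs(s1, s2)); // [[x4, y1], [sb, length]]
    }
}
```

Using a loop avoids deep recursion on long inputs; the table itself still costs O(m*n) time and memory.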

