
Find the difference between two Strings

Suppose I have two long strings that are almost the same.

String a = "this is a example";
String b = "this is a examp";

The code above is just an example. The actual strings are quite long.

The problem is that one string has 2 more characters than the other.

How can I check which two characters those are?

You can use StringUtils.difference(String first, String second) from Apache Commons Lang.

This is how they implemented it:

public static String difference(String str1, String str2) {
    if (str1 == null) {
        return str2;
    }
    if (str2 == null) {
        return str1;
    }
    int at = indexOfDifference(str1, str2);
    if (at == INDEX_NOT_FOUND) {
        return EMPTY;
    }
    return str2.substring(at);
}

public static int indexOfDifference(CharSequence cs1, CharSequence cs2) {
    if (cs1 == cs2) {
        return INDEX_NOT_FOUND;
    }
    if (cs1 == null || cs2 == null) {
        return 0;
    }
    int i;
    for (i = 0; i < cs1.length() && i < cs2.length(); ++i) {
        if (cs1.charAt(i) != cs2.charAt(i)) {
            break;
        }
    }
    if (i < cs2.length() || i < cs1.length()) {
        return i;
    }
    return INDEX_NOT_FOUND;
}

To find the difference between two Strings, you can use the difference method of the StringUtils class. It compares the two Strings and returns the remainder of the second String, starting from where it differs from the first.

 StringUtils.difference(null, null) = null
 StringUtils.difference("", "") = ""
 StringUtils.difference("", "abc") = "abc"
 StringUtils.difference("abc", "") = ""
 StringUtils.difference("abc", "abc") = ""
 StringUtils.difference("ab", "abxyz") = "xyz"
 StringUtils.difference("abcde", "abxyz") = "xyz"
 StringUtils.difference("abcde", "xyz") = "xyz"

See: https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html
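Because the method returns the tail of the second argument, the argument order matters for the strings in the question. The sketch below illustrates this with the JDK only; localDifference is a hypothetical stand-in that mirrors the logic quoted above for non-null inputs, so that the example runs without the Commons Lang dependency.

```java
final class DifferenceOrderDemo {

    /** Hypothetical stand-in mirroring StringUtils.difference for non-null inputs. */
    static String localDifference(String str1, String str2) {
        int i = 0;
        // Advance while both strings still have characters and those characters agree.
        while (i < str1.length() && i < str2.length() && str1.charAt(i) == str2.charAt(i)) {
            i++;
        }
        // The result is always the tail of the SECOND argument (empty if nothing is left).
        return str2.substring(i);
    }

    public static void main(String[] args) {
        String longer = "this is a example";
        String shorter = "this is a examp";
        System.out.println("[" + localDifference(longer, shorter) + "]"); // prints []
        System.out.println("[" + localDifference(shorter, longer) + "]"); // prints [le]
    }
}
```

So, to see the two extra characters, the shorter string would go first and the longer one second.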

Without iterating through the strings you can only know that they are different, not where, and even that only if they are of different length. If you really need to know what the different characters are, you must step through both strings in tandem and compare characters at the corresponding places.
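A minimal sketch of such a tandem walk, assuming (as in the question) that the shorter string is the longer one with a few characters deleted. The class and method names are made up for illustration. When a character is repeated, which of the duplicates gets reported as "extra" is arbitrary.

```java
final class TandemWalk {

    /**
     * Returns the characters of longer that are skipped while matching shorter,
     * assuming shorter is longer with some characters deleted.
     */
    static String extraChars(String longer, String shorter) {
        StringBuilder extra = new StringBuilder();
        int s = 0; // current position in shorter
        for (int l = 0; l < longer.length(); l++) {
            if (s < shorter.length() && longer.charAt(l) == shorter.charAt(s)) {
                s++; // characters agree: advance in both strings
            } else {
                extra.append(longer.charAt(l)); // only longer has this character
            }
        }
        return extra.toString();
    }

    public static void main(String[] args) {
        System.out.println(extraChars("this is a example", "this is a examp")); // prints le
        System.out.println(extraChars("abXcdYe", "abcde"));                     // prints XY
    }
}
```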

The following Java snippet efficiently computes a minimal set of characters that have to be removed from (or added to) the respective strings in order to make the strings equal. It's an example of dynamic programming.

import java.util.HashMap;
import java.util.Map;

public class StringUtils {

    /**
     * Examples
     */
    public static void main(String[] args) {
        System.out.println(diff("this is a example", "this is a examp")); // prints (le,)
        System.out.println(diff("Honda", "Hyundai")); // prints (o,yui)
        System.out.println(diff("Toyota", "Coyote")); // prints (Ta,Ce)
        System.out.println(diff("Flomax", "Volmax")); // prints (Fo,Vo)
    }

    /**
     * Returns a minimal set of characters that have to be removed from (or added to) the respective
     * strings to make the strings equal.
     */
    public static Pair<String> diff(String a, String b) {
        return diffHelper(a, b, new HashMap<>());
    }

    /**
     * Recursively compute a minimal set of characters while remembering already computed substrings.
     * Runs in O(n^2).
     */
    private static Pair<String> diffHelper(String a, String b, Map<Long, Pair<String>> lookup) {
        long key = ((long) a.length()) << 32 | b.length();
        if (!lookup.containsKey(key)) {
            Pair<String> value;
            if (a.isEmpty() || b.isEmpty()) {
                value = new Pair<>(a, b);
            } else if (a.charAt(0) == b.charAt(0)) {
                value = diffHelper(a.substring(1), b.substring(1), lookup);
            } else {
                Pair<String> aa = diffHelper(a.substring(1), b, lookup);
                Pair<String> bb = diffHelper(a, b.substring(1), lookup);
                if (aa.first.length() + aa.second.length() < bb.first.length() + bb.second.length()) {
                    value = new Pair<>(a.charAt(0) + aa.first, aa.second);
                } else {
                    value = new Pair<>(bb.first, b.charAt(0) + bb.second);
                }
            }
            lookup.put(key, value);
        }
        return lookup.get(key);
    }

    public static class Pair<T> {
        public Pair(T first, T second) {
            this.first = first;
            this.second = second;
        }

        public final T first, second;

        public String toString() {
            return "(" + first + "," + second + ")";
        }
    }
}
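Since the actual strings are said to be quite long, deep recursion and repeated substring copies may become a concern. The following is a sketch of a bottom-up alternative based on a longest-common-subsequence table; it is not the code above rewritten line by line, and the class name LcsDiff is made up for illustration. It assumes the n x m table fits in memory.

```java
final class LcsDiff {

    /** Returns { charsOnlyInA, charsOnlyInB }: the characters not on a longest common subsequence. */
    static String[] diff(String a, String b) {
        int n = a.length();
        int m = b.length();
        // lcs[i][j] = length of the longest common subsequence of a.substring(i) and b.substring(j)
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.charAt(i) == b.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk the table from the start, collecting the characters that are dropped.
        StringBuilder onlyA = new StringBuilder();
        StringBuilder onlyB = new StringBuilder();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (a.charAt(i) == b.charAt(j)) {
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                onlyA.append(a.charAt(i++)); // dropping a's character loses nothing
            } else {
                onlyB.append(b.charAt(j++));
            }
        }
        onlyA.append(a, i, n); // whatever is left over belongs to one side only
        onlyB.append(b, j, m);
        return new String[] { onlyA.toString(), onlyB.toString() };
    }

    public static void main(String[] args) {
        String[] d = diff("this is a example", "this is a examp");
        System.out.println("(" + d[0] + "," + d[1] + ")"); // prints (le,)
    }
}
```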

To directly get only the changed section, and not just the end, you can use Google's Diff Match Patch.

List<Diff> diffs = new DiffMatchPatch().diffMain("stringend", "stringdiffend");
for (Diff diff : diffs) {
  if (diff.operation == Operation.INSERT) {
    return diff.text; // Return only single diff, can also find multiple based on use case
  }
}

For Android, add: implementation 'org.bitbucket.cowwoc:diff-match-patch:1.2'

This package is far more powerful than just this feature; it is mainly used for creating diff-related tools.

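// Assumes the shorter string is a prefix of the longer one; returns the extra tail.
String strDiffChop(String s1, String s2) {
    if (s1.length() > s2.length()) {
        return s1.substring(s2.length());
    } else if (s2.length() > s1.length()) {
        return s2.substring(s1.length());
    } else {
        return null; // same length: nothing was chopped off
    }
}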

Google's Diff Match Patch is good, but it was a pain to install into my Java Maven project. Just adding a Maven dependency did not work; Eclipse just created the directory and added the lastUpdated info files. Finally, on the third try, I added the following to my pom:

<dependency>
    <groupId>fun.mike</groupId>
    <artifactId>diff-match-patch</artifactId>
    <version>0.0.2</version>
</dependency>

Then I manually placed the jar and source jar files into my .m2 repo from https://search.maven.org/search?q=g:fun.mike%20AND%20a:diff-match-patch%20AND%20v:0.0.2

After all that, the following code worked:

import java.util.LinkedList;

import fun.mike.dmp.Diff;
import fun.mike.dmp.DiffMatchPatch;

DiffMatchPatch dmp = new DiffMatchPatch();
LinkedList<Diff> diffs = dmp.diff_main("Hello World.", "Goodbye World.");
System.out.println(diffs);

The result:

[Diff(DELETE,"Hell"), Diff(INSERT,"G"), Diff(EQUAL,"o"), Diff(INSERT,"odbye"), Diff(EQUAL," World.")]

Obviously, this was not originally written in (or even fully ported to) Java. (diff_main? I can feel the C burning into my eyes :-) ) Still, it works. And for people working with long and complex strings, it can be a valuable tool.

To find the words that are different in the two lines, one can use the following code.

    String[] strList1 = str1.split(" ");
    String[] strList2 = str2.split(" ");

    List<String> list1 = Arrays.asList(strList1);
    List<String> list2 = Arrays.asList(strList2);

    // Prepare a union
    List<String> union = new ArrayList<>(list1);
    union.addAll(list2);

    // Prepare an intersection
    List<String> intersection = new ArrayList<>(list1);
    intersection.retainAll(list2);

    // Subtract the intersection from the union
    union.removeAll(intersection);

    for (String s : union) {
        System.out.println(s);
    }

In the end, you will have a list of the words that differ between the two lists. One can easily modify it to collect only the differing words of the first list or of the second list, rather than both at once. This can be done by removing the intersection from only list1 or list2 instead of from the union.
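A small sketch of that one-sided variant, using the same split on a single space. The class and method names are made up for illustration. Removing the second list's words directly has the same effect as removing the intersection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OneSidedWordDiff {

    /** Words of first that do not occur anywhere in second (duplicates in first are kept). */
    static List<String> onlyInFirst(String first, String second) {
        List<String> result = new ArrayList<>(Arrays.asList(first.split(" ")));
        result.removeAll(Arrays.asList(second.split(" ")));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(onlyInFirst("the quick brown fox", "the slow brown dog")); // prints [quick, fox]
        System.out.println(onlyInFirst("the slow brown dog", "the quick brown fox")); // prints [slow, dog]
    }
}
```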

Computing the exact location can be done by adding up the lengths of each word in the split list (along with the splitting regex) or by simply doing String.indexOf("subStr").
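The length-summing approach might look like the following sketch. It assumes the line was split on a single space, so every separator is exactly one character wide; the class and method names are made up for illustration. Consecutive spaces produce empty tokens, which still get an offset.

```java
import java.util.ArrayList;
import java.util.List;

final class WordOffsets {

    /** Start index of each token produced by split(" "), in order of appearance. */
    static List<Integer> offsets(String line) {
        List<Integer> result = new ArrayList<>();
        int position = 0;
        for (String word : line.split(" ")) {
            result.add(position);
            position += word.length() + 1; // +1 for the single space that split(" ") consumed
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(offsets("this is a example")); // prints [0, 5, 8, 10]
    }
}
```

The i-th offset can then be paired with the i-th word of the split array to report where a differing word starts.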

On top of using StringUtils.difference(String first, String second) as seen in other answers, you can also use StringUtils.indexOfDifference(String first, String second) to get the index at which the strings start to differ. For example:

StringUtils.indexOfDifference("abc", "dabc") = 0
StringUtils.indexOfDifference("abc", "abcd") = 3

where 0 is used as the starting index.

Another great library for discovering the difference between strings is DiffUtils at https://github.com/java-diff-utils. I used Dmitry Naumenko's fork:

public void testDiffChange() {
    final List<String> changeTestFrom = Arrays.asList("aaa", "bbb", "ccc");
    final List<String> changeTestTo = Arrays.asList("aaa", "zzz", "ccc");
    System.out.println("changeTestFrom=" + changeTestFrom);
    System.out.println("changeTestTo=" + changeTestTo);
    final Patch<String> patch0 = DiffUtils.diff(changeTestFrom, changeTestTo);
    System.out.println("patch=" + Arrays.toString(patch0.getDeltas().toArray()));

    String original = "abcdefghijk";
    String badCopy =  "abmdefghink";
    List<Character> originalList = original
            .chars() // Convert to an IntStream
            .mapToObj(i -> (char) i) // Convert int to char, which gets boxed to Character
            .collect(Collectors.toList()); // Collect in a List<Character>
    List<Character> badCopyList = badCopy.chars().mapToObj(i -> (char) i).collect(Collectors.toList());
    System.out.println("original=" + original);
    System.out.println("badCopy=" + badCopy);
    final Patch<Character> patch = DiffUtils.diff(originalList, badCopyList);
    System.out.println("patch=" + Arrays.toString(patch.getDeltas().toArray()));
}

The results show exactly what changed where (zero-based counting):

changeTestFrom=[aaa, bbb, ccc]
changeTestTo=[aaa, zzz, ccc]
patch=[[ChangeDelta, position: 1, lines: [bbb] to [zzz]]]
original=abcdefghijk
badCopy=abmdefghink
patch=[[ChangeDelta, position: 2, lines: [c] to [m]], [ChangeDelta, position: 9, lines: [j] to [n]]]

For a simple use case like this, you can check the sizes of the strings and use the split function. For your example:

a.split(b)[1]

I think the Levenshtein algorithm and the 3rd-party libraries brought out for this very simple (and perhaps poorly stated?) test case are WAY overblown.
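One caveat with the split approach: String.split interprets its argument as a regular expression, so a shorter string containing characters such as parentheses or dots may not match literally. A hedged sketch that quotes the argument first and guards against a missing match follows; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class SplitDiff {

    /** Returns what follows the first occurrence of shorter in longer, or null if it does not occur. */
    static String after(String longer, String shorter) {
        if (shorter.isEmpty()) {
            return longer; // nothing to split on
        }
        // Pattern.quote makes split treat shorter as literal text; limit 2 keeps the whole tail intact.
        String[] parts = longer.split(Pattern.quote(shorter), 2);
        return parts.length == 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(after("this is a example", "this is a examp")); // prints le
        System.out.println(after("cost (usd) total", "cost (usd)"));       // prints " total" without the quotes
    }
}
```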

Assuming your example does not suggest that the two differing characters are always at the end, I'd suggest the JDK's Arrays.mismatch( char[], char[] ) (available since Java 9) to find the first index where the two strings differ.

    String longer  = "this is a example";
    String shorter = "this is a examp";
    int differencePoint = Arrays.mismatch( longer.toCharArray(), shorter.toCharArray() );
    System.out.println( differencePoint );

You could now repeat the process if you suspect the second character is further along in the String.

Or, if the two characters are together, as you suggest in your example, there is nothing further to do. Your answer would then be:

    System.out.println( longer.charAt( differencePoint ) );
    System.out.println( longer.charAt( differencePoint + 1 ) );
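For the case where the extra characters are not adjacent, repeating the process could be sketched as below, using the range overload of Arrays.mismatch (also Java 9+). It assumes the shorter string is the longer one with some characters deleted; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RepeatedMismatch {

    /** Indices in longer of the characters that are missing from shorter. */
    static List<Integer> extraIndices(String longer, String shorter) {
        char[] l = longer.toCharArray();
        char[] s = shorter.toCharArray();
        List<Integer> indices = new ArrayList<>();
        int li = 0; // position in longer
        int si = 0; // position in shorter
        while (li < l.length) {
            // Relative index of the first mismatch between the remaining ranges, or -1 if they are equal.
            int rel = Arrays.mismatch(l, li, l.length, s, si, s.length);
            if (rel < 0) {
                break;
            }
            li += rel;
            si += rel;
            if (li >= l.length) {
                break; // longer ran out first, so the assumption does not hold
            }
            indices.add(li);
            li++; // skip the extra character in longer only, then compare again
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(extraIndices("this is a example", "this is a examp")); // prints [15, 16]
        System.out.println(extraIndices("abXcdYe", "abcde"));                     // prints [2, 5]
    }
}
```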

If your string contains characters outside of the Basic Multilingual Plane (for example, emoji), then you have to use a different technique. For example:

    String a = "a 🐣 is cuter than a 🐇.";
    String b = "a 🐣 is cuter than a 🐹.";
    int firstDifferentChar      = Arrays.mismatch( a.toCharArray(), b.toCharArray() );
    int firstDifferentCodepoint = Arrays.mismatch( a.codePoints().toArray(), b.codePoints().toArray() );
    System.out.println( firstDifferentChar );       // prints 22!
    System.out.println( firstDifferentCodepoint );  // prints 20, which is correct.
    System.out.println( a.codePoints().toArray()[ firstDifferentCodepoint ] ); // prints out 128007
    System.out.println( new String( Character.toChars( 128007 ) ) ); // this prints the rabbit glyph.

You may try this:

String a = "this is a example";
String b = "this is a examp";

// Removes every occurrence of b from a; what remains is the extra text.
String ans = a.replace(b, "");

System.out.print(ans);
// prints: le

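Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.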
