
How to compare two paragraphs of text?

I need to remove duplicated paragraphs in a text with many paragraphs.

I use the java.security.MessageDigest class to calculate each paragraph's MD5 hash value, and then add these hash values into a Set.

If add() fails (returns false), it means the latest paragraph is a duplicate.

Is there any risk in doing it this way?

Apart from String.equals(), is there any other way to do it?
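For illustration, a minimal sketch of the approach described above, assuming each paragraph arrives as a String; the class name Md5Dedup and the md5Hex helper are placeholders, not part of the original question:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class Md5Dedup {

    // Hex-encoded MD5 digest of a paragraph
    static String md5Hex(String paragraph) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(paragraph.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String[] paragraphs = {"First paragraph.", "Second paragraph.", "First paragraph."};
        Set<String> seenHashes = new HashSet<String>();
        for (String p : paragraphs) {
            // add() returns false when the hash is already present, i.e. the paragraph repeats
            boolean isNew = seenHashes.add(md5Hex(p));
            System.out.println((isNew ? "unique:    " : "duplicate: ") + p);
        }
    }
}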

Before hashing you could normalize the paragraphs, e.g. by removing punctuation, converting to lower case, and collapsing extra whitespace. After normalization, paragraphs that differ only in those respects would get the same hash.
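A minimal sketch of such a normalization step; the exact rules (what counts as punctuation, the locale used for lower-casing, and so on) are assumptions to adapt to your own notion of sameness:

import java.util.Locale;

class ParagraphNormalizer {

    // Lower-case, strip punctuation, collapse runs of whitespace (including line breaks)
    static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}+", "")
                .replaceAll("\\s+", " ")
                .trim();
    }
}

Hashing normalize(p) instead of p then treats such near-identical paragraphs as equal.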

If the MD5 hash is not yet in the set, it means the paragraph is unique. But the opposite is not true: if you find that the hash is already in the set, you could in principle have a non-duplicate paragraph with the same hash value (a collision). This is very unlikely, but you'll have to test that paragraph against the others to be sure. For that, String.equals would do.
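As a sketch of that verify-on-collision step, one could keep the original paragraphs per hash and fall back to String.equals only when a hash is seen again; this reuses the hypothetical md5Hex helper from the sketch above:

import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VerifiedDedup {

    // Returns true if the paragraph is new. On a hash hit, String.equals rules out
    // the (extremely unlikely) case of two different paragraphs sharing an MD5 hash.
    static boolean addIfNew(Map<String, List<String>> byHash, String paragraph)
            throws NoSuchAlgorithmException {
        String hash = Md5Dedup.md5Hex(paragraph);
        List<String> sameHash = byHash.computeIfAbsent(hash, k -> new ArrayList<String>());
        for (String existing : sameHash) {
            if (existing.equals(paragraph)) {
                return false; // genuine duplicate
            }
        }
        sameHash.add(paragraph); // first occurrence, or a collision with different text
        return true;
    }
}

Note that this hash-then-equals pattern is essentially what a HashSet of Strings already does internally, which is what the next answer suggests.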

Moreover, you should consider carefully what you call unique (regarding typos, whitespace, capitalization, and so on), but that would be the case with any method.

There's no need to calculate the MD5 hash; just use a HashSet and try to put the strings themselves into the set. This will use the String#hashCode() method to compute a hash value for the String and check if it's already in the set.

public Set<String> removeDuplicates(String[] paragraphs) {
    // LinkedHashSet drops duplicates while preserving first-seen order
    Set<String> set = new LinkedHashSet<String>();
    for (String p : paragraphs) {
        set.add(p);
    }
    return set;
}

Using a LinkedHashSet even keeps the original order of the paragraphs.
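For illustration, a hypothetical call (assuming the method above is in scope):

String[] paragraphs = {"Alpha", "Beta", "Alpha", "Gamma"};
Set<String> unique = removeDuplicates(paragraphs);
System.out.println(unique); // [Alpha, Beta, Gamma] -- duplicates dropped, first-seen order kept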

As others have suggested, you should be aware that minute differences in punctuation, white space, line breaks etc. may render your hashes different for paragraphs that are essentially the same.

Perhaps you should consider a less brittle metric, such as cosine similarity, which is well suited to matching paragraphs.
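As a rough sketch of what a cosine similarity check over bag-of-words term frequencies could look like (the tokenization rule, and any duplicate threshold such as treating scores above 0.9 as duplicates, are assumptions, not part of the answer above):

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosineSimilarity {

    // Term-frequency vector of a paragraph: word -> count
    static Map<String, Integer> termFrequencies(String paragraph) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String word : paragraph.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors: 1.0 means an identical word profile
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}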

Cheers,

I think this is a good approach. However, there are some things to keep in mind:

  1. Please note that calculating a hash is a heavy operation. This could make your program slow if you have to repeat it for millions of paragraphs.
  2. Even with this approach, you could end up with slightly different paragraphs (with typos, for example) going undetected. If that is a concern, you should normalize the paragraphs before calculating the hash (converting to lower case, removing extra spaces, and so on).
