How to compare two paragraphs of text?
I need to remove duplicated paragraphs in a text with many paragraphs. I use the class java.security.MessageDigest to calculate each paragraph's MD5 hash value, and then add these hash values into a Set.
If add() returns false, it means the latest paragraph is a duplicate.
Is there any risk in this approach?
Apart from String.equals(), is there any other way to do it?
Before hashing you could normalize the paragraphs, e.g. removing punctuation, converting to lower case, and collapsing extra whitespace. After normalizing, paragraphs that differ only in those respects would get the same hash.
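A minimal normalization sketch along those lines. The specific rules here (lower-casing, stripping punctuation, collapsing whitespace) are examples; what should count as "the same" paragraph depends on your data.

```java
import java.util.Locale;

public class Normalize {

    // Canonicalize a paragraph before hashing: lower-case it, strip
    // punctuation, and collapse runs of whitespace into single spaces.
    static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```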
If the MD5 hash is not yet in the set, the paragraph is unique. But the opposite is not true: if you find that the hash is already in the set, you could in principle have a non-duplicate paragraph with the same hash value. This is very unlikely, but to be sure you would have to test that paragraph against the others sharing the hash. For that, String.equals would do.
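A sketch of that hash-then-verify idea. To keep the example self-contained it uses String.hashCode() in place of MD5 (the structure is the same for any hash): paragraphs landing in the same bucket are confirmed with String.equals() before being treated as duplicates, so a hash collision cannot drop a unique paragraph.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VerifiedDedup {

    // Group paragraphs by hash; within a bucket, confirm real duplicates
    // with String.equals() (which List.contains() uses internally).
    static List<String> dedup(List<String> paragraphs) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        List<String> unique = new ArrayList<>();
        for (String p : paragraphs) {
            List<String> bucket =
                    buckets.computeIfAbsent(p.hashCode(), k -> new ArrayList<>());
            if (!bucket.contains(p)) {
                bucket.add(p);
                unique.add(p);
            }
        }
        return unique;
    }
}
```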
Moreover, you should consider carefully what you call unique (regarding typos, whitespace, capitals, and so on), but that would be the case with any method.
There's no need to calculate the MD5 hash; just use a HashSet and try to put the strings themselves into this set. This will use the String#hashCode() method to compute a hash value for each String and check whether it's already in the set.
public Set<String> removeDuplicates(String[] paragraphs) {
    Set<String> set = new LinkedHashSet<String>();
    for (String p : paragraphs) {
        set.add(p);
    }
    return set;
}
Using a LinkedHashSet even keeps the original order of the paragraphs.
As others have suggested, you should be aware that minute differences in punctuation, whitespace, line breaks, etc. may render your hashes different for paragraphs that are essentially the same.
Perhaps you should consider a less brittle metric, such as the cosine similarity, which is well suited for matching paragraphs.
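A rough sketch of cosine similarity over word-count vectors. The whitespace tokenization and any near-duplicate threshold (say, 0.9) are assumptions for illustration, not part of the answer; real implementations usually add smarter tokenization and TF-IDF weighting.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CosineSimilarity {

    // Build a bag-of-words vector: token -> occurrence count.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine of the angle between the two count vectors:
    // 1.0 for identical bags of words, 0.0 for no shared words.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : vb.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // avoid dividing by zero on empty input
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```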
Cheers,
I think this is a good way. However, there are some things to keep in mind: