
Comparing two blocks of text in Python

I have a system where information can come from various sources. I want to make sure I don't add duplicate (or extremely similar) pieces of information. Here is an example:

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:

  1. When adding text to the database, check for existing values in the database
  2. If the values are seen to be very similar, then do not add
  3. If the values are seen to be different enough, then do add

That way we end up with distinct information in the database rather than duplicates, while still allowing a small amount of leeway.

Can anyone tell me how I might attempt this in Python?

Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.

This nifty routine takes two strings and calculates a similarity index in the range [0, 1].

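Quick Demo (the output below is from Python 2, which prints floats to 12 significant digits; Python 3 prints more digits)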

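>>> import difflib, itertools
>>> st = ["One day a man walked over the hill and saw the sun",
...       "One week a woman looked over a hill and saw the sun",
...       "One day a man walked over a hill and saw the sun"]
>>> for a, b in itertools.product(st, st):
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     print("Similarity Index {}".format(difflib.SequenceMatcher(None, a, b).ratio()))
...     print('-' * 80)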


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
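
The three steps from the question could be wired to this ratio roughly as follows. This is only a sketch: the helper names `is_near_duplicate` and `add_if_new`, the in-memory list standing in for the database, and the 0.9 threshold are all assumptions chosen for illustration, not part of the original answer.

```python
import difflib

def is_near_duplicate(candidate, existing_texts, threshold=0.9):
    """Return True if candidate is at least `threshold` similar to any stored text."""
    for text in existing_texts:
        if difflib.SequenceMatcher(None, candidate, text).ratio() >= threshold:
            return True
    return False

def add_if_new(candidate, existing_texts, threshold=0.9):
    """Append candidate unless a near-duplicate is stored; report whether it was added."""
    if is_near_duplicate(candidate, existing_texts, threshold):
        return False
    existing_texts.append(candidate)
    return True

stored = []  # hypothetical stand-in for the database table
text_a = "One day a man walked over the hill and saw the sun"
text_b = "One day a man walked over a hill and saw the sun"
text_c = "One week a woman looked over a hill and saw the sun"

print(add_if_new(text_a, stored))  # True: nothing is stored yet
print(add_if_new(text_b, stored))  # False: ratio with Text A is about 0.96
print(add_if_new(text_c, stored))  # True: ratio with Text A is about 0.83
```

Comparing against every stored text is linear in the size of the database, so for a large table you would probably want to narrow the candidates first. The threshold itself is a judgment call and should be tuned on real data.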

There are a couple of Python libraries that can help you with that. Have a look at this Q: .

The Levenshtein distance is a common algorithm. I found the NYSIIS algorithm very useful, especially if you want to save a string representation in a DB.
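
For reference, the Levenshtein distance can be computed without any third-party library. The following is a plain dynamic-programming sketch, not the code of any particular package; the function names `levenshtein` and `similarity` are made up here, and scaling the distance by the longer length is just one possible way to get a score in [0, 1].

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def similarity(s, t):
    """Scale the distance into [0, 1], where 1.0 means identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest

text_a = "One day a man walked over the hill and saw the sun"
text_b = "One day a man walked over a hill and saw the sun"
print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein(text_a, text_b))       # 3: "the" -> "a" costs two deletions and one substitution
```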

This link will give you an excellent overview:

A primitive way of doing this: iterate through the strings, compare each word with the word at the same position in the other string, and you get a ratio of matches to total words:

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

So in this example, 11 of the 12 words matched. You can then set a pass/fail threshold.
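
Wrapped into a function, the same idea might look like the sketch below. The name `word_match_ratio` and the 0.9 pass level are assumptions for illustration. One deliberate difference from the session above: plain `zip` silently drops the extra words of the longer text, so this version uses `itertools.zip_longest` and counts them as mismatches.

```python
from itertools import zip_longest

def word_match_ratio(first, second):
    """Fraction of word positions at which both texts have the same word."""
    words_a, words_b = first.split(), second.split()
    # zip_longest pads the shorter text with None, so extra words count as mismatches
    pairs = list(zip_longest(words_a, words_b))
    if not pairs:
        return 1.0  # two empty texts are treated as identical
    return sum(a == b for a, b in pairs) / len(pairs)

aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'
print(word_match_ratio(aa, bb) >= 0.9)  # True, since 11 of 12 words match
```

Keep in mind that a single inserted or deleted word shifts every later position, so this positional comparison can score two nearly identical texts very low.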

In Python, or any other language, hashes are the easiest way to remove exact duplicates.

You can maintain a table of already added hashes. When you add another text, just check whether its hash is already present.

Use hashlib for it.

Adding a hashlib usage example:

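import hashlib

# md5.update() requires bytes in Python 3, hence the b"" literals
m1 = hashlib.md5()
m1.update(b" the spammish repetition")
print(m1.hexdigest())

m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())

# same input as m1, so the digest is identical
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())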

Output:

d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
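
The "table of already added hashes" could be sketched like this. The set `seen_digests` and the helpers `digest` and `add_if_unseen` are hypothetical names, and an in-memory set stands in for whatever table the database would provide.

```python
import hashlib

seen_digests = set()  # hypothetical in-memory stand-in for the table of stored hashes

def digest(text):
    """MD5 hex digest of the text, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def add_if_unseen(text):
    """Record the text's hash and return True, or return False if it was seen before."""
    key = digest(text)
    if key in seen_digests:
        return False
    seen_digests.add(key)
    return True

print(add_if_unseen("One day a man walked over the hill and saw the sun"))  # True
print(add_if_unseen("One day a man walked over the hill and saw the sun"))  # False: exact repeat
print(add_if_unseen("One day a man walked over a hill and saw the sun"))    # True: near-duplicate is not caught
```

As the last call shows, a hash lookup only rejects byte-for-byte repeats. Texts that differ by a single word get unrelated digests, so this covers the "exact" case from the question but not the "extremely similar" one.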


 