
Comparing two blocks of text in Python

I have a system where information can come from various sources. I want to make sure I don't add exact (or extremely similar) duplicates of information that is already stored. Here is an example:

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:

  1. When adding Text to database, check for existing values in database
  2. If values are seen to be very similar then do not add
  3. If values are seen to be different enough, then do add

Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.

Can anyone tell me how I might attempt this in Python?

Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.

This nifty routine takes two strings and calculates a similarity index in the range [0, 1].

Quick Demo

>>> import itertools, difflib
>>> st = ['One day a man walked over the hill and saw the sun',
...       'One week a woman looked over a hill and saw the sun',
...       'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
...     print "Text 1 {}".format(a)
...     print "Text 2 {}".format(b)
...     print "Similarity Index {}".format(difflib.SequenceMatcher(None, a, b).ratio())
...     print '-' * 80


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
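Applying this back to the workflow in the question, a minimal sketch might look like the following (the 0.9 threshold and the existing_texts list are assumptions to illustrate the idea, not anything prescribed by difflib):

import difflib

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off; tune it for your data

def is_duplicate(candidate, existing_texts):
    """Return True if the candidate is too similar to any stored text."""
    for text in existing_texts:
        ratio = difflib.SequenceMatcher(None, candidate, text).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            return True
    return False

existing_texts = ['One day a man walked over the hill and saw the sun']

# ratio 0.959 against the stored text -> treated as a duplicate, skipped
is_duplicate('One day a man walked over a hill and saw the sun', existing_texts)
# ratio 0.832 against the stored text -> different enough, can be added
is_duplicate('One week a woman looked over a hill and saw the sun', existing_texts)

This compares the candidate against every stored text, which is fine for small collections but will get slow as the database grows.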

There are a couple of Python libraries that can help you with that; have a look at this related question.

The Levenshtein distance is a common algorithm. I found the NYSIIS algorithm very useful, especially if you want to save a string representation in a DB.

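For reference, here is a minimal pure-Python sketch of the Levenshtein (edit) distance mentioned above; the similarity() normalisation at the end is my own addition for getting a score in [0, 1], not part of any particular library:

def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Turn the edit distance into a rough similarity score in [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / float(longest)

If you would rather not roll your own, libraries such as python-Levenshtein or jellyfish provide faster C-backed implementations (jellyfish also covers NYSIIS); check their docs for the exact API.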

A primitive way of doing this: iterate through both strings in parallel, comparing the word at each position, and compute a ratio of matching words to total words:

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

So in this example you can see that 11 of 12 words matched. You can then set a pass/fail level.
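Wrapped up as a small helper (the name word_match_ratio is just illustrative), dividing by the longer word count so that texts of different lengths are penalised rather than silently truncated by zip:

def word_match_ratio(a, b):
    """Fraction of positions at which the two texts share the same word."""
    words_a, words_b = a.split(), b.split()
    matches = sum(x == y for x, y in zip(words_a, words_b))
    return matches / float(max(len(words_a), len(words_b)) or 1)

# word_match_ratio(aa, bb) -> 0.9166...  (11 of 12 words match)

Keep in mind this only rewards words in the same position, so a single extra word near the start of one text shifts everything after it and drags the score down.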

In Python, or any other language, hashes are the easiest way to remove exact duplicates.

You can maintain a table of already-added hashes; when you add another piece of text, just check whether its hash is already present.

Use hashlib for it.

Adding a hashlib usage example:

import hashlib

# Hashing the same input twice gives the same digest (m1 and m3 below),
# while a different input gives a different digest (m2).
# Python 2 syntax: in Python 3, update() needs bytes and print needs parentheses.
m1 = hashlib.md5()
m1.update(" the spammish repetition")
print m1.hexdigest()

m2 = hashlib.md5()
m2.update(" the spammish")
print m2.hexdigest()

m3 = hashlib.md5()
m3.update(" the spammish repetition")
print m3.hexdigest()

Output:

d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
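Tying this back to the question, a minimal sketch of that table of already-added hashes might look like the following (the seen_hashes set is a stand-in for whatever unique column or lookup table your database uses):

import hashlib

seen_hashes = set()  # stand-in for a unique/indexed hash column in the DB

def add_if_new(text):
    """Record the text only if its exact content has not been seen before."""
    key = hashlib.md5(text.encode('utf-8')).hexdigest()
    if key in seen_hashes:
        return False   # exact duplicate, skip it
    seen_hashes.add(key)
    return True

Note that hashing only catches byte-for-byte duplicates: the "extremely similar" texts in the question would all hash to different values, so this works best combined with one of the similarity measures above.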
