Comparing two blocks of text in Python
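I have a system that can receive information from a variety of sources. I want to make sure I don't add information that is identical (or extremely similar) to what is already stored. Here is an example:
Text A: One day a man walked over the hill and saw the sun
Text B: One day a man walked over a hill and saw the sun
Text C: One week a woman looked over a hill and saw the sun
In this case, I want to get some kind of numerical value for the difference between the blocks of information. From there, I can apply the following logic:
That way we end up with distinct information in the database rather than duplicates, while still leaving a small margin of tolerance.
Can anyone tell me how to approach this in Python?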
Looking at your question, difflib.SequenceMatcher.ratio() may come in handy.
This handy routine takes two strings and computes a similarity index in the range [0, 1].
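>>> import difflib, itertools
>>> # st holds the three sample texts, in the order the output below shows them
>>> st = ['One day a man walked over the hill and saw the sun',
...       'One week a woman looked over a hill and saw the sun',
...       'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     print("Similarity Index {}".format(difflib.SequenceMatcher(None, a, b).ratio()))
...     print('-' * 80)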
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
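To connect the similarity index back to the question, here is a minimal sketch of a threshold check. The helper name `is_near_duplicate`, the in-memory list `stored` (standing in for the database), and the 0.9 cutoff are all assumptions for illustration, not part of the original answer.

```python
import difflib

def is_near_duplicate(candidate, existing_texts, threshold=0.9):
    """Return True if candidate is at least `threshold` similar to any stored text.

    The name and the 0.9 default are illustrative assumptions.
    """
    for text in existing_texts:
        ratio = difflib.SequenceMatcher(None, candidate, text).ratio()
        if ratio >= threshold:
            return True
    return False

# `stored` stands in for whatever the real system keeps in its database.
stored = []
incoming = [
    'One day a man walked over the hill and saw the sun',   # Text A
    'One day a man walked over a hill and saw the sun',     # Text B
    'One week a woman looked over a hill and saw the sun',  # Text C
]
for text in incoming:
    if not is_near_duplicate(text, stored):
        stored.append(text)

print(stored)
```

With the ratios shown above, Text B (about 0.96 against Text A) would be rejected and Text C (about 0.83) would be kept. Note that this compares each new text against every stored one, so it may get slow as the store grows.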
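A rather primitive approach... but you could iterate over the strings, compare each word with the word at the same position in the other string, and then take the ratio of matches to misses: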
>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12
So in this example you can see that 11 of the 12 words match. You can then set a pass/fail threshold.
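The positional comparison above could be wrapped in a small function, roughly like this. The name `word_match_ratio` is hypothetical, and dividing by the longer word count (so that extra trailing words count as misses) is my assumption; the snippet above divides by the length of the zipped list, which is the shorter of the two.

```python
def word_match_ratio(first, second):
    """Fraction of word positions at which both texts have the same word.

    Dividing by the longer word count is an assumption made here so that
    unmatched trailing words lower the score.
    """
    words_a = first.split()
    words_b = second.split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    # zip stops at the shorter list; True counts as 1 when summed
    matches = sum(a == b for a, b in zip(words_a, words_b))
    return matches / longest

aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'
ratio = word_match_ratio(aa, bb)
print(ratio)
print(ratio >= 0.9)  # pass/fail against an example threshold
```

One caveat of comparing by position: a single inserted or deleted word shifts every word after it, so two nearly identical texts can score very low.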
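In Python, or any other language, hashing is the simplest way to remove duplicates.
You can maintain a table of the hashes of everything you have already added. When you add another item, just check whether its hash is already present.
Use hashlib for this.
Adding a hashlib usage example: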
import hashlib
# update() requires bytes in Python 3, hence the b"..." literals
m1 = hashlib.md5()
m1.update(b" the spammish repetition")
print(m1.hexdigest())
m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())
# Same input as m1, so the digest is identical
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())
Output:
d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
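The "table of hashes" idea might look something like the sketch below. The helper `text_fingerprint` and the `seen` / `stored` containers are hypothetical names, and UTF-8 encoding is an assumption about how the text would be turned into bytes.

```python
import hashlib

def text_fingerprint(text):
    """Hex MD5 digest of the text, encoded as UTF-8 (an assumed encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

seen = set()   # digests of everything added so far
stored = []    # stands in for the real data store

for text in [" the spammish repetition", " the spammish", " the spammish repetition"]:
    digest = text_fingerprint(text)
    if digest in seen:
        continue  # exact duplicate, skip it
    seen.add(digest)
    stored.append(text)

print(stored)
```

Keep in mind that this only catches exact duplicates: changing a single character produces a completely different digest, so it would not flag the "extremely similar" texts from the question on its own.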