Comparing two blocks of text in Python

Question

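I have a system that pulls in information from a variety of sources. I want to make sure I don't add exact (or extremely similar) pieces of information. Here is an example:

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

In this case I want to get some sort of numeric value for how much the blocks of information differ. From there I can apply the following logic:

When adding a text to the database, check it against the existing values in the database
If a value is found to be very similar, do not add it
If the values look different enough, add it

That way we end up with distinct information in the database rather than duplicates, while still allowing a small amount of leeway.

Can anyone tell me how I might attempt this in Python?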

Answer 1

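Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.

This nifty routine works on two strings: SequenceMatcher is constructed from them, and ratio() returns a similarity index in the range [0, 1], where 1.0 means the strings are identical.

Quick demo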

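>>> import difflib, itertools
>>> st = ['One day a man walked over the hill and saw the sun',
...       'One week a woman looked over a hill and saw the sun',
...       'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):  # every ordered pair, including self-pairs
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     print("Similarity Index {}".format(difflib.SequenceMatcher(None, a, b).ratio()))
...     print('-' * 80)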


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
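The add-only-if-different logic from the question could be layered on top of ratio() roughly as follows. This is a minimal sketch, not part of the original answer: the 0.9 threshold, the function names is_near_duplicate and add_if_new, and the plain list standing in for the database are all assumptions made for illustration.

```python
import difflib

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off; tune it for your own data


def is_near_duplicate(candidate, existing, threshold=SIMILARITY_THRESHOLD):
    """Return True if candidate is at least `threshold` similar to any stored text."""
    return any(
        difflib.SequenceMatcher(None, candidate, text).ratio() >= threshold
        for text in existing
    )


def add_if_new(candidate, existing, threshold=SIMILARITY_THRESHOLD):
    """Append candidate to existing unless a near-duplicate is already stored."""
    if is_near_duplicate(candidate, existing, threshold):
        return False
    existing.append(candidate)
    return True


stored = []  # hypothetical stand-in for the database table
for text in [
    'One day a man walked over the hill and saw the sun',
    'One day a man walked over a hill and saw the sun',
    'One week a woman looked over a hill and saw the sun',
]:
    print(add_if_new(text, stored), text)
```

With the ratios shown in the demo, the second sentence (about 0.96 against the first) is rejected and the third (about 0.83) is stored. Every insert compares against every stored text, so this only suits small collections as written.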

Answer 2

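There are several Python libraries that can help you. Take a look at this question:

Levenshtein distance is a commonly used algorithm. I have found the NYSIIS algorithm very useful, especially when you want to store a string representation in the database.

This link will give you an excellent overview: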
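For reference, the Levenshtein distance mentioned in this answer counts the single-character insertions, deletions and substitutions needed to turn one string into another. The answer does not name a specific library, so the following is a plain-Python sketch of the standard dynamic-programming approach; the function names levenshtein and similarity, and the scaling to [0, 1], are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


text_a = 'One day a man walked over the hill and saw the sun'
text_b = 'One day a man walked over a hill and saw the sun'
print(levenshtein(text_a, text_b), similarity(text_a, text_b))
```

A dedicated library would normally be faster than this pure-Python loop, which takes time proportional to the product of the two string lengths.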

Answer 3

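A primitive way of doing it... but you could loop over the strings, compare each word with the word at the same position in the other string, and then work out the ratio of matches to words compared: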

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

So in this example you can see that 11 of the 12 words match. You can then set a pass/fail threshold.
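The word-by-word comparison can be wrapped in a function that returns a single ratio. This is a small sketch under assumptions: the name word_match_ratio is made up here, and dividing by the longer word count (so that extra trailing words count as misses, which zip alone would silently drop) is a design choice rather than something the answer specifies.

```python
def word_match_ratio(a, b):
    """Fraction of word positions at which a and b hold the same word."""
    words_a, words_b = a.split(), b.split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    # zip stops at the shorter text, so unmatched trailing words count as misses
    matches = sum(x == y for x, y in zip(words_a, words_b))
    return matches / longest


aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'
print(word_match_ratio(aa, bb))
```

Because the comparison is positional, one inserted or deleted word shifts everything after it and drags the ratio down sharply, which is the main weakness of this approach.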

Answer 4

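In Python, or any other language, hashing is the easiest way to remove duplicates.

You can maintain a table of the hashes you have already added. When you add another entry, just check whether its hash already exists.

Use hashlib for this.

Adding an example of hashlib usage: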

import hashlib

# hashlib works on bytes, hence the b"..." literals
m1 = hashlib.md5()
m1.update(b" the spammish repetition")
print(m1.hexdigest())

m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())

# same input as m1, so the digest is the same
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())

Output

d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
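The "table of hashes" idea could look roughly like this. It is a sketch under assumptions: the set seen_hashes stands in for whatever table the database would hold, and the name add_if_unseen is made up here. Note that a hash only detects exact duplicates; texts that differ by a single character produce unrelated digests, so this does not cover the "extremely similar" case from the question.

```python
import hashlib

seen_hashes = set()  # hypothetical stand-in for a table of stored digests


def add_if_unseen(text, seen=seen_hashes):
    """Record the text's digest and return True, or return False if already seen."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True


for text in [" the spammish repetition", " the spammish", " the spammish repetition"]:
    print(add_if_unseen(text), repr(text))
```

The third call returns False because its text is byte-for-byte identical to the first.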

Answer 1: score 2, accepted, 2013-08-22 11:35:48
Answer 2: score 1, 2013-08-22 11:26:22
Answer 3: score 1, 2013-08-22 11:27:19
Answer 4: score 0, 2013-08-22 11:28:19
