
Python - difference between two strings

I'd like to store a lot of words in a list. Many of these words are very similar. For example, I have the word afrykanerskojęzyczny and many words like afrykanerskojęzycznym, afrykanerskojęzyczni, nieafrykanerskojęzyczni. What is an efficient (fast, and producing a small diff) way to find the difference between two strings and to restore the second string from the first one and the diff?

You can use ndiff in the difflib module to do this. Its output has all the information necessary to convert one string into the other.

A simple example:

import difflib

cases = [('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
         ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
         ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
         ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
         ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
         ('abcdefg', 'xac')]

for a, b in cases:
    print('{} => {}'.format(a, b))
    # ndiff yields one entry per character, prefixed with '  ', '- ' or '+ ';
    # i is the index of that entry in the diff sequence.
    for i, s in enumerate(difflib.ndiff(a, b)):
        if s[0] == ' ':
            continue
        elif s[0] == '-':
            print('Delete "{}" from position {}'.format(s[-1], i))
        elif s[0] == '+':
            print('Add "{}" to position {}'.format(s[-1], i))
    print()

prints:

afrykanerskojęzyczny => afrykanerskojęzycznym
Add "m" to position 20

afrykanerskojęzyczni => nieafrykanerskojęzyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2

afrykanerskojęzycznym => afrykanerskojęzyczny
Delete "m" from position 20

nieafrykanerskojęzyczni => afrykanerskojęzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2

nieafrynerskojęzyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "ę" from position 16

abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7
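Note that the positions printed above are indices into the diff sequence, not into either string. To check the claim that the ndiff output carries enough information to rebuild both strings, a minimal sketch using difflib.restore could look like the following. It assumes the full delta (including the unchanged entries) is kept, so it does not give a small diff; the variable names are only for illustration.

```python
import difflib

a = 'afrykanerskojęzyczny'
b = 'nieafrykanerskojęzyczni'

# Keep the complete delta: one entry per character, e.g. '+ n', '  a', '- y'.
delta = list(difflib.ndiff(a, b))

# restore(delta, 1) yields the entries of the first sequence,
# restore(delta, 2) yields the entries of the second one.
restored_a = ''.join(difflib.restore(delta, 1))
restored_b = ''.join(difflib.restore(delta, 2))

print(restored_a == a, restored_b == b)
```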

I like the ndiff answer, but if you want to reduce it all to a list containing only the changes, you could do something like:

import difflib

case_a = 'afrykbnerskojęzyczny'
case_b = 'afrykanerskojęzycznym'

output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']
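A list that keeps only the changed entries drops their positions, so on its own it cannot rebuild the second string. One possible way to keep a changes-only diff that can still be applied is to store the non-equal opcodes from difflib.SequenceMatcher. This is a sketch under that assumption; make_diff and apply_diff are hypothetical helper names, not part of difflib.

```python
import difflib

def make_diff(a, b):
    """Return a compact list of edits that turns a into b."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            # Replace a[i1:i2] with the text taken from b[j1:j2].
            edits.append((i1, i2, b[j1:j2]))
    return edits

def apply_diff(a, edits):
    """Rebuild the second string from a and the edits from make_diff."""
    parts = []
    pos = 0
    for i1, i2, text in edits:
        parts.append(a[pos:i1])  # unchanged stretch of a
        parts.append(text)       # inserted or replacement text
        pos = i2                 # skip the deleted or replaced stretch
    parts.append(a[pos:])
    return ''.join(parts)

base = 'afrykanerskojęzyczny'
words = ['afrykanerskojęzycznym', 'afrykanerskojęzyczni',
         'nieafrykanerskojęzyczni']
for word in words:
    edits = make_diff(base, word)
    print(word, edits, apply_diff(base, edits) == word)
```

Each edit is a (start, end, text) triple relative to the first string, so only the changed stretches are stored.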

You can look into the regex module (the fuzzy matching section). I don't know if you can get the actual differences, but at least you can specify the allowed number of each type of change, such as insertions, deletions, and substitutions:

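import regex  # third-party module, not the built-in re

sequence = 'afrykanerskojezyczny'
queries = ['afrykanerskojezycznym', 'afrykanerskojezyczni',
           'nieafrykanerskojezyczni']
for q in queries:
    # {e<=2} allows at most two errors (insertions, deletions or substitutions)
    m = regex.search(r'(%s){e<=2}' % q, sequence)
    print('match' if m else 'nomatch')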

What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a Python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj and zlib.decompressobj with the zdict parameter set to your "base word", e.g. afrykanerskojęzyczny.

The caveats are that zdict is only supported in Python 3.3 and higher, and that it's easiest to code if you have the same "base word" for all your diffs, which may or may not be what you want.
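As a rough sketch of the zdict approach, assuming a single shared base word and UTF-8 encoding (compress_with_base and decompress_with_base are hypothetical helper names):

```python
import zlib

# The shared "base word" acts as a preset dictionary for every diff.
base = 'afrykanerskojęzyczny'.encode('utf-8')

def compress_with_base(word):
    """Compress word, letting zlib refer back to the base word."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(word.encode('utf-8')) + comp.flush()

def decompress_with_base(blob):
    """Recover the word; the same base word must be supplied again."""
    decomp = zlib.decompressobj(zdict=base)
    return (decomp.decompress(blob) + decomp.flush()).decode('utf-8')

for word in ['afrykanerskojęzycznym', 'afrykanerskojęzyczni',
             'nieafrykanerskojęzyczni']:
    blob = compress_with_base(word)
    print(word, len(blob), decompress_with_base(blob) == word)
```

For words this short, the zlib header and checksum may outweigh the saving, so it is worth measuring the blob sizes on real data before committing to this format.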

You might find the tools available in the NLTK library useful for calculating the difference between different words.

nltk.metrics.distance.edit_distance() is a mature implementation (from a third-party library, not the standard library) that calculates the Levenshtein distance.

A simple example might be:

from nltk.metrics.distance import edit_distance

w1 = 'wordone'
w2 = 'wordtwo'
edit_distance(w1, w2)

Out: 3

Additional parameters allow the output to be weighted according to the cost of different actions (substitutions/insertions) and of different character differences (e.g. a lower cost for characters that are closer together on the keyboard).
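If installing NLTK is not an option, the same distance can be computed with a short dynamic-programming routine. The following is only an illustrative sketch, not NLTK's implementation; the function name levenshtein and its substitution_cost parameter are names chosen for this example, and insertions and deletions are assumed to cost 1.

```python
def levenshtein(a, b, substitution_cost=1):
    """Edit distance between a and b; insertions and deletions cost 1."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into '' takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,        # delete ca
                current[j - 1] + 1,     # insert cb
                previous[j - 1] + (0 if ca == cb else substitution_cost),
            ))
        previous = current
    return previous[-1]

print(levenshtein('wordone', 'wordtwo'))
```

Raising substitution_cost makes the routine prefer a deletion plus an insertion over a substitution, which is one simple form of the weighting described above.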

The answer to my comment above on the original question makes me think this is all he wants:

word = 'afrykanerskojęzyczny'
wordlist = ['afrykanerskojęzycznym', 'afrykanerskojęzyczni', 'nieafrykanerskojęzyczni']
# Overwrite every entry of wordlist with the original word
for index in range(len(wordlist)):
    wordlist[index] = word

This will do the following:

For every value in wordlist, set that value to the original word.

All you have to do is put this piece of code where you need to change wordlist, making sure that wordlist holds the words you need to change and that the original word is correct.
