
approximate RegEx in python with TRE: strange unicode behavior

I am trying to use the TRE library in Python to match misspelled input. It is important that it handles UTF-8 encoded strings well.

An example: the German capital's name is Berlin, but judging by the pronunciation it would be the same if people wrote "Bärlin".

It works so far, but if a non-ASCII character is in the first or second position of the detected string, neither the range nor the detected string itself is correct.

# -*- coding: utf-8 -*-
import tre

def apro_match(word, lines):
    fz = tre.Fuzzyness(maxerr=3)   # allow up to 3 errors of any kind
    pt = tre.compile(word)
    for i in lines:
        m = pt.search(i, fz)
        if m:
            print m.groups()[0], ' ', m[0]

if __name__ == '__main__':
    string1 = u'Berlín'.encode('utf-8')
    string2 = u'Bärlin'.encode('utf-8')    
    string3 = u'B\xe4rlin'.encode('utf-8')
    string4 = u'Berlän'.encode('utf-8')
    string5 = u'London, Paris, Bärlin'.encode('utf-8')
    string6 = u'äerlin'.encode('utf-8')
    string7 = u'Beälin'.encode('utf-8')

    l = ['Moskau', string1, string2, string3, string4, string5, string6, string7]

    print '\n'*2
    print "apro_match('Berlin', l)"
    print "="*20
    apro_match('Berlin', l)
    print '\n'*2

    print "apro_match('.*Berlin', l)"
    print "="*20
    apro_match('.*Berlin', l)

Output

apro_match('Berlin', l)
====================
(0, 7)   Berlín
(1, 7)   ärlin
(1, 7)   ärlin
(0, 7)   Berlän
(16, 22)   ärlin
(1, 7)   ?erlin
(0, 7)   Beälin



apro_match('.*Berlin', l)
====================
(0, 7)   Berlín
(0, 7)   Bärlin
(0, 7)   Bärlin
(0, 7)   Berlän
(0, 22)   London, Paris, Bärlin
(0, 7)   äerlin
(0, 7)   Beälin

Note that for the regex '.*Berlin' it works fine, while for the regex 'Berlin'

u'Bärlin'.encode('utf-8')    
u'B\xe4rlin'.encode('utf-8')
u'äerlin'.encode('utf-8')

are not working, while

u'Berlín'.encode('utf-8')
u'Berlän'.encode('utf-8')
u'London, Paris, Bärlin'.encode('utf-8')
u'Beälin'.encode('utf-8')

work as expected.

Is there something I am doing wrong with the encoding? Do you know any trick?

You could use the new regex library; it supports Unicode 6.0 and fuzzy matching:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from itertools import ifilter, imap
import regex as re

def apro_match(word_re, lines, fuzzy='e<=1'):
    # append the fuzzy constraint, e.g. (Berlin){e<=1}, to the pattern
    search = re.compile(ur'(' + word_re + '){' + fuzzy + '}').search
    for m in ifilter(None, imap(search, lines)):
        print m.span(), m[0]

def main():
    lst = u'Moskau Berlín Bärlin B\xe4rlin Berlän'.split()
    lst += [u'London, Paris, Bärlin']
    lst += u'äerlin Beälin'.split()
    print
    print "apro_match('Berlin', lst)"
    print "="*25
    apro_match('Berlin', lst)
    print 
    print "apro_match('.*Berlin', lst)"
    print "="*27
    apro_match('.*Berlin', lst)

if __name__ == '__main__':
    main()

'e<=1' means that at most one error of any kind is permitted. There are three types of errors:

  • Insertion, indicated by "i"
  • Deletion, indicated by "d"
  • Substitution, indicated by "s"

Output

apro_match('Berlin', lst)
=========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(15, 21) Bärlin
(0, 6) äerlin
(0, 6) Beälin

apro_match('.*Berlin', lst)
===========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(0, 21) London, Paris, Bärlin
(0, 6) äerlin
(0, 6) Beälin
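The fuzzy syntax can also cap each error type individually, and the match object reports how many errors of each kind occurred. A minimal sketch (the constraint here is chosen purely for illustration):

# -*- coding: utf-8 -*-
import regex

# Allow at most one substitution; error types not listed in the
# constraint (insertions, deletions) are not permitted.
m = regex.search(ur'(?:Berlin){s<=1}', u'Bärlin')
if m:
    # fuzzy_counts is (substitutions, insertions, deletions)
    print m.span(), m[0], m.fuzzy_counts   # (0, 6) Bärlin (1, 0, 0)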

Internally, TRE works at the byte level and returns byte positions. I had the same issue a while ago - there is no trick!

I modified the Python bindings, adding a utf8 function, a function which builds a map from byte position to character position, and a small wrapper. Your test case works as expected when using this wrapper. I have not released the modifications; it was more of a quick hack while testing TRE - if you want them, just let me know.
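For illustration, a minimal sketch of that byte-position-to-character-position idea (this is not the actual patch, just one way such a wrapper could translate TRE's byte spans):

# -*- coding: utf-8 -*-

def byte_to_char_map(utf8_bytes):
    # mapping[i] = index of the character that byte i belongs to;
    # one extra entry covers the end-of-string byte offset.
    text = utf8_bytes.decode('utf-8')
    mapping = []
    for char_pos, ch in enumerate(text):
        mapping.extend([char_pos] * len(ch.encode('utf-8')))
    mapping.append(len(text))
    return mapping

def char_span(utf8_bytes, byte_span):
    # translate a (start, end) byte span from TRE into character positions
    mapping = byte_to_char_map(utf8_bytes)
    return mapping[byte_span[0]], mapping[byte_span[1]]

s = u'Bärlin'.encode('utf-8')   # 7 bytes, 6 characters
print char_span(s, (0, 7))      # -> (0, 6)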

AFAIK TRE hasn't been updated for quite a while, and there are still unfixed bugs in the current release (0.8.0) relating to pattern matching towards the end of a string (e.g. searching "2004 " with the pattern "2004$" gives a cost of 2, while the expected cost is 1).
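A rough repro sketch for that case is below; note that exposing the match cost as m.cost is an assumption about the Python binding and may not hold for your version:

# -*- coding: utf-8 -*-
import tre

# Attempted repro of the end-of-string issue described above.
# NOTE: m.cost is assumed here; verify it against your version
# of the TRE Python bindings.
fz = tre.Fuzzyness(maxerr=3)
pt = tre.compile('2004$')
m = pt.search('2004 ', fz)
if m:
    print m[0], m.cost   # reported cost: 2; expected cost: 1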

As others have pointed out, for Python the new regex module seems quite interesting!

The link that you gave is to a blog article which references another blog article about the most recent release; that one has many grumbly comments, including one suggesting that the package doesn't work with "non-Latin" (whatever that means) encodings. What leads you to believe that TRE works with UTF-8-encoded text (by working at the character level rather than the byte level)?

You don't tell us how many errors (insertion, deletion, substitution) are accepted as a fuzzy match, nor whether it is using the char routines or the wchar routines. Do you really expect potential answerers to download the package and read the code of the Python interface?

One would expect that if there are wchar C++ routines available, a Python interface would include bindings that did Python unicode <-> Python str (encoded in UTF-16LE) <-> C++ wchar -- not so?

Given that "working" matches for 6-character test cases come back with (0, 7), and one not-working case (string 6) is splitting up a two-byte character (prints as a ? because the answer is not valid UTF-8), it seems that it is working in byte (char) encoding-agnostic mode -- not a very good idea at all. 鉴于6个字符的测试用例的“工作”匹配返回(0,7),并且一个不工作的情况(字符串6)正在拆分一个双字节字符(打印为?因为答案无效UTF-8),似乎它在字节(char)编码不可知模式下工作 - 根本不是一个好主意。

Note that if all else fails and all your input data is in German, you could try using latin1 or cp1252 encoding with the byte mode.
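A minimal sketch of that fallback, assuming all the input really is representable in latin1 (with a single-byte encoding, byte positions and character positions coincide):

# -*- coding: utf-8 -*-
import tre

# Fallback sketch: encode everything as latin1 so that one byte is
# one character, then decode matches for display.
fz = tre.Fuzzyness(maxerr=3)
pt = tre.compile('Berlin')
m = pt.search(u'Bärlin'.encode('latin1'), fz)
if m:
    print m.groups()[0], m[0].decode('latin1')   # should give (0, 6) Bärlin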

Some further remarks:

Your string3 is redundant -- it is the same as string2.

Your assertion that string5 "works" seems to be inconsistent with your assertion that string2 and string3 (which string5 contains) do not "work".

Your test coverage is sparse; you need several don't-match cases that are much closer to matching than "Moskau"!

You should ensure that it is "working" with ASCII-only data first; here are some test cases:

Berlxn Berlxyn
Bxrlin Bxyrlin
xerlin xyerlin
Bexlin Bexylin
xBerlin xyBerlin
Bxerlin Bxyerlin
Berlinx Berlinxy
erlin Brlin Berli

Then run it with non-ASCII characters substituted for each of x and y in the above list.
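A small sketch that generates those variants automatically (the substituted characters are arbitrary examples):

# -*- coding: utf-8 -*-
# Substitute non-ASCII characters for the x/y placeholders in the
# ASCII templates above; the replacement characters are arbitrary.
templates = ['Berlxn', 'Berlxyn', 'Bxrlin', 'Bxyrlin', 'xerlin',
             'xyerlin', 'Bexlin', 'Bexylin', 'xBerlin', 'xyBerlin',
             'Bxerlin', 'Bxyerlin', 'Berlinx', 'Berlinxy']
for t in templates:
    print t.replace('x', u'ä'.encode('utf-8')).replace('y', u'ö'.encode('utf-8'))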

Using a pattern like ".*Berlin" is not very useful for diagnostic purposes, especially when you have no meaningful "should not match" test cases.
