
Customized non-ASCII characters flagger

I've looked around for an existing solution, but I couldn't find one that fits the use case I am facing.

Use Case

I'm building a website QA test in which a script goes through a batch of HTML documents and identifies any rogue characters. I cannot simply flag every non-ASCII character, since the HTML documents contain characters such as ">" and other minor characters. Therefore, I am building up a Unicode "rainbow dictionary" that identifies some of the common non-ASCII characters that my team and I frequently see. The following is my Python code.

# -*- coding: utf-8 -*-

import re

unicode_rainbow_dictionary = {
    u'\u00A0':' ',
    u'\uFB01':'fi',
}

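# The first string contains a literal U+00A0 (no-break space) right after "the";
# the last string contains the literal U+FB01 ligature in place of "fi".
strings = ["This contains the annoying non-breaking space","This is fine!","This is not ﬁne!"]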

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex,string)
        if result:
            print "Epic fail! There is a rogue character in '"+string+"'"
        else:
            print string

The issue here is that the last string in the strings array contains a non-ASCII ligature character (the combined fi). When I run this script, it doesn't capture the ligature character, but it does capture the non-breaking space character in the first string.

What is leading to the false positive?

Use Unicode strings for all text, as @jgfoot points out. The easiest way to do this is to use from __future__ import unicode_literals so that string literals default to Unicode. Additionally, using print as a function makes the code Python 2/3 compatible:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals,print_function
import re

unicode_rainbow_dictionary = {
    '\u00A0':' ',
    '\uFB01':'fi',
}

strings = ["This contains the\xa0annoying non-breaking space","This is fine!","This is not \ufb01ne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex,string)
        if result:
            print("Epic fail! There is a rogue character in '"+string+"'")
        else:
            print(string)

If you can, switch to Python 3 as soon as possible! Python 2 is not good at handling Unicode, whereas Python 3 does it natively. In Python 3 a plain membership test is enough:

for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")

I couldn't get the non-breaking space to occur in my test string. I got around that by using "This contains the annoying" + chr(160) + "non-breaking space", after which it matched.
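The membership test above could be extended to report where each rogue character sits and what the dictionary suggests as a replacement. The following is a minimal Python 3 sketch, not the asker's implementation: the names ROGUE_CHARACTERS and find_rogue_characters are hypothetical, and the test strings use explicit escapes so the characters survive copy and paste.

```python
import unicodedata

# Hypothetical mapping from rogue characters to suggested ASCII replacements,
# mirroring the dictionary in the question.
ROGUE_CHARACTERS = {
    "\u00a0": " ",   # NO-BREAK SPACE
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
}


def find_rogue_characters(text, rogue_map=ROGUE_CHARACTERS):
    """Return (index, character, Unicode name, suggestion) for each hit."""
    hits = []
    for index, character in enumerate(text):
        if character in rogue_map:
            name = unicodedata.name(character, "UNKNOWN")
            hits.append((index, character, name, rogue_map[character]))
    return hits


strings = [
    "This contains the\u00a0annoying non-breaking space",
    "This is fine!",
    "This is not \ufb01ne!",
]

for string in strings:
    for index, character, name, suggestion in find_rogue_characters(string):
        print("U+%04X %s at index %d in %r (suggest %r)"
              % (ord(character), name, index, string, suggestion))
```

Looking characters up in a dict avoids regular expressions entirely, which also sidesteps any dictionary key that happens to be a regex metacharacter.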

Your code doesn't work as expected because, in your "strings" variable, you have Unicode characters in non-Unicode (byte) strings. You forgot to put the "u" in front of them to signal that they should be treated as Unicode strings. So, when you search for a Unicode string inside a non-Unicode string, it doesn't work as expected.

If you change this to:

strings = [u"This contains the\u00a0annoying non-breaking space",u"This is fine!",u"This is not \ufb01ne!"]

It works as expected.

Solving Unicode headaches like this is a major benefit of Python 3.
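One plausible reading of why the non-breaking space was caught while the ligature was not, assuming the source file stored the strings as UTF-8 bytes: the UTF-8 encoding of U+00A0 happens to end in the byte 0xA0, whereas none of the bytes of U+FB01 equals a single character U+FB01. The Python 3 sketch below only approximates that byte-per-character comparison by decoding the bytes as Latin-1; it is an illustration, not a reproduction of Python 2's internals.

```python
# Python 3 sketch: look at the UTF-8 bytes behind the two characters.
nbsp = "\u00a0"      # NO-BREAK SPACE
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

nbsp_bytes = nbsp.encode("utf-8")          # b'\xc2\xa0'
ligature_bytes = ligature.encode("utf-8")  # b'\xef\xac\x81'

# Treat each byte as its own character (Latin-1 maps byte N to code point N),
# which approximates comparing a Unicode pattern against undecoded bytes.
nbsp_as_chars = nbsp_bytes.decode("latin-1")
ligature_as_chars = ligature_bytes.decode("latin-1")

print(nbsp in nbsp_as_chars)          # True: the trailing byte 0xA0 matches
print(ligature in ligature_as_chars)  # False: no single byte is U+FB01
```

Under this reading, the match on the first string is an accident of the encoding, which is why decoding everything to Unicode first gives consistent results for both characters.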

Here's an alternative approach to your problem. How about just trying to encode the string as ASCII, and catching the error if it doesn't work?

def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False

strings = [u"This contains the\u00a0annoying non-breaking space",u"This is fine!",u"This is not \ufb01ne!"]

for s in strings:
    print(repr(is_this_ascii(s)))

##False
##True
##False
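The function above relies on Python 2's unicode built-in. A rough Python 3 equivalent might look like the sketch below. The names is_ascii_text and non_ascii_characters are hypothetical, and the allowed parameter is an assumption added here to reflect the asker's point that some non-ASCII characters are acceptable.

```python
def is_ascii_text(s):
    """Return True if the text can be encoded as ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def non_ascii_characters(s, allowed=frozenset()):
    """Return the sorted non-ASCII characters in s that are not in allowed."""
    return sorted({c for c in s if ord(c) > 127 and c not in allowed})


strings = [
    "This contains the\u00a0annoying non-breaking space",
    "This is fine!",
    "This is not \ufb01ne!",
]

for s in strings:
    print(is_ascii_text(s), non_ascii_characters(s))
```

Passing a set such as allowed={"\u00a0"} would suppress reports for characters the team has decided to tolerate.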
