Using accented characters in Python regular expressions

Question

This is my code:

# -*- coding: utf-8 -*-
import json
import re

with open("/Users/paul/Desktop/file.json") as json_file:
    file = json.load(json_file)
print file["desc"]

key="capacità"
result = re.findall("((?:[\S,]+\s+){0,3})"+key+"\s+((?:[\S,]+\s*){0,3})", file["desc"], re.IGNORECASE)
print result

This is the content of the file:

{
    "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
}

My result is []

but what I want is result = "capacità"

Answer 1

You need to treat your string as a Unicode string, like this:

str = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"

As you can see, if you print str.encode('utf-8') you'll get:

Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+
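To see why the distinction matters, here is a small sketch (written for Python 3, where plain string literals are already Unicode; this is my own illustration, not part of the original code). It shows that the accented letter is one code point in the Unicode string but two bytes once encoded as UTF-8, so a byte pattern and a Unicode text will not line up:

```python
text = "capacit\u00e0"            # Unicode string: 8 code points
encoded = text.encode("utf-8")    # byte string: "à" becomes the two bytes 0xC3 0xA0

print(len(text), len(encoded))          # lengths differ by one
print(encoded)                          # raw UTF-8 bytes
print(encoded.decode("utf-8") == text)  # decoding restores the original string
```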

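In the same way, you can make your regex string a Unicode or raw string with the u or r prefix, respectively. In Python 2, key="capacità" is a byte string, while json.load returns Unicode strings, so the pattern never matches the accented character.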
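A minimal sketch of the question's code with that change applied, assuming Python 3 (where "capacità" is already a Unicode string; in Python 2 you would write u"capacità" and use ur-style or u-prefixed pattern pieces). The JSON is inlined instead of read from the file so the example runs on its own, and re.escape and re.UNICODE are my additions:

```python
import json
import re

# Same JSON content as in the question, inlined so the example is self-contained.
raw = '{"desc": "Frigocongelatore, capacit\\u00e0 di 215 litri, h 122 cm, classe A+"}'
data = json.loads(raw)

key = "capacità"  # Unicode string in Python 3; use u"capacità" in Python 2

# Raw strings keep \S and \s intact; re.escape keeps the key literal.
pattern = r"((?:[\S,]+\s+){0,3})" + re.escape(key) + r"\s+((?:[\S,]+\s*){0,3})"

# Each match is a tuple: (up to 3 words before the key, up to 3 words after it).
result = re.findall(pattern, data["desc"], re.IGNORECASE | re.UNICODE)
print(result)
```

Note that the pattern captures only the words around the key, not the key itself. If you want "capacità" in the result, wrap the key in its own group.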

Answer 2

You can use this function to display different encodings.

The default encoding in your editor should be UTF-8. Check your settings with sys.getdefaultencoding().

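import re

def find_context(word_, n_before, n_after, string_):
    # Finds the word plus n_before words before it and n_after words after it.
    # Raw strings keep \w and \W from being read as string escapes.
    b = r'\w+\W+' * n_before
    a = r'\W+\w+' * n_after
    # re.escape treats any regex metacharacters in word_ literally.
    pattern = '(' + b + re.escape(word_) + a + ')'
    # Raises AttributeError if the word is not found (re.search returns None).
    return re.search(pattern, string_, re.UNICODE).group(1)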

s = "Frigocongelatore,  capacità di 215 litri, h 122 cm, classe A+"

# find 0 words before and 3 after the word capacità
print(find_context('capacità',0,3,s) )

capacità di 215 litri

print(find_context(' capacit\u00e0',0,3,s) )

 capacità di 215 litri
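If the word might be missing from the text, re.search returns None and the function above fails with an AttributeError. Below is a hedged variant that returns None instead and also ignores case. The name find_context_safe is hypothetical, chosen only for this sketch, and it assumes Python 3:

```python
import re

def find_context_safe(word, n_before, n_after, text):
    # Hypothetical variant of find_context: returns None when nothing matches.
    before = r"\w+\W+" * n_before
    after = r"\W+\w+" * n_after
    pattern = "(" + before + re.escape(word) + after + ")"
    # re.IGNORECASE on a Unicode pattern also matches "À" against "à".
    match = re.search(pattern, text, re.UNICODE | re.IGNORECASE)
    return match.group(1) if match else None

s = "Frigocongelatore,  capacità di 215 litri, h 122 cm, classe A+"

print(find_context_safe("capacità", 1, 2, s))  # one word before, two after
print(find_context_safe("CAPACITÀ", 0, 1, s))  # upper-case search term
print(find_context_safe("volume", 0, 1, s))    # word not present
```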

2 answers

Answer 1
1 2015-10-05 22:54:21

Answer 2
0 2015-10-05 22:55:58
