Accented characters in a regex with Python
This is my code:
# -*- coding: utf-8 -*-
import json
import re
with open("/Users/paul/Desktop/file.json") as json_file:
    file = json.load(json_file)
print file["desc"]
key="capacità"
result = re.findall("((?:[\S,]+\s+){0,3})"+key+"\s+((?:[\S,]+\s*){0,3})", file["desc"], re.IGNORECASE)
print result
This is the content of the file:
{
"desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
}
My result is [], but what I want is result = "capacità".
You need to treat your string as a Unicode string, like this:
str = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
And as you can see, if you run
print str.encode('utf-8')
you'll get:
Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+
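As a rough illustration of the point above, here is a minimal sketch in Python 3 syntax (an assumption, since the question's code is Python 2). It only shows that the \u00e0 escape and the literal à are the same character, and what the UTF-8 bytes look like. The variable names are made up for the example.

```python
# Python 3: every str is already a Unicode string, so no u prefix is needed.
text = "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
same = "Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+"

# The escape sequence and the literal character compare equal.
print(text == same)

# Encoding turns the single character à into the two bytes 0xC3 0xA0.
encoded = text.encode("utf-8")
print(encoded)

# Decoding with the same codec gives the original text back.
decoded = encoded.decode("utf-8")
```

The two-byte form is the likely reason a byte-string pattern fails against decoded JSON text in Python 2: the pattern holds two bytes where the text holds one character.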
In the same way, you can make your regex string a Unicode string or a raw string with the u or r prefix, respectively.
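To make that concrete, here is a hedged sketch of the question's findall call with Unicode and raw strings, written for Python 3 (an assumption). The JSON is inlined instead of read from the Desktop path so the snippet runs on its own, and re.escape is my addition, not part of the original code.

```python
import json
import re

# Inline copy of the JSON from the question, so no file on disk is needed.
raw = r'{"desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"}'
data = json.loads(raw)  # json decodes \u00e0 into the character à

key = "capacit\u00e0"  # same text as "capacità"; a Unicode str in Python 3

# Raw strings keep \S and \s intact; re.escape guards against regex
# metacharacters in the key (an addition, not in the original code).
pattern = r"((?:[\S,]+\s+){0,3})" + re.escape(key) + r"\s+((?:[\S,]+\s*){0,3})"

# Up to three words before the key and up to three after it.
result = re.findall(pattern, data["desc"], re.IGNORECASE | re.UNICODE)
print(result)
```

On Python 2 the same idea would need u"..." for the key and ur"..." or u"..." for the pattern pieces, so that both the pattern and the text are Unicode.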
You can use this function to display different encodings.
The default encoding in your editor should be UTF-8. You can check the interpreter's default string encoding with
sys.getdefaultencoding()
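A small sketch of that check, assuming Python 3. It also looks at sys.stdout.encoding, which is my addition: it is the codec print() uses, and it can differ from the interpreter default.

```python
import sys

# Interpreter default used for implicit str <-> bytes conversion.
default_encoding = sys.getdefaultencoding()

# Encoding print() writes with; may be missing if stdout was replaced.
stdout_encoding = getattr(sys.stdout, "encoding", None)
print(default_encoding, stdout_encoding)

# Round-trip the accented word through the default encoding.
word = "capacit\u00e0"
roundtrip = word.encode(default_encoding).decode(default_encoding)
print(roundtrip == word)
```

Neither value describes the editor. The encoding the source file is saved in is set in the editor itself and, in Python 2, declared with the coding line at the top of the file.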
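import re

def find_context(word_, n_before, n_after, string_):
    # Find the word plus n_before words before it and n_after words after it.
    b = r'\w+\W+' * n_before
    a = r'\W+\w+' * n_after
    # re.escape treats the word as literal text rather than as a regex.
    pattern = '(' + b + re.escape(word_) + a + ')'
    match = re.search(pattern, string_)
    # Return None when nothing matches instead of raising AttributeError.
    return match.group(1) if match else None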
s = "Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+"
# find 0 words before and 3 after the word capacità
print(find_context('capacità',0,3,s) )
capacità di 215 litri
print(find_context('capacit\u00e0',0,3,s) )
capacità di 215 litri