Python find-replace non-latin word in string with regex
I am trying to do this:
val = re.sub(r'\\b' + u_word +'\\b', unicode(new_word), u_text)
(All strings are non-latin.)
It does not work at all!
Is it possible to find-replace non-latin words (whole words) in a non-latin text with regex? How?
EDIT:
If you want to test, try these strings:
>>> u_word = u'αβ'
>>> u_text = u'αβγ αβ αβγδ δαβ'
>>> new_word = u'χχ'
>>> val = re.sub(r'\b' + u_word +r'\b', unicode(new_word), u_text)
>>> val
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>> u_text
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>>
You need to pass the re.UNICODE flag to sub, like so:
val = re.sub(r'\b' + u_word + r'\b', unicode(new_word), u_text, flags=re.UNICODE)
\b is a word boundary. Without the re.UNICODE flag, a "word" contains only characters from the set [a-zA-Z0-9_], so αβ isn't seen as a "word". For more information see the re documentation (specifically \b, \w, and re.UNICODE).
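As a minimal sketch, the call above can be wrapped in a small function. The name replace_whole_word is hypothetical, and re.escape is an addition that is not in the original; it assumes the word may contain regex metacharacters. The u'' literals are valid on both Python 2.7 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
import re


def replace_whole_word(word, replacement, text):
    """Replace whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # re.escape guards against regex metacharacters in `word` (an assumption;
    # the original concatenates the word unescaped).
    pattern = r'\b' + re.escape(word) + r'\b'
    # re.UNICODE makes \b and \w treat non-ASCII letters as word characters.
    return re.sub(pattern, replacement, text, flags=re.UNICODE)


u_text = u'αβγ αβ αβγδ δαβ'
val = replace_whole_word(u'αβ', u'χχ', u_text)
# Only the standalone αβ is replaced; αβγ, αβγδ and δαβ are left alone.
```

Note that the replacement string is still processed by re.sub, so backslashes in it would be interpreted as escapes.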
FYI: If new_word is already a unicode string (as in your example), unicode(new_word) is superfluous; it returns new_word unmodified. Your code would work as-is in Python 3.x (minus unicode(), which was removed because it's no longer necessary).
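To illustrate the Python 3 point, here is a sketch that assumes Python 3, where str patterns are Unicode-aware by default. The re.ASCII flag (Python 3 only) restricts \w and \b to ASCII, which reproduces the behaviour seen in the question. The names pattern and ascii_val are illustrative.

```python
import re

u_word = 'αβ'
u_text = 'αβγ αβ αβγδ δαβ'
new_word = 'χχ'

pattern = r'\b' + u_word + r'\b'

# Python 3: no flag needed, \b already respects Unicode word characters.
val = re.sub(pattern, new_word, u_text)
print(val)  # αβγ χχ αβγδ δαβ

# re.ASCII limits \w (and therefore \b) to [a-zA-Z0-9_], so nothing matches
# and the text comes back unchanged.
ascii_val = re.sub(pattern, new_word, u_text, flags=re.ASCII)
print(ascii_val == u_text)  # True
```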