Python: replace non-Latin words in a string using a regular expression

Question

I am trying to do this:

val = re.sub(r'\b' + u_word + r'\b', unicode(new_word), u_text)

(All strings are non-Latin.)

It does not work at all!

Is it possible to find and replace whole non-Latin words in a non-Latin text with a regex? If so, how?

EDIT:

If you want to test, try these strings:

>>> u_word = u'αβ'
>>> u_text = u'αβγ αβ αβγδ δαβ'
>>> new_word = u'χχ'
>>> val = re.sub(r'\b' + u_word +r'\b', unicode(new_word), u_text)
>>> val
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>> u_text
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>>

Answer 1

You need to pass the re.UNICODE flag to sub, like so:

val = re.sub(r'\b' + u_word + r'\b', unicode(new_word), u_text, flags=re.UNICODE)

\b is a word boundary. Without the re.UNICODE flag, a "word" contains only characters from the set [a-zA-Z0-9_], so αβ isn't seen as a "word" and no boundary is found around it. For more information see the re documentation (specifically \b, \w, and re.UNICODE).
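To make the effect of the flag concrete, here is a minimal sketch written for Python 3, where str patterns are Unicode-aware by default. It assumes that passing re.ASCII is a fair stand-in for the Python 2 behaviour without re.UNICODE, since it restricts \w (and therefore \b) to [a-zA-Z0-9_]. The re.escape call and the variable names ascii_result and unicode_result are additions for illustration, not part of the original code.

```python
import re

u_word = 'αβ'
u_text = 'αβγ αβ αβγδ δαβ'
new_word = 'χχ'

# re.escape guards against regex metacharacters in the search word.
pattern = r'\b' + re.escape(u_word) + r'\b'

# ASCII-only word characters: Greek letters are not "word" characters,
# so \b never matches next to them and nothing is replaced.
ascii_result = re.sub(pattern, new_word, u_text, flags=re.ASCII)

# Unicode-aware matching: only the standalone word is replaced.
unicode_result = re.sub(pattern, new_word, u_text)

print(ascii_result)
print(unicode_result)
```

The first print shows the text unchanged, which is the symptom from the question; the second shows only the standalone αβ replaced.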

FYI:

If new_word is already a unicode string (as in your example), unicode(new_word) is superfluous; it returns new_word unmodified.
In Python 3.x, unicode is no longer a special case: str is Unicode text and str patterns are Unicode-aware by default. Your code would have worked as is in Python 3.x (minus unicode(), which was removed because it is no longer necessary).
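For Python 3, the whole-word replacement could be wrapped in a small helper along these lines. This is only a sketch: replace_whole_word is a hypothetical name, and the use of re.escape and a function replacement are defensive choices not present in the original answer. It also assumes the search word starts and ends with a word character, since \b only matches next to one.

```python
import re

def replace_whole_word(text, word, replacement):
    """Replace whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # Escape the word so characters such as '.' are matched literally.
    pattern = r'\b' + re.escape(word) + r'\b'
    # A function replacement inserts `replacement` literally, so backslashes
    # or group references inside it are not interpreted by re.sub.
    return re.sub(pattern, lambda match: replacement, text)

val = replace_whole_word('αβγ αβ αβγδ δαβ', 'αβ', 'χχ')
print(val)
```

No flags argument is needed here because Unicode matching is the default for str patterns in Python 3.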

Answer 1
1, accepted, 2012-12-06 21:07:01