How to include accented words in a regex

Question

I have a UTF-8 text with capitalized words within the text:

La cinta, que hoy se estrena en nuestro país, competirá contra Hors la
Loi, de Argelia, Dogtooth, de Grecia, Incendies, de Canadá, Life above
all , de Sudáfrica, y con la ganadora del Globo de Oro, In A Better
World, de Dinamarca.

The desired output replaces every word that starts with a capital letter with a placeholder (i.e. #NE#), except for the first word. So the desired output looks like this:

La cinta, que hoy se estrena en nuestro país, competirá contra  #NE#
la  #NE# , de #NE# ,  #NE# , de  #NE# ,  #NE# , de  #NE#,  #NE# above
all , de #NE# , y con la ganadora del  #NE# de  #NE# ,  #NE# A #NE# #NE# , de  #NE# .

I've tried using regex as follows:

>>> import re
>>> def blind_CAPS_without_first_word(text):
...     first_word, _, the_rest = text.partition(' ')
...     blinded = re.sub('(?:[A-Z][\w]+\s*)', ' #NE# ', the_rest)
...     return " ".join([first_word, blinded])
... 
>>> text = "La cinta, que hoy se estrena en nuestro país, competirá contra Hors la Loi, de Argelia, Dogtooth, de Grecia, Incendies, de Canadá, Life above all , de Sudáfrica, y con la ganadora del Globo de Oro, In A Better World, de Dinamarca."
>>> blind_CAPS_without_first_word(text)

[out]:

La cinta, que hoy se estrena en nuestro país, competirá contra #NE# la #NE# , de #NE# , #NE# , de #NE# , #NE# , de #NE# á, #NE# above all , de #NE# áfrica, y con la ganadora del #NE# de #NE# , #NE# A #NE# #NE# , de #NE# .

But the regex didn't consider accented characters when using \w, e.g. Canadá -> #NE# á; Sudáfrica -> #NE# áfrica. How do I get around this? How do I include accented words in my regex? It needs to be Canadá -> #NE#; Sudáfrica -> #NE#.

I guess it's okay if single-character words are ignored, so that A remains as A, unless there's a way around this.

Answer 1

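This happens because \w+ (or [\w]+) doesn't match accented characters unless the pattern is Unicode-aware (in Python 2 that requires the re.UNICODE flag and a unicode string). The match stops at the first accented letter, so the rest of the word is left behind.

You may use \S+ instead of \w+: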

re.sub(r'[A-Z]\S+\s*', ' #NE# ', the_rest)

OR

Use the regex module if you only want to match word characters of any language:

regex.sub(r'[A-Z]\p{L}+\s*', ' #NE# ', the_rest)
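
As a rough sketch of a third option (not one of the two calls above), the standard re module may be enough on Python 3: there, \w in a str pattern is Unicode-aware by default, so it also covers letters such as á. The function name, the placeholder parameter and the shortened sample sentence are illustrative assumptions. Unlike the \S+ version, this variant leaves trailing punctuation and whitespace untouched.

```python
import re

def blind_caps_without_first_word(text, placeholder="#NE#"):
    # Keep the first word untouched, as in the question.
    first_word, _, the_rest = text.partition(" ")
    # Assumes Python 3 and a str input: \w is then Unicode-aware, so it
    # matches accented letters too. \w+ requires at least one more
    # character, so a single-letter word such as "A" is left alone.
    blinded = re.sub(r"\b[A-Z]\w+", placeholder, the_rest)
    return " ".join([first_word, blinded])

text = "La cinta competirá contra Hors la Loi, de Canadá, y Sudáfrica, In A Better World."
print(blind_caps_without_first_word(text))
```

Note that [A-Z] still only covers unaccented capitals, so a word starting with an accented capital letter would not be replaced by this sketch.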

Answer 2

Any chance you could use Unicode escape notation to capture ranges of characters? For example, [\xC0-\xE1] or something similar. I ran it through Pythex and it didn't seem to mind... you'll need to find your own range, but it's a start :)

Hope this helps.
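
To make the range idea concrete, here is a minimal sketch assuming Python 3 and that the accented letters of interest all sit in the Latin-1 Supplement block. The names LETTER and CAPITALIZED and the exact ranges are illustrative choices, not a recommendation. The ranges skip U+00D7 (×) and U+00F7 (÷), which fall inside that block but are not letters.

```python
import re

# Hypothetical character class: ASCII letters plus the accented letters
# U+00C0-U+00FF, leaving out U+00D7 (multiplication sign) and
# U+00F7 (division sign).
LETTER = r"a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

# A capital ASCII letter followed by one or more letters from the class.
CAPITALIZED = re.compile(r"[A-Z][" + LETTER + r"]+")

print(CAPITALIZED.sub("#NE#", "de Canadá, de Sudáfrica"))
```

Text in other scripts or with letters outside this block would need additional ranges.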

Answer 1: score 4, accepted, 2015-06-21 02:55:49

Answer 2: score 0, 2015-06-21 03:59:48
