RegEx for matching all non-words except punctuation?
For sentences like:
sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"
I would like to get:
"This is a sample sentence. And another one. Moreover, it is filtered."
Thus, I thought re.sub would be the way to go. However, my RegEx doesn't work as expected (as is pretty much always the case ^^).
My idea was to use \W to match every non-word character and then exclude [.,;!?] to keep the punctuation. The last RegEx I tried was:
re.sub(r"(\W[^\.\,\;\?\!])", "", sent)
Unfortunately, [^\.\,\;\?\!] matches any single character that is not one of [.,;!?], instead of simply saying 'do not match these characters literally'.
How can I exclude these characters from the match?
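To make the failure concrete, here is a small sketch that runs the attempted pattern against the sample sentence. It assumes Python's built-in re module; the names attempt and desired are only illustrative and not part of the original post. Because the group consumes two characters at a time (one non-word character plus whatever follows it, unless that is one of the listed punctuation marks), it deletes letters as well, so the result cannot equal the desired string.

```python
import re

sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"
desired = "This is a sample sentence. And another one. Moreover, it is filtered."

# \W consumes one non-word character, then [^\.\,\;\?\!] consumes ONE MORE
# character (anything that is not listed punctuation), so pairs such as
# " i" or "$s" are removed together.
attempt = re.sub(r"(\W[^\.\,\;\?\!])", "", sent)

print(repr(attempt))
print(attempt == desired)  # False
```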
The \W needs to be integrated into the negated character class. \W is the same as [^\w], so you'll end up with [^\w.,;!?]. You should repeat this character class to match contiguous occurrences in a single step: [^\w.,;!?]+
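A minimal sketch of the combined class, assuming Python's re module and the sent string from the question (the name cleaned is only illustrative). Note that this class on its own also strips spaces and newlines, since neither is a word character or one of the listed punctuation marks.

```python
import re

sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"

# [^\w.,;!?]+ matches one or more consecutive characters that are neither
# word characters nor one of . , ; ! ?
cleaned = re.sub(r"[^\w.,;!?]+", "", sent)

print(cleaned)
# Thisisasamplesentence.Andanotherone.Moreover,itisfiltered!
```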
It seems you also want to keep spaces, so you should add them to your character class.
Reading deeper into your question, you also want to replace newlines with a space and '!' with '.'. This makes it a multi-step solution. First, filter out anything unwanted with [^\w.,;!? \n]+, then, in a next step, replace \n with a space and '!' with '.'.
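The steps above could be sketched as follows, assuming Python's re module for the filtering step and str.replace for the two literal substitutions. The function name clean_sentence is hypothetical and only used for illustration.

```python
import re


def clean_sentence(text: str) -> str:
    # Step 1: drop everything except word characters, the punctuation
    # to keep, spaces, and newlines.
    text = re.sub(r"[^\w.,;!? \n]+", "", text)
    # Step 2: turn newlines into spaces.
    text = text.replace("\n", " ")
    # Step 3: turn exclamation marks into periods.
    return text.replace("!", ".")


sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"
print(clean_sentence(sent))
# This is a sample sentence. And another one. Moreover, it is filtered.
```

Plain str.replace is enough for the last two steps because they are fixed single-character substitutions and need no pattern matching.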