RegEx for matching all non-words except punctuation?

For sentences like:

sent = ("This i$s a s[[]ample sentence.\nAnd another <<one>>."
        "\nMoreover, it is 'filtered'!")

I would like to get:

"This is a sample sentence. And another one. Moreover, it is filtered."

Thus, I thought using re.sub should be the way to go. However, RegEx doesn't work as expected (like it pretty much always does^^).

My idea was to use \W to match every non-word character and then exclude [.,;!?] to keep the punctuation. The last RegEx I've tried was:

re.sub(r"(\W[^\.\,\;\?\!])", "", sent)

Unfortunately, [^\.\,\;\?\!] matches any character that is not one of [.,;!?], instead of simply saying 'do not match these characters literally'.

How can I exclude these characters from the match?

The \W needs to be integrated into the negated character class. \W is the same as [^\w], so you'll end up with [^\w.,;!?]. You should repeat this character class to match contiguous occurrences in a single step: [^\w.,;!?]+
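A minimal sketch of the difference, using a short made-up input (`sample` is an assumption for illustration, not the sentence from the question). The two-part pattern consumes a non-word character plus the character after it, while the merged class matches only the unwanted characters themselves:

```python
import re

sample = "a$b c."  # hypothetical short input, chosen to keep the matches easy to read

# Two-part attempt: one non-word character followed by one character
# that is not punctuation. Each match is two characters long, so the
# letter after "$" would be deleted along with it.
two_chars = re.findall(r"\W[^.,;!?]", sample)

# Merged negated class: runs of characters that are neither word
# characters nor one of the punctuation marks to keep.
one_class = re.findall(r"[^\w.,;!?]+", sample)

# Removing those runs keeps letters and punctuation, but also drops the space.
cleaned = re.sub(r"[^\w.,;!?]+", "", sample)

print(two_chars)  # ['$b', ' c']
print(one_class)  # ['$', ' ']
print(cleaned)    # abc.
```

Note that the space disappears as well, since a space is neither a word character nor in the list of kept punctuation.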

It seems you also want to keep spaces, so you should add them to your character class.

Reading deeper into your question, you also want to replace newlines with a space and ! with a period. This makes it a multi-step solution. First filter out anything unwanted with [^\w.,;!? \n]+, then in a next step replace \n with a space and ! with a period.
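The steps above could be sketched as follows. The helper name `clean_sentence` is hypothetical, and the input assumes the question's string has a single \n between sentences:

```python
import re

def clean_sentence(text):
    """Hypothetical helper: filter unwanted characters, then normalize."""
    # Step 1: remove runs of characters that are not word characters,
    # kept punctuation, spaces, or newlines.
    filtered = re.sub(r"[^\w.,;!? \n]+", "", text)
    # Step 2: turn newlines into spaces and exclamation marks into periods.
    return filtered.replace("\n", " ").replace("!", ".")

sent = ("This i$s a s[[]ample sentence.\nAnd another <<one>>."
        "\nMoreover, it is 'filtered'!")

result = clean_sentence(sent)
print(result)
# This is a sample sentence. And another one. Moreover, it is filtered.
```

Plain str.replace is used for the second step because both replacements are literal single characters, so no regular expression is needed there.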
