Replace punctuation with spaces, except intra-word dashes

Question

There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:

import re
my_string = 'compactified on a calabi-yau threefold @ ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)

gives 'compactified on a calab yau threefold @ ,.'

So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove the punctuation.

Expected result (string without any punctuation except the intra-word dash): 'compactified on a calabi-yau threefold'

Answer 1

R uses the TRE (POSIX) or PCRE regex engine, depending on the perl option (or the function used). Python's re library implements a modified, more limited Perl-like flavor. It does not support POSIX character classes such as [:alnum:], which matches alpha (letters) and num (digits).

In Python, [:alnum:] can be replaced with [^\W_] (or the ASCII-only [a-zA-Z0-9]), and the negated [^[:alnum:]] with [\W_] (ASCII-only version: [^a-zA-Z0-9]).
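A minimal sketch of these replacements, assuming Python 3 str patterns (where \w is Unicode-aware). The sample text and variable names are made up for illustration:

```python
import re

text = "yau_3 @ é"  # illustrative sample, not from the question

# [^\W_]: word characters minus the underscore, i.e. letters and digits
alnum = re.findall(r"[^\W_]", text)

# [\W_]: everything that is not a letter or digit (underscore included)
non_alnum = re.findall(r"[\W_]", text)

# ASCII-only variant: the accented letter is no longer matched
ascii_alnum = re.findall(r"[a-zA-Z0-9]", text)

print(alnum)        # ['y', 'a', 'u', '3', 'é']
print(non_alnum)    # ['_', ' ', '@', ' ']
print(ascii_alnum)  # ['y', 'a', 'u', '3']
```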

In R, [^[:alnum:]['-] matches any single character other than an alphanumeric (letter or digit), [, ', or -. It therefore leaves every dash and apostrophe in place, not only the intra-word ones, which means the R question you refer to does not provide a correct answer.
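To see why that character class is not enough, here is a rough ASCII-only Python equivalent of the R pattern. The pattern translation and the sample string are assumptions for illustration, not code from the R answer:

```python
import re

# Assumed ASCII-only equivalent of the R class [^[:alnum:]['-]
r_like = re.compile(r"[^a-zA-Z0-9\['-]")

sample = "No - d'Ante, calabi-yau @"  # illustrative sample
result = r_like.sub(" ", sample)

# The standalone dash after "No" survives, because - is excluded
# from the class regardless of where it appears.
print(repr(result))  # "No - d'Ante  calabi-yau  "
```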

You can use the following solution:

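import re
# Group 1 captures a - or ' that has a word character on both sides
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No -  d'Ante compactified on a calabi-yau threefold @ ,."
# Keep the captured intra-word character; replace anything else with a space
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)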

The (\b[-']\b)|[\W_] regex matches and captures an intra-word - or ' (one with a word boundary on each side). The replacement function checks whether the capture group matched and, if so, re-inserts it with m.group(1); everything else (all non-word characters and underscores) is replaced with a space.

If you want to replace each sequence of non-word characters with a single space, use

p = re.compile(r"(\b[-']\b)|[\W_]+")
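Putting this together for the string from the question, a small sketch could look like the following. The function name strip_punctuation and the final .strip() call are additions for illustration; the trailing punctuation collapses to one space, which .strip() then removes to reach the expected result:

```python
import re

# Same pattern as above, with + so runs of punctuation become one space
PATTERN = re.compile(r"(\b[-']\b)|[\W_]+")

def strip_punctuation(text):
    """Hypothetical helper: keep intra-word - and ', blank out the rest."""
    # m.group(1) is None when the second alternative matched
    cleaned = PATTERN.sub(lambda m: m.group(1) or " ", text)
    return cleaned.strip()

my_string = 'compactified on a calabi-yau threefold @ ,.'
print(strip_punctuation(my_string))
# compactified on a calabi-yau threefold
```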
