Replacing punctuation except intra-word dashes with a space
There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold @ ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold @ ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove the punctuation.
Expected result (the string without any punctuation except the intra-word dash): 'compactified on a calabi-yau threefold'
R uses the TRE (POSIX) or the PCRE regex engine, depending on the perl option (or on the function used). Python's re library uses its own Perl-like flavor with a smaller feature set. In particular, Python does not support POSIX character classes such as [:alnum:], which matches alpha (letters) and num (digits).
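To see why Python drops the i in calabi-yau, it can help to spell out how re appears to read the R-style pattern. The following is a minimal sketch, assuming the pattern is parsed as a negated set of the literal characters [ : a l n u m followed by a second set ['-]. The name explicit is only for illustration.

```python
import re

my_string = 'compactified on a calabi-yau threefold @ ,.'

# Assumed reading of "[^[:alnum:]['-]" by Python's re:
#   [^\[:alnum]  one character that is NOT one of  [ : a l n u m
#   ['-]         followed by an apostrophe or a dash
# The "[" is escaped here only to make that reading explicit.
explicit = re.compile(r"[^\[:alnum]['-]")

# The only match is "i-" in "calabi-yau", so both characters
# are replaced by a single space and the punctuation is untouched.
print(explicit.sub(" ", my_string))
# compactified on a calab yau threefold @ ,.
```

This reproduces the output shown in the question, which suggests the POSIX class is never interpreted as "alphanumeric" at all.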
In Python, [:alnum:] can be replaced with [^\W_] (or, for ASCII only, [a-zA-Z0-9]), and the negated [^[:alnum:]] with [\W_] (or, for ASCII only, [^a-zA-Z0-9]).
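As a quick illustration of these substitutes, here is a small sketch. The sample strings are made up for this example, and it assumes Python 3 str patterns, where \w is Unicode-aware by default.

```python
import re

s = "calabi_yau-3fold @ ,."

# [^\W_] stands in for [:alnum:]: word characters minus the underscore.
alnum_runs = re.findall(r"[^\W_]+", s)
print(alnum_runs)  # ['calabi', 'yau', '3fold']

# [\W_] stands in for [^[:alnum:]]: every non-alphanumeric character.
non_alnum_replaced = re.sub(r"[\W_]", " ", s)
print(repr(non_alnum_replaced))

# The ASCII-only class splits on accented letters, the Unicode-aware one does not.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"[^\W_]+", accented)
ascii_words = re.findall(r"[a-zA-Z0-9]+", accented)
print(unicode_words, ascii_words)
```

Which variant is appropriate depends on whether non-ASCII letters should count as part of a word.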
Read as a POSIX bracket expression, [^[:alnum:]['-] matches any single character other than an alphanumeric (letter or digit), [, ', or -. It therefore keeps every dash and apostrophe, not only the intra-word ones, which means the R question you refer to does not provide a correct answer.
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No - d'Ante compactified on a calabi-yau threefold @ ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures an intra-word - or ' (one with a word character on both sides) in group 1. In the re.sub callback we check whether the capture group matched and, if so, re-insert it with m.group(1). Everything else that matches (all non-word characters and underscores) is replaced with a space.
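To make the behavior of the callback easier to check, the same pattern can be wrapped in a small helper. This is only a sketch: the function name replace_punct is hypothetical and not part of the answer above.

```python
import re

# Group 1 captures a dash or apostrophe with word characters on both sides;
# the second alternative matches any other non-word character or underscore.
PATTERN = re.compile(r"(\b[-']\b)|[\W_]")

def replace_punct(text):
    """Replace punctuation with spaces, keeping intra-word - and '."""
    # Keep the captured intra-word character, otherwise emit one space.
    return PATTERN.sub(lambda m: m.group(1) if m.group(1) else " ", text)

result = replace_punct("No - d'Ante compactified on a calabi-yau threefold @ ,.")
print(repr(result))
```

Note that each replaced character becomes its own space, so the standalone dash in "No - d'Ante" leaves three spaces and the trailing " @ ,." leaves five.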
If you want to replace each sequence of non-word characters with a single space, use
p = re.compile(r"(\b[-']\b)|[\W_]+")
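Applied to the string from the question, the + variant might be used as in the sketch below. The helper name collapse_punct and the final strip() call are assumptions added here to reach the expected result exactly. They are not part of the original answer.

```python
import re

# Same idea as before, but "+" lets one match swallow a whole run of
# non-word characters, so the run is replaced by a single space.
PATTERN = re.compile(r"(\b[-']\b)|[\W_]+")

def collapse_punct(text):
    """Collapse runs of punctuation/whitespace to one space, keep intra-word - and '."""
    replaced = PATTERN.sub(lambda m: m.group(1) if m.group(1) else " ", text)
    # strip() drops the single space left by trailing punctuation.
    return replaced.strip()

my_string = 'compactified on a calabi-yau threefold @ ,.'
cleaned = collapse_punct(my_string)
print(cleaned)
# compactified on a calabi-yau threefold
```

Without strip(), the result would end in one trailing space, since the final " @ ,." run is itself replaced by a space.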