Replacing punctuation except intra-word dashes with a space

There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:

import re
my_string = 'compactified on a calabi-yau threefold @ ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)

gives 'compactified on a calab yau threefold @ ,.'

So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove the punctuation.

Expected result (the string without any punctuation except the intra-word dash): 'compactified on a calabi-yau threefold'

R uses the TRE (POSIX) or PCRE regex engine, depending on the perl option (or the function used). Python's re library implements a Perl-like flavor with a smaller feature set, and it does not support POSIX character classes such as [:alnum:], which matches letters (alpha) and digits (num). Python instead reads [^[:alnum:] as an ordinary negated set (any character other than [, :, a, l, n, u, m) followed by a second set ['-], so the pattern matched the two characters i- and replaced them with one space.
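To see this behavior directly, here is a minimal sketch that reruns the pattern from the question. The variable name broken is only illustrative, and the warning filter assumes a recent Python 3, which may emit a FutureWarning about a possible nested set for this pattern.

```python
import re
import warnings

my_string = "compactified on a calabi-yau threefold @ ,."

# Python parses "[^[:alnum:]" as one negated set and "['-]" as a second set,
# so the pattern needs two characters: one outside the first set, then ' or -.
with warnings.catch_warnings():
    # Silence the "possible nested set" FutureWarning, if the interpreter emits it.
    warnings.simplefilter("ignore", FutureWarning)
    broken = re.sub("[^[:alnum:]['-]", " ", my_string)

print(repr(broken))
```

Only the pair i- satisfies that two-character pattern, which is why the punctuation at the end of the string is left alone.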

In Python, [:alnum:] can be replaced with [^\W_] (or the ASCII-only [a-zA-Z0-9]), and the negated [^[:alnum:]] with [\W_] (ASCII-only version: [^a-zA-Z0-9]).
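As a quick check of these equivalents, the sketch below runs both sets over a made-up sample string (sample and the result names are assumptions for illustration) and compares the Unicode-aware set with the ASCII-only one on an accented letter.

```python
import re

sample = "calabi_yau-3fold!"

# [^\W_] : word characters minus the underscore, i.e. letters and digits.
alnum = re.findall(r"[^\W_]", sample)
# [\W_] : everything else, i.e. non-word characters plus the underscore.
non_alnum = re.findall(r"[\W_]", sample)

# For str patterns \W is Unicode-aware, so an accented letter counts as a letter,
# while the explicit ASCII ranges do not match it.
unicode_hits = re.findall(r"[^\W_]", "\u00e9")
ascii_hits = re.findall(r"[a-zA-Z0-9]", "\u00e9")

print("".join(alnum), non_alnum, unicode_hits, ascii_hits)
```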

In R, [^[:alnum:]['-] matches any single character other than an alphanumeric (letter or digit), [, ', or -. It therefore keeps every dash and apostrophe, not only the intra-word ones, which means the R question you refer to does not provide a correct answer.
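A rough Python translation of that R class shows the problem. This is a sketch: the ASCII-only set below is my rendering of the R pattern, and the sample string is invented for the demonstration.

```python
import re

sample = "No - d'Ante calabi-yau @ ,."

# ASCII-only translation of R's [^[:alnum:]['-]:
# replace anything that is not a letter, digit, "[", "'" or "-".
r_style = re.sub(r"[^a-zA-Z0-9\['-]", " ", sample)

print(r_style.split())
```

The standalone dash after "No" survives as its own token, so the class alone cannot tell intra-word dashes from free-standing ones.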

You can use the following solution:

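import re

# Group 1 captures a dash or apostrophe with a word character on both sides;
# the second alternative matches any other non-word character or underscore.
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No -  d'Ante compactified on a calabi-yau threefold @ ,."
# Keep what group 1 captured; replace everything else with a space.
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)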

The (\b[-']\b)|[\W_] regex matches and captures intra-word - and ', and the callback passed to sub restores them: if the capture group matched, it is re-inserted with m.group(1). Everything else (all non-word characters and underscores) is replaced with a space.
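To make the role of the capture group visible, the sketch below lists what group 1 captured on the same test string. The helper name keep_intraword and the list kept are illustrative and not part of the original answer.

```python
import re

p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No -  d'Ante compactified on a calabi-yau threefold @ ,."

def keep_intraword(match):
    # group(1) is None unless the first alternative (intra-word - or ') matched.
    return match.group(1) or " "

# Characters that the callback will put back unchanged.
kept = [m.group(1) for m in p.finditer(test_str) if m.group(1)]
result = p.sub(keep_intraword, test_str)

print(kept)
print(result.split())
```

Each replaced character becomes exactly one space, so the result has the same length as the input and keeps the original runs of whitespace.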

If you want to replace each sequence of non-word characters with a single space, use

p = re.compile(r"(\b[-']\b)|[\W_]+")
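Applied to the string from the question, this variant can be sketched as follows. The final strip() call is my addition, assumed here only to drop the trailing space and reach the expected result exactly.

```python
import re

p = re.compile(r"(\b[-']\b)|[\W_]+")
my_string = "compactified on a calabi-yau threefold @ ,."

# Each run of non-word characters collapses to one space; intra-word - and ' are kept.
collapsed = p.sub(lambda m: m.group(1) or " ", my_string)
# strip() removes the single space left where " @ ,." used to be.
cleaned = collapsed.strip()

print(repr(cleaned))
```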

