R regex remove apostrophes NOT between letters

I'm able to remove all punctuation from a string while keeping apostrophes, but I'm now stuck on how to remove any apostrophes that are not between two letters.

str1 <- "I don't know 'how' to remove these ' things"

Should look like this:

"I don't know how to remove these things"

You may use a regex approach:

str1 <- "I don't know 'how' to remove these ' things"
gsub("\\s*'\\B|\\B'\\s*", "", str1)

See this IDEONE demo and a regex demo.

The regex matches:

  • \\s*'\\B - 0+ whitespaces, ' and a non-word boundary
  • | - or
  • \\B'\\s* - a non-word boundary, ' and 0+ whitespaces
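To see what each alternative contributes, it can help to run the two halves of the pattern separately on str1. This is only an illustrative sketch, not part of the original answer; the variable names right_only, left_only and both are made up for the example.

```r
str1 <- "I don't know 'how' to remove these ' things"

# First alternative only: an apostrophe followed by a non-word boundary,
# plus any whitespace in front of it (closing quotes and standalone ').
right_only <- gsub("\\s*'\\B", "", str1)
# "I don't know 'how to remove these things"

# Second alternative only: an apostrophe preceded by a non-word boundary,
# plus any whitespace after it (opening quotes and standalone ').
left_only <- gsub("\\B'\\s*", "", str1)
# "I don't know how' to remove these things"

# Both alternatives together, as in the answer.
both <- gsub("\\s*'\\B|\\B'\\s*", "", str1)
# "I don't know how to remove these things"
```

The apostrophe in don't survives every call because it has a word character on both sides, so neither side is a non-word boundary.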

If you do not need to care about the extra whitespace that can remain after removing a standalone ', you can use a PCRE regex like

\b'\b(*SKIP)(*F)|'

See the regex demo.

Explanation:

  • \\b'\\b - match a ' in-between word characters
  • (*SKIP)(*F) - and omit the match
  • | - or match...
  • ' - an apostrophe in another context.

See an IDEONE demo:

gsub("\\b'\\b(*SKIP)(*F)|'", "", str1, perl=TRUE)
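As mentioned, this version leaves the surrounding whitespace alone, so a standalone apostrophe leaves two spaces behind. A minimal sketch of an optional clean-up pass follows; the second gsub call and the names stripped and squeezed are additions for illustration, not part of the answer.

```r
str1 <- "I don't know 'how' to remove these ' things"

# Drop every apostrophe that is not flanked by word characters on both sides.
stripped <- gsub("\\b'\\b(*SKIP)(*F)|'", "", str1, perl = TRUE)
# "I don't know how to remove these  things"  (note the double space)

# Optional second pass: collapse runs of spaces and trim the ends.
squeezed <- trimws(gsub(" {2,}", " ", stripped))
# "I don't know how to remove these things"
```

Keep in mind that \b is defined in terms of word characters, so an apostrophe between digits or underscores is kept as well, not only one between letters.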

To account for apostrophes in-between Unicode letters, add (*UTF)(*UCP) flags at the start of the pattern and use a perl=TRUE argument:

gsub("(*UTF)(*UCP)\\s*'\\B|\\B'\\s*", "", str1, perl=TRUE)
      ^^^^^^^^^^^^                              ^^^^^^^^^     

Or

gsub("(*UTF)(*UCP)\\b'\\b(*SKIP)(*F)|'", "", str1, perl=TRUE) 
      ^^^^^^^^^^^^                                 

See another IDEONE demo.
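A small sketch of the Unicode case, assuming R is running in a UTF-8 locale with a PCRE build that supports Unicode properties. The sample string x is invented for the example and is written with \u escapes so the source stays ASCII.

```r
# The string is "l'ete 'tres' bien" with accented letters (e-acute, e-grave).
x <- "l'\u00e9t\u00e9 'tr\u00e8s' bien"

# (*UCP) makes \b treat accented letters as word characters, so the
# apostrophe between "l" and the accented "e" counts as being inside a word.
res <- gsub("(*UTF)(*UCP)\\b'\\b(*SKIP)(*F)|'", "", x, perl = TRUE)

# Expected under these assumptions: the quotes around the second word are
# removed and the apostrophe after "l" is kept.
print(res)
```

Without (*UCP), PCRE treats only ASCII letters, digits and underscore as word characters, so an apostrophe next to an accented letter may be removed.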

This method using gsub works:

gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", str1)

"I don't know  how to remove these   things"

It would require a second round to remove extra spaces. So

gsub("  +", " ", gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", str1))

  • [^A-Za-z] matches any non-alphabetical character
  • | is an or statement
  • () capture matched sub-expressions
  • \\2 is called a back reference and returns the second captured sub-expression
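The two rounds can be wrapped in a small helper to make the behaviour easier to probe. This is a sketch only: the function name strip_apostrophes and the second test string are hypothetical and not from the answer.

```r
# Hypothetical helper combining both rounds from the answer above.
strip_apostrophes <- function(x) {
  # Round 1: replace a stray apostrophe and the non-letter next to it
  # with the captured leading character (if any) plus a space.
  x <- gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", x)
  # Round 2: collapse runs of two or more spaces.
  gsub("  +", " ", x)
}

strip_apostrophes("I don't know 'how' to remove these ' things")
# "I don't know how to remove these things"

# An apostrophe at the very start of the string has no neighbouring
# non-letter character to match, so this pattern leaves it in place.
strip_apostrophes("'hello' world")
# "'hello world"
```

Because the pattern consumes a neighbouring character, it needs one to exist; an apostrophe that opens or closes the whole string next to a letter is not touched.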

Here's one approach using lookarounds in base:

gsub("(?<![a-zA-Z])(')|(')(?![a-zA-Z])", "", str1, perl=TRUE)
## [1] "I don't know how to remove these  things"
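Lookarounds are zero-width, so they inspect the neighbouring character without consuming it and also succeed at the edges of the string. The sketch below uses the same idea with the unused capture groups dropped; the test string x is made up for illustration.

```r
x <- "'hello' world isn't 'done'"

# Remove an apostrophe when it is not preceded by an ASCII letter,
# or when it is not followed by one.
no_stray <- gsub("(?<![a-zA-Z])'|'(?![a-zA-Z])", "", x, perl = TRUE)
# "hello world isn't done"
```

Only the apostrophe itself is matched, so no neighbouring characters are lost and the apostrophes at the start and end of the string are handled too.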

