R regex: remove apostrophes that are not between letters

Question

I'm able to remove all punctuation from a string while keeping apostrophes, but I'm stuck on how to remove any apostrophes that are not between two letters.

str1 <- "I don't know 'how' to remove these ' things"

It should look like this:

"I don't know how to remove these things"

Answer 1

You may use a regex approach:

str1 <- "I don't know 'how' to remove these ' things"
gsub("\\s*'\\B|\\B'\\s*", "", str1)


The regex matches:

\\s*'\\B - zero or more whitespace characters, then ', then a non-word boundary
| - or
\\B'\\s* - a non-word boundary, then ', then zero or more whitespace characters
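To see what the \\s* parts contribute, here is a small sketch comparing the pattern with a variant that omits them. The names with_ws and without_ws are only illustrative, and the outputs assume R's default (TRE) regex engine.

```r
str1 <- "I don't know 'how' to remove these ' things"

# Pattern from the answer: also consumes whitespace next to a standalone apostrophe.
with_ws <- gsub("\\s*'\\B|\\B'\\s*", "", str1)

# Illustrative variant without \\s*: removes the same apostrophes, nothing else.
without_ws <- gsub("'\\B|\\B'", "", str1)

print(with_ws)     # "I don't know how to remove these things"
print(without_ws)  # "I don't know how to remove these  things" (double space)
```

The apostrophe in "don't" survives in both cases because there is a word boundary on each side of it, so neither \\B can match there.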

If you do not care about the extra whitespace that can remain after removing a standalone ', you can use a PCRE regex like

\b'\b(*SKIP)(*F)|'


Explanation:

\\b'\\b - match a ' between two word characters
(*SKIP)(*F) - and discard that match, then continue searching after it
| - or match...
' - an apostrophe in any other context.

In R, this pattern needs perl=TRUE:

gsub("\\b'\\b(*SKIP)(*F)|'", "", str1, perl=TRUE)
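If the leftover double space matters, one option is a second pass that collapses whitespace runs. This is only a sketch and not part of the original answer; no_apos and squeezed are illustrative names.

```r
str1 <- "I don't know 'how' to remove these ' things"

# Step 1: skip apostrophes between word characters, remove all others (PCRE).
no_apos <- gsub("\\b'\\b(*SKIP)(*F)|'", "", str1, perl = TRUE)

# Step 2: collapse whitespace runs left behind, then trim both ends.
squeezed <- trimws(gsub("\\s+", " ", no_apos))

print(no_apos)   # "I don't know how to remove these  things"
print(squeezed)  # "I don't know how to remove these things"
```

Note that the second pass also collapses any whitespace runs that were already in the input, which may or may not be what you want.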

To account for apostrophes between Unicode letters, add the (*UTF)(*UCP) flags at the start of the pattern and use the perl=TRUE argument:

gsub("(*UTF)(*UCP)\\s*'\\B|\\B'\\s*", "", str1, perl=TRUE)
      ^^^^^^^^^^^^                              ^^^^^^^^^

Or

gsub("(*UTF)(*UCP)\\b'\\b(*SKIP)(*F)|'", "", str1, perl=TRUE) 
      ^^^^^^^^^^^^


Answer 2

This method using gsub works:

gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", str1)

"I don't know  how to remove these   things"

It would require a second pass to remove the extra spaces. So:

gsub("  +", " ", gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", str1))

[^A-Za-z] matches any single non-alphabetic character
| is an OR (alternation)
() captures the matched sub-expression
\\2 is a backreference; in the replacement it inserts the second captured sub-expression
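One thing to watch with this pattern: each alternative needs a non-letter character next to the apostrophe, so an apostrophe at the very start or end of the string has no neighbor to match and is left in place. The input below is a made-up example to illustrate that; it is not from the question.

```r
# Hypothetical input with apostrophes at both ends of the string.
x <- "'how' are you'"

# Same pattern and replacement as above.
res <- gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", x)

print(res)  # "'how are you'"
```

Only the apostrophe followed by a space is replaced here. If your strings can begin or end with an apostrophe, a boundary-based or lookaround-based pattern is likely a better fit.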

Answer 3

Here's one approach using lookarounds in base R:

gsub("(?<![a-zA-Z])(')|(')(?![a-zA-Z])", "", str1, perl=TRUE)
## [1] "I don't know how to remove these  things"
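Note that the [a-zA-Z] lookarounds treat only ASCII letters as being inside a word, whereas a \\b-based pattern also counts digits and underscores as word characters. A small sketch of the difference, using a made-up input and illustrative variable names:

```r
# Hypothetical input: one apostrophe between letters, one between digits.
x <- "he's 5'10 tall"

# Lookarounds on letters: the apostrophe between digits is removed.
by_letters <- gsub("(?<![a-zA-Z])(')|(')(?![a-zA-Z])", "", x, perl = TRUE)

# Word-boundary version (PCRE skip-fail): digits count as word characters.
by_wordchars <- gsub("\\b'\\b(*SKIP)(*F)|'", "", x, perl = TRUE)

print(by_letters)    # "he's 510 tall"
print(by_wordchars)  # "he's 5'10 tall"
```

Which behavior is right depends on whether "between two letters" in the question should be read strictly or as "between two word characters".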
