Regex to remove everything except letters and remove multiple spaces

I'm trying to make a single regex to remove everything except:

  1. letters
  2. apostrophes
  3. single spaces

I tried [^\\p{L} ']+ with a lookbehind for the extra spaces, (?<=\\s)\\s+. Each works in isolation:

gsub("(?<=\\s)\\s+", "", "I like 56 dogs that's him55.", perl = TRUE)
## [1] "I like 56 dogs that's him55."

gsub("[^\\p{L} ']+", "", "I like 56 dogs that's him55.", perl = TRUE)
## [1] "I like  dogs that's him"

But when I use or ( | ) to connect them:

gsub("((?<=\\s)\\s+)|([^\\p{L} ']+)", "", "I like 56 dogs that's him55.", perl = TRUE)

This returns:

[1] "I like  dogs that's him"

I'd like it to also remove the extra space (between "like" and "dogs"), giving:

[1] "I like dogs that's him"

How can I use one regex to remove everything except letters, apostrophes and single spaces?

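It seems the problem is in your regex. The version below turns every run of digits (and other unwanted characters) into a space instead, and it works fine for me: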

gsub("[^\\p{L}']+", " ", "I like 56 dogs that's him55.", perl = TRUE)
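As a rough sketch of why the combined pattern in the question leaves two spaces, and of one caveat with the replacement above: the lookbehind is evaluated against the original string, where the space before "dogs" follows "6" rather than whitespace. The variable names (combined, matched, collapsed, result) are illustrative only, and trimws() is assumed to be available (base R 3.2.0 or later).

```r
x <- "I like 56 dogs that's him55."

# What the combined pattern from the question actually matches: only the
# digit/punctuation runs. No space in the original string is preceded by
# whitespace, so the lookbehind branch never fires.
combined <- "((?<=\\s)\\s+)|([^\\p{L} ']+)"
matched <- regmatches(x, gregexpr(combined, x, perl = TRUE))[[1]]
print(matched)

# Replacing each run of non-letter, non-apostrophe characters (spaces
# included) with one space avoids that, but "55." leaves a trailing space.
collapsed <- gsub("[^\\p{L}']+", " ", x, perl = TRUE)

# Trim the ends to get the desired string.
result <- trimws(collapsed)
print(result)
```

If leading or trailing junk is possible in the input, the trimws() step is probably worth keeping.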

You can try the following if you want to do this in one call:

x <- "I like 56 dogs that's him55."
gsub("[^\\pL' ]+\\h+(?=\\h)|\\h+(?=[^\\pL' ]+)|[^\\pL' ]+", "", x, perl = TRUE)
# [1] "I like dogs that's him"
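To see what the single-call pattern removes, one option is to list its matches rather than replace them. This is only an inspection sketch; the names pattern and removed are illustrative, not part of the answer.

```r
x <- "I like 56 dogs that's him55."
pattern <- "[^\\pL' ]+\\h+(?=\\h)|\\h+(?=[^\\pL' ]+)|[^\\pL' ]+"

# gregexpr() finds every non-overlapping match; regmatches() extracts them.
removed <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(removed)
```

For this input the second alternative consumes the space in front of "56", so the space after the digits is the only one left between "like" and "dogs".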

Here is another way to approach this, which is more efficient in my opinion:

x <- "I like 56 dogs that's him55."
r <- gsub("[^\\pL' ]+", '', x, perl = TRUE)
paste(strsplit(r, '\\s+')[[1]], collapse = ' ')
# [1] "I like dogs that's him"
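The strsplit(...)[[1]] step above handles a single string only. If a character vector needs cleaning, a two-pass variant along these lines might be easier to reuse. The function name keep_letters is hypothetical, and the second gsub() swaps the split-and-paste step for a plain space-collapsing regex.

```r
# Hypothetical helper: keep letters, apostrophes and single spaces.
keep_letters <- function(x) {
  # Pass 1: drop everything that is not a letter, apostrophe or space.
  stripped <- gsub("[^\\p{L}' ]+", "", x, perl = TRUE)
  # Pass 2: collapse runs of spaces left behind by the removals.
  squeezed <- gsub(" +", " ", stripped)
  # Remove any leading or trailing space.
  trimws(squeezed)
}

cleaned <- keep_letters(c("I like 56 dogs that's him55.",
                          "  42 cats!!  and 7 birds "))
print(cleaned)
```

Because gsub() and trimws() are vectorized, the helper works element by element without an explicit loop.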
