
R gsub numbers and space from variables

With gsub I am able to remove the # from these person variables, but my attempt to remove the number that follows it does not work. I would also like to remove the space after each person's name while keeping the space in the middle of the name.

c('mike smith #99','John johnson #2','jeff johnson #50') -> person

c(1:99) -> numbers

person <- gsub("#", "", person, fixed=TRUE)

# MY ISSUE
person <- gsub(numbers, "", person, fixed=TRUE)

df <- data.frame(PERSON = person)

Current Results:

PERSON
mike smith 99
John johnson 2
jeff johnson 50

Expected Results:

PERSON
mike smith
John johnson
jeff johnson
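
The reason the numbers survive is that gsub accepts a single pattern: when given a vector, it uses only the first element (and warns about it). The following is a minimal sketch of that behaviour, starting from the strings as they look after the # has already been removed; the variable name res is introduced here only for illustration.

```r
person  <- c("mike smith 99", "John johnson 2", "jeff johnson 50")
numbers <- 1:99

# gsub() uses only numbers[1], coerced to the string "1", as the pattern.
# None of these strings contains the character "1", so nothing is removed.
# suppressWarnings() hides the "pattern has length > 1" warning.
res <- suppressWarnings(gsub(numbers, "", person, fixed = TRUE))
print(res)
```

So the call is effectively gsub("1", "", person, fixed = TRUE), which is why a regular expression that describes "a # followed by digits" is the more natural tool here.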

Here's another pattern as an alternative:

> gsub("(.*)\\s+#.*", "\\1", person)
[1] "mike smith"   "John johnson" "jeff johnson"

In the above regex, (.*) captures any characters that come before one or more whitespace characters ( \\s+ ), followed by a # symbol and anything after it. The replacement \\1 then tells gsub to replace the whole match with just that captured group.

An easier way to get your desired output is:

> gsub("\\s+#.*$", "", person)
[1] "mike smith"   "John johnson" "jeff johnson"

The regex \\s+#.*$ matches one or more whitespace characters ( \\s+ ), a # symbol, and everything after it up to the end of the string ( .*$ ); replacing the match with an empty string removes it.
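
As a quick check, the capture-group form and the deletion form can be compared side by side. This is only a sketch under the assumption that every tag looks like whitespace, a #, then arbitrary text; the names keep_prefix and drop_suffix are made up for this example.

```r
person <- c("mike smith #99", "John johnson #2", "jeff johnson #50")

# Capture-group form: keep whatever precedes the whitespace and "#".
keep_prefix <- sub("(.*)\\s+#.*", "\\1", person)

# Deletion form: drop the whitespace, the "#", and everything after it.
drop_suffix <- sub("\\s+#.*$", "", person)

# Both forms should agree on this input.
stopifnot(identical(keep_prefix, drop_suffix))

# Build the data frame from the cleaned names, as in the question.
df <- data.frame(PERSON = drop_suffix)
print(df)
```

Strings that contain no # are left unchanged by both patterns, because neither pattern can match them.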

Using str_extract_all from the stringr package (this returns the words of each name as a matrix):

> library(stringr)
> str_extract_all(person, "[[:alpha:]]+", simplify = TRUE)
     [,1]   [,2]     
[1,] "mike" "smith"  
[2,] "John" "johnson"
[3,] "jeff" "johnson"

Also you can use:

library(stringi)
stri_extract_all(person, regex="[[:alpha:]]+", simplify=TRUE)

c('mike smith #99','John johnson #2','jeff johnson #50') -> person
sub("\\s+#.*", "", person)
[1] "mike smith"   "John johnson" "jeff johnson"
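
The word-extraction calls return one word per matrix cell rather than one name per person. A rough base R sketch of the same idea, assuming names consist only of letters separated by spaces, extracts the words and pastes them back together; words and full_names are illustrative names, not part of the answers above.

```r
person <- c("mike smith #99", "John johnson #2", "jeff johnson #50")

# Extract every run of letters from each string (a list with one vector per person).
words <- regmatches(person, gregexpr("[[:alpha:]]+", person))

# Rejoin each person's words with a single space.
full_names <- vapply(words, paste, character(1), collapse = " ")
print(full_names)
```

This drops anything that is not a letter, so it would also discard hyphens or apostrophes inside names; the sub-based patterns are safer when names may contain such characters.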

We can create the pattern with paste:

pat <- paste0("\\s*#(", paste(numbers, collapse = "|"), ")")
gsub(pat, "", person)
#[1] "mike smith"   "John johnson" "jeff johnson"

Note that the above solution builds the pattern from 'numbers'. If the goal is only to remove the # together with the digits that follow it:

sub("\\s*#\\d+$", "", person)
#[1] "mike smith"   "John johnson" "jeff johnson"

Or another option is:

unlist(strsplit(person, "\\s*#\\d+"))
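
These variants agree on the sample data but differ on less tidy input. The sketch below uses a made-up vector x (an assumption, not data from the question) to show how the end-anchored pattern, the "everything after #" pattern, and strsplit behave when a # also appears mid-string or not at all.

```r
x <- c("mike smith #99", "unit #4 manager #12", "no tag here")

# Anchored with $: removes only a "#digits" tag at the very end.
anchored <- sub("\\s*#\\d+$", "", x)

# Unanchored "#.*": removes everything from the first "#" onward.
from_first_hash <- sub("\\s+#.*", "", x)

# strsplit() splits at every tag, so one input can yield several pieces.
pieces <- unlist(strsplit(x, "\\s*#\\d+"))

print(anchored)
print(from_first_hash)
print(pieces)
```

Because strsplit can return more elements than it was given, the sub forms are the safer choice when the result has to line up with the rows of a data frame.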

NOTE: All the above are base R methods.


library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(person) %>% 
      separate(person, into = c("person", "notneeded"), sep = "\\s+#") %>% 
      select(person)

This could alternatively be done with read.table.

read.table(text = person, sep = "#", strip.white = TRUE, 
  as.is = TRUE, col.names = "PERSON")

giving:

        PERSON
1   mike smith
2 John johnson
3 jeff johnson

An alternative that deletes any sequence of characters other than lowercase letters at the end of the string:

gsub("[^a-z]+$", "", person)
[1] "mike smith"   "John johnson" "jeff johnson"

To also allow words that are all upper case or that end with an uppercase character:

gsub("[^a-zA-Z]+$", "", person)

Some names might end with a period:

gsub("[^a-zA-Z.]+$", "", person)
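
To see how the three character classes differ, here is a small sketch on a made-up vector x (the uppercase and "Jr." entries are assumptions added for illustration). It assumes the common behaviour in which the range a-z covers only lowercase ASCII letters; R's documentation notes that ranges can be locale-dependent.

```r
x <- c("mike smith #99", "JOHN SMITH #5", "Sammy Davis Jr. #12")

# Lowercase only: an all-uppercase name is deleted entirely.
lower_only <- gsub("[^a-z]+$", "", x)

# Upper and lower case: the uppercase name survives, a trailing "." does not.
any_case <- gsub("[^a-zA-Z]+$", "", x)

# Letters plus ".": the trailing period of "Jr." is kept as well.
with_period <- gsub("[^a-zA-Z.]+$", "", x)

print(lower_only)
print(any_case)
print(with_period)
```

If locale-dependence is a concern, the POSIX class form [^[:alpha:].]+$ is one possible substitute for the last pattern.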
