Remove string from a vector in R

I have a vector that looks like this:

> inecodes
   [1] "01001" "01002" "01049" "01003" "01006" "01037" "01008" "01004" "01009" "01010" "01011"
  [12] "01013" "01014" "01016" "01017" "01021" "01022" "01023" "01046" "01056" "01901" "01027"
  [23] "01019" "01020" "01028" "01030" "01031" "01032" "01902" "01033" "01036" "01058" "01034"
  [34] "01039" "01041" "01042" "01043" "01044" "01047" "01051" "01052" "01053" "01054" "01055"

And I want to remove these "numbers" from this vector:

>pob
 [1] "01001-Alegría-Dulantzi"           "01002-Amurrio"                   
 [3] "01049-Añana"                      "01003-Aramaio"                   
 [5] "01006-Armiñón"                    "01037-Arraia-Maeztu"             
 [7] "01008-Arratzua-Ubarrundia"        "01004-Artziniega"                
 [9] "01009-Asparrena"                  "01010-Ayala/Aiara"               
[11] "01011-Baños de Ebro/Mañueta"      "01013-Barrundia"                 
[13] "01014-Berantevilla"               "01016-Bernedo"                   
[15] "01017-Campezo/Kanpezu"            "01021-Elburgo/Burgelu"           
[17] "01022-Elciego"                    "01023-Elvillar/Bilar"            
[19] "01046-Erriberagoitia/Ribera Alta"

Both vectors are longer than these samples, and they don't have the same length. The result should look like this:

>pob
     [1] "Alegría-Dulantzi"           "Amurrio"                   
     [3] "Añana"                      "Aramaio"                   
     [5] "Armiñón"                    "Arraia-Maeztu"             
     [7] "Arratzua-Ubarrundia"        "Artziniega"                
     [9] "Asparrena"                  "Ayala/Aiara"               
    [11] "Baños de Ebro/Mañueta"      "Barrundia"                 
    [13] "Berantevilla"               "Bernedo"                   
    [15] "Campezo/Kanpezu"            "Elburgo/Burgelu"           
    [17] "Elciego"                    "Elvillar/Bilar"            
    [19] "Erriberagoitia/Ribera Alta"

Not sure why you needed inecodes at all, since you can use sub to strip the leading digits and the hyphen that follows them:

sub('^\\d+-', '', pob)

Result:

 [1] "Alegría-Dulantzi"           "Amurrio"                    "Añana"                     
 [4] "Aramaio"                    "Armiñón"                    "Arraia-Maeztu"             
 [7] "Arratzua-Ubarrundia"        "Artziniega"                 "Asparrena"                 
[10] "Ayala/Aiara"                "Baños de Ebro/Mañueta"      "Barrundia"                 
[13] "Berantevilla"               "Bernedo"                    "Campezo/Kanpezu"           
[16] "Elburgo/Burgelu"            "Elciego"                    "Elvillar/Bilar"            
[19] "Erriberagoitia/Ribera Alta"
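
To see why the anchored pattern is safe for names that contain hyphens themselves, here is a minimal sketch on a three-element sample. The vector name sample_pob is made up for illustration; the values come from the question. sub replaces only the first match, and ^\\d+- can only match at the start of each string:

```r
# Hypothetical sample; values taken from the question's pob vector
sample_pob <- c("01001-Alegría-Dulantzi", "01037-Arraia-Maeztu", "01002-Amurrio")

# ^     anchors the match at the start of the string
# \\d+  matches one or more digits (the code)
# -     matches the hyphen that separates the code from the name
cleaned <- sub("^\\d+-", "", sample_pob)

print(cleaned)
```

The hyphens inside "Alegría-Dulantzi" and "Arraia-Maeztu" are left alone, because only the leading code and its separator are removed.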

One reason you might need inecodes is that pob contains codes that don't exist in inecodes, but that doesn't seem to be the case here. If you insist on using inecodes to remove the numbers from pob, you can use str_replace_all from stringr:

library(stringr)

str_replace_all(pob, setNames(rep("", length(inecodes)), paste0(inecodes, "-")))

This gives you the exact same result:

 [1] "Alegría-Dulantzi"           "Amurrio"                    "Añana"                     
 [4] "Aramaio"                    "Armiñón"                    "Arraia-Maeztu"             
 [7] "Arratzua-Ubarrundia"        "Artziniega"                 "Asparrena"                 
[10] "Ayala/Aiara"                "Baños de Ebro/Mañueta"      "Barrundia"                 
[13] "Berantevilla"               "Bernedo"                    "Campezo/Kanpezu"           
[16] "Elburgo/Burgelu"            "Elciego"                    "Elvillar/Bilar"            
[19] "Erriberagoitia/Ribera Alta"
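
If you would rather stay in base R, one possible equivalent is to collapse the codes into a single anchored alternation and pass it to sub. This is only a sketch: inecodes_demo and pob_demo are made-up names, and "99999-Unknown" is an invented entry added to show what happens to a code that is not in the lookup vector. It relies on the codes being plain digits, so no regex escaping is needed.

```r
# Hypothetical, shortened inputs for illustration only
inecodes_demo <- c("01001", "01002")
pob_demo <- c("01001-Alegría-Dulantzi", "01002-Amurrio", "99999-Unknown")

# Build one anchored alternation, here: ^(01001|01002)-
pattern <- paste0("^(", paste(inecodes_demo, collapse = "|"), ")-")

# Only prefixes whose code appears in inecodes_demo are removed
result <- sub(pattern, "", pob_demo)

print(result)
```

The entry whose code is missing from inecodes_demo keeps its prefix, which is the one situation where using the lookup vector gives a different answer from the plain ^\\d+- pattern.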

Data:

inecodes = c("01001", "01002", "01049", "01003", "01006", "01037", "01008", 
"01004", "01009", "01010", "01011", "01013", "01014", "01016", 
"01017", "01021", "01022", "01023", "01046", "01056", "01901", 
"01027", "01019", "01020", "01028", "01030", "01031", "01032", 
"01902", "01033", "01036", "01058", "01034", "01039", "01041", 
"01042", "01043", "01044", "01047", "01051", "01052", "01053", 
"01054", "01055")

pob = c("01001-Alegría-Dulantzi", "01002-Amurrio", "01049-Añana", "01003-Aramaio", 
"01006-Armiñón", "01037-Arraia-Maeztu", "01008-Arratzua-Ubarrundia", 
"01004-Artziniega", "01009-Asparrena", "01010-Ayala/Aiara", "01011-Baños de Ebro/Mañueta", 
"01013-Barrundia", "01014-Berantevilla", "01016-Bernedo", "01017-Campezo/Kanpezu", 
"01021-Elburgo/Burgelu", "01022-Elciego", "01023-Elvillar/Bilar", 
"01046-Erriberagoitia/Ribera Alta")
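# Alternative: loop over the codes and strip each matching prefix
library(stringr)

for (code in inecodes) {
  # Entries of pob that start with this code followed by a hyphen
  ix <- which(str_detect(pob, paste0("^", code, "-")))
  # Split only those entries at the first hyphen and keep the part after it
  pob[ix] <- str_split_fixed(pob[ix], "-", 2)[, 2]
}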

Try this. match should be much faster:

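# Positions in pob whose leading code is present in inecodes
pos <- which(!is.na(match(sub('^([0-9]+)-.*$', '\\1', pob), inecodes)))
# Strip the code and hyphen from those entries only
pob[pos] <- sub('^[0-9]+-(.*)$', '\\1', pob[pos])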

Please post the timings if you get this working. match usually solves many performance problems for lookups in large data sets. I would like to see whether there are any scenarios where the opposite is true.
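
For anyone who wants to try the comparison, here is a rough timing sketch in base R. Everything in it is an assumption made for illustration: the data are synthetic, and the helper names strip_loop and strip_match are invented here, not taken from the answers above. It has not been run against the original data, so no timings are quoted.

```r
# Synthetic data, made up purely for timing purposes
set.seed(1)
codes   <- sprintf("%05d", 1:2000)
big_pob <- paste0(sample(codes, 20000, replace = TRUE), "-Name")
known   <- codes[1:1000]  # only half of the codes are in the lookup vector

# Loop approach: one pass over the data per code
strip_loop <- function(x, codes) {
  for (code in codes) {
    ix <- startsWith(x, paste0(code, "-"))
    x[ix] <- sub("^[0-9]+-", "", x[ix])
  }
  x
}

# match approach: extract every prefix once, then do a single lookup
strip_match <- function(x, codes) {
  prefix <- sub("^([0-9]+)-.*$", "\\1", x)
  hit <- !is.na(match(prefix, codes))
  x[hit] <- sub("^[0-9]+-", "", x[hit])
  x
}

t_loop  <- system.time(res_loop  <- strip_loop(big_pob, known))
t_match <- system.time(res_match <- strip_match(big_pob, known))

print(t_loop)
print(t_match)

# Both approaches should produce the same vector
stopifnot(identical(res_loop, res_match))
```

The loop scans the whole vector once per code, while the match version does one vectorised lookup, so the gap would be expected to grow with the number of codes. Measure on your own data before drawing conclusions.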
