R grep for each element in vector

I have two data frames:

> a
    box        hits
1 px085 agx|amx|app
2 px075 gxz|gpx|amr
3 px065 abc|apr|ppy
4 rx055 alo|amx|bbc
5 rx088 ppy|pxg|ptr
6 rx099 prt|ppm|zee

> b
  hitcode appid
1     agx 12485
2     abc 18550
3     bbc 19225
4     ppy 15260
5     zee 16880

I'm trying to get this output:

    box        hits appcode
1 px085 agx|amx|app   12485
2 px075 gxz|gpx|amr       
3 px065 abc|apr|ppy   18550
4 rx055 alo|amx|bbc   19225
5 rx088 ppy|pxg|ptr   15260
6 rx099 prt|ppm|zee   16880

I tried:

gcode <- function(x){
  b[grep(x, b$hitcode, ignore.case = TRUE, perl = TRUE), c('appid')]
}

Which gives me:

> gcode(a$hits)
#[1] 12485
#Warning message:
#In grep(x, b$hitcode, ignore.case = TRUE, perl = TRUE) :
#  argument 'pattern' has length > 1 and only the first element will be used

What am I missing here?
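
The warning is the clue: grep() accepts a single pattern, so passing the whole a$hits column uses only its first element. A minimal sketch of one way around that is to call the function once per element. It assumes the two data frames are rebuilt as below with character columns, and that several matches for one row should be joined with ";" (that separator is a choice made here, not something the question specifies):

```r
a <- data.frame(
  box  = c("px085", "px075", "px065", "rx055", "rx088", "rx099"),
  hits = c("agx|amx|app", "gxz|gpx|amr", "abc|apr|ppy",
           "alo|amx|bbc", "ppy|pxg|ptr", "prt|ppm|zee"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  hitcode = c("agx", "abc", "bbc", "ppy", "zee"),
  appid   = c(12485, 18550, 19225, 15260, 16880),
  stringsAsFactors = FALSE
)

# Takes ONE pattern such as "agx|amx|app" and returns the matching appids
gcode <- function(x) {
  idx <- grep(x, b$hitcode, ignore.case = TRUE, perl = TRUE)
  if (length(idx) == 0) {
    NA_character_                        # no hitcode matched this row
  } else {
    paste(b$appid[idx], collapse = ";")  # keep every match, not just the first
  }
}

# vapply() calls gcode() once per element of a$hits
a$appcode <- vapply(a$hits, gcode, character(1), USE.NAMES = FALSE)
```

With the sample data, row 3 ("abc|apr|ppy") matches both abc and ppy, so its appcode holds two ids.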

As noted in the comments, your example allows several apps to match the hit codes of a single row. Here's a loop-based solution that appends each additional appid instead of overwriting the one already stored.

The loop iterates over the rows of b with seq_along(b$hitcode), so it works whether your character variables are stored as factors or as plain character vectors.

a$appid <- NA_character_

for (i in seq_along(b$hitcode)) {
  cur <- as.character(b$hitcode[i])   # current hit code
  hit <- grep(cur, a$hits)            # rows of a whose hits contain it
  app <- b$appid[i]

  # Fill rows that have no appid yet; append to rows that already have one
  na <- is.na(a$appid[hit])
  a$appid[hit[na]] <- app
  a$appid[hit[!na]] <- paste(a$appid[hit[!na]], app, sep = ";")
}

This gives:

# > a
#     box        hits       appid
# 1 px085 agx|amx|app       12485
# 2 px075 gxz|gpx|amr        <NA>
# 3 px065 abc|apr|ppy 18550;15260
# 4 rx055 alo|amx|bbc       19225
# 5 rx088 ppy|pxg|ptr       15260
# 6 rx099 prt|ppm|zee       16880
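
One caveat: grep() does substring matching, so a hypothetical hitcode such as "pp" would also hit "app" and "ppy". If exact code matches are wanted, a sketch along these lines may help. It uses base R only, rebuilds the sample data with character columns, and the helper name lookup_ids is made up for illustration:

```r
a <- data.frame(
  box  = c("px085", "px075", "px065", "rx055", "rx088", "rx099"),
  hits = c("agx|amx|app", "gxz|gpx|amr", "abc|apr|ppy",
           "alo|amx|bbc", "ppy|pxg|ptr", "prt|ppm|zee"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  hitcode = c("agx", "abc", "bbc", "ppy", "zee"),
  appid   = c(12485, 18550, 19225, 15260, 16880),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact lookup of every code in one "x|y|z" string
lookup_ids <- function(hit_string) {
  # fixed = TRUE treats "|" as a literal separator, not as regex alternation
  codes <- strsplit(hit_string, "|", fixed = TRUE)[[1]]
  ids <- b$appid[match(codes, b$hitcode)]  # NA where a code is not in b
  ids <- ids[!is.na(ids)]
  if (length(ids) == 0) NA_character_ else paste(ids, collapse = ";")
}

a$appid <- vapply(a$hits, lookup_ids, character(1), USE.NAMES = FALSE)
```

On the sample data this produces the same appid column as the loop, since none of the hit codes is a substring of another.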

You could try:

library(dplyr)
library(tidyr)
library(stringi)

a %>% 
  separate(hits, into = paste(1:3), remove = FALSE) %>%
  gather(key, value, -box, -hits) %>%
  left_join(., b, by = c("value" = "hitcode")) %>% 
  group_by(box, hits) %>%
  summarise(appid = toString(appid) %>% stri_extract_all(., regex = "[:digit:]+"))

This stores the appid results in a list column that you can access later:

#Source: local data frame [6 x 3]
#Groups: box
#
#    box        hits    appid
#1 px065 abc|apr|ppy <chr[2]>
#2 px075 gxz|gpx|amr <chr[1]>
#3 px085 agx|amx|app <chr[1]>
#4 rx055 alo|amx|bbc <chr[1]>
#5 rx088 ppy|pxg|ptr <chr[1]>
#6 rx099 prt|ppm|zee <chr[1]>

Structure

#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    6 obs. of  3 variables:
# $ box  : chr  "px065" "px075" "px085" "rx055" ...
# $ hits : chr  "abc|apr|ppy" "gxz|gpx|amr" "agx|amx|app" "alo|amx|bbc" ...
# $ appid:List of 6
#  ..$ : chr  "18550" "15260"
#  ..$ : chr NA
#  ..$ : chr "12485"
#  ..$ : chr "19225"
#  ..$ : chr "15260"
#  ..$ : chr "16880"
# - attr(*, "vars")=List of 1
#  ..$ : symbol box
# - attr(*, "drop")= logi TRUE
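
To make "access later" concrete, here is a minimal base R sketch of working with such a list column. The data frame res is a hand-built stand-in mirroring the first three rows above (it is not produced by the pipeline), and flatten_ids is a hypothetical helper name:

```r
# Stand-in for the first three rows of the result, with appid as a list column
res <- data.frame(box = c("px065", "px075", "px085"), stringsAsFactors = FALSE)
res$appid <- list(c("18550", "15260"), NA_character_, "12485")

first_ids <- res$appid[[1]]   # all ids for the first row, as a character vector
n_ids <- lengths(res$appid)   # number of entries stored per row

# Hypothetical helper: collapse one list element to a single string, keeping NA
flatten_ids <- function(ids) {
  if (all(is.na(ids))) NA_character_ else paste(ids[!is.na(ids)], collapse = ";")
}
res$appid_flat <- vapply(res$appid, flatten_ids, character(1))
```

Note that lengths() counts the lone NA in the second row as one entry, which is why the printed tibble above shows <chr[1]> for rows without a match.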

Here's an attempt using data.table:

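library(data.table)
# For each box, find the rows of b whose hitcode matches that box's hits pattern
indx <- setDT(a)[, grep(hits, b$hitcode), by = box]
# Look up the matched appids and collapse them into one string per box
indx2 <- setDT(b)[indx$V1, .(indx$box, appid)][, .(toString(appid)), by = .(box = V1)]
# Key a by box, then add the collapsed appids by reference via a join
setkey(a, box)
a[indx2, appid := i.V1]
a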
#      box        hits        appid
# 1: px065 abc|apr|ppy 18550, 15260
# 2: px075 gxz|gpx|amr           NA
# 3: px085 agx|amx|app        12485
# 4: rx055 alo|amx|bbc        19225
# 5: rx088 ppy|pxg|ptr        15260
# 6: rx099 prt|ppm|zee        16880
