Filter dataframe on several columns using base or dplyr

I'm trying to filter the rows of a data frame based on columns in another data frame. Basically, I want to extract the rows with the same IDs where the position is between start and end. There is the extra twist that the IDs are formatted differently.
Finally, the data involved in the script is huge, so saving memory or time would be nice.
I would be grateful for some tips.

library(dplyr)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), 
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))

df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"), 
                  start=c(30, 20, 30, 40, 20 ),
                  end = c(40, 30, 50, 60, 45))

df.base <- df1[ paste0("id", df1$id) == df2$idstr && 
                 df1$pos >= df2$start &&
                 df1$pos <= df2$end,]

df.dplyr <- df1 %>%
            left_join(df2, by  = c('id' == 'idstr') ) %>%
            filter(pos >= start & pos <= end) %>%
            select(id, pos)

Edit: the expected output is the rows from df1 that meet the condition (their position falls within a range of df2 with the same id). So, if I have made no mistake:
id, pos
1, 30
1, 40
3, 39

Explanation: for example, df1[3,] has id == 1 and pos == 50. Looking at df2, there is no row where df2$idstr == "id1" and df2$start <= 50 and df2$end >= 50, so df1[3,] would be filtered out.

We can use a non-equi join in data.table. Create an 'id' with the same format in both datasets, then join on the 'id' columns, with a non-equi join of 'pos' against the 'start' and 'end' columns.

library(data.table)
setDT(df1)[, id := paste0('id', id)]
df1[df2, on = .(id = idstr, pos >= start, pos <= end)]
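As far as I understand data.table's non-equi joins, the join columns in the result carry the values from df2 (start and end) rather than the original pos, and df2 ranges without a match come back as NA rows. A minimal sketch of one way to get back only the matching df1 rows, assuming a data.table version that supports `nomatch = NULL` and the `x.` prefix; the names `dt1` and `res` are mine, not from the answer:

```r
library(data.table)

df1 <- data.table(id  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.table(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end   = c(40, 30, 50, 60, 45))

# Build the character key on a copy so df1 is not modified by reference
dt1 <- copy(df1)[, idstr := paste0("id", id)]

# Non-equi join: x.id and x.pos refer to the original columns of dt1;
# nomatch = NULL drops df2 ranges that match no row of dt1
res <- unique(dt1[df2,
                  on = .(idstr, pos >= start, pos <= end),
                  .(id = x.id, pos = x.pos),
                  nomatch = NULL])
res
```

The `unique()` call is there because pos 30 for id 1 falls inside two ranges of df2 and would otherwise appear twice.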

I took your two data frames, df1 and df2, and converted the idstr column of df2 to numeric by extracting the digits. Then a left_join followed by a filter gives the result.

library(dplyr)


df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))

df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"), 
                  start=c(30, 20, 30, 40, 20 ),
                  end = c(40, 30, 50, 60, 45))


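df2 %>% 
  # '[0-9]+' extracts the whole number, so ids with several digits (e.g. "id12") also work
  mutate(idstr = as.numeric(stringr::str_extract(idstr, '[0-9]+'))) %>% 
  left_join(df1, by = c('idstr' = 'id')) %>% 
  # both bounds inclusive; rows without a match have pos == NA and are dropped here
  dplyr::filter(pos >= start & pos <= end)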
#>   idstr start end pos
#> 1     1    30  40  30
#> 2     1    30  40  40
#> 3     1    20  30  30
#> 4     3    30  50  39

There is one row with df1$id == 1 (pos == 30) that fits into two start-end ranges in df2, so there have to be three matches with id == 1. If one of the limits is exclusive, as in the following code, the result matches what you asked for.


df2 %>% 
  mutate(idstr = as.numeric(stringr::str_extract(idstr, '[0-9]+'))) %>% 
  left_join(df1, by = c('idstr' = 'id')) %>% 
  # lower bound exclusive, upper bound inclusive
  dplyr::filter(pos > start & pos <= end)

#>   idstr start end pos
#> 1     1    30  40  40
#> 2     1    20  30  30
#> 3     3    30  50  39
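Since the question also asks about base R, here is a minimal sketch of an alternative that keeps both bounds inclusive and removes the duplicate afterwards instead of making one bound exclusive. It is not part of the answer above; the names `keyed`, `joined`, `hits` and `res` are only illustrative:

```r
df1 <- data.frame(id  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end   = c(40, 30, 50, 60, 45))

# Add a key in df2's format to a copy of df1 (df1 itself is left untouched)
keyed <- transform(df1, idstr = paste0("id", id))

# Inner join on the key: one row per (df1 row, df2 range) pair with the same id
joined <- merge(keyed, df2, by = "idstr")

# Keep the pairs where pos lies inside the range, both bounds inclusive
hits <- joined[joined$pos >= joined$start & joined$pos <= joined$end,
               c("id", "pos")]

# A position that matches several ranges appears once per range, so deduplicate
res <- unique(hits)
res <- res[order(res$id, res$pos), ]
rownames(res) <- NULL
res
```

Note that `merge()` materializes every pair of rows sharing an id before filtering, so on very large inputs this may use considerably more memory than a non-equi join.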
