Filter a data frame on several columns using base R or dplyr
I'm trying to filter the rows of a data frame based on columns in another data frame. Basically, I want to extract the rows with the same ID where the position lies between start and end. There is an extra twist: the IDs are formatted differently in the two data frames.
Finally, the data involved in the script is huge, so saving memory or time would be nice to have.
I would be grateful for some tips.
library(dplyr)
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45))
df.base <- df1[paste0("id", df1$id) == df2$idstr &&
               df1$pos >= df2$start &&
               df1$pos <= df2$end, ]
df.dplyr <- df1 %>%
  left_join(df2, by = c('id' == 'idstr')) %>%
  filter(pos >= start & pos <= end) %>%
  select(id, pos)
Edit: the expected output is the rows from df1 that meet the condition (their position falls within a range in df2 with the same id). So, if I have made no mistake:
id, pos
1, 30
1, 40
3, 39
Explanation: take df1[3,], for example, where id == 1 and pos == 50. In df2 there is no row where df2$idstr == "id1" and df2$start <= 50 and df2$end >= 50, so df1[3,] would be filtered out.
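The condition in the explanation can be written almost literally in base R. The following is only a minimal sketch on the sample data: `in_any_range()` is a hypothetical helper name, not a function from any package. It uses the element-wise `&` because `&&` compares single values only, so it cannot test whole columns at once. Each row of df1 is checked against all of df2, which avoids building a large joined table but may be slow on very large inputs.

```r
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45))

# Hypothetical helper: TRUE if at least one row of df2 has the same id
# (after adding the "id" prefix) and a start-end range containing pos.
in_any_range <- function(id, pos) {
  any(df2$idstr == paste0("id", id) & df2$start <= pos & df2$end >= pos)
}

# Evaluate the helper once per row of df1, then subset with the logical vector.
keep <- mapply(in_any_range, df1$id, df1$pos)
df.base <- df1[keep, ]
df.base
```

On the sample data this keeps the rows (1, 30), (1, 40) and (3, 39), each exactly once, even when a position falls into several ranges.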
We can use a non-equi join in data.table. Create an 'id' in the same format in both datasets, then join on the 'id' columns with a non-equi join on the 'pos', 'start' and 'end' columns.
library(data.table)
setDT(df1)[, id := paste0('id', id)]
df1[df2, on = .(id = idstr, pos >= start, pos <= end)]
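In a non-equi join like the one above, the pos columns in the result carry the start and end values from df2, not the original positions, as far as I know. The sketch below is one possible way to get back the rows of df1 themselves. It assumes a data.table version that accepts `nomatch = NULL` (1.12.0 or later); the names `dt1`, `dt2`, `hits` and `res` are made up for illustration.

```r
library(data.table)

dt1 <- data.table(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
dt2 <- data.table(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45))

# Add a key column in the same format as dt2$idstr, keeping the numeric id.
dt1[, idstr := paste0("id", id)]

# Non-equi join: nomatch = NULL drops ranges without any matching position,
# and x.pos asks for the original pos from dt1 rather than the range bounds.
hits <- dt1[dt2, on = .(idstr, pos >= start, pos <= end),
            nomatch = NULL, .(id, pos = x.pos)]

# A position lying in several ranges appears once per range, so deduplicate.
res <- unique(hits)
res
```

Because the join only materialises matching pairs, this should need less memory than joining on id alone and filtering afterwards, though that is worth measuring on the real data.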
I have taken your two data frames, df1 and df2, and converted the column idstr of df2 to numeric by extracting the digits. Then with left_join, group_by and filter I get the result.
library(dplyr)
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45))
df2 %>%
  # '[0-9]+' takes all the digits, so an id such as "id10" becomes 10, not 1
  mutate(idstr = as.numeric(stringr::str_extract(idstr, '[0-9]+'))) %>%
  left_join(df1, by = c('idstr' = 'id')) %>%
  dplyr::filter(pos >= start & pos <= end)
#> # A tibble: 4 x 4
#> # Groups:   idstr [2]
#>   idstr start   end   pos
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    30    40    30
#> 2     1    30    40    40
#> 3     1    20    30    30
#> 4     3    30    50    39
One position with df1$id == 1 fits into two start-end ranges in df2 (pos 30 lies in both 30-40 and 20-30), so the result has three rows with id = 1. If one of the limits is exclusive, as in the following code, you get the result you want.
df2 %>%
  # '[0-9]+' takes all the digits, so an id such as "id10" becomes 10, not 1
  mutate(idstr = as.numeric(stringr::str_extract(idstr, '[0-9]+'))) %>%
  left_join(df1, by = c('idstr' = 'id')) %>%
  dplyr::filter(pos > start & pos <= end)
#>   idstr start end pos
#> 1     1    30  40  40
#> 2     1    20  30  30
#> 3     3    30  50  39
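If both limits have to stay inclusive, another option is to keep the inclusive filter and drop the repeated rows afterwards. This is only a sketch under a few assumptions: it uses base `sub()` instead of stringr to strip the "id" prefix, it uses `inner_join()` so ids missing from df1 disappear straight away, and `res` is just an illustrative name. Recent dplyr versions (1.1.0 and later) may warn about a many-to-many relationship here, which is expected because several ranges share an id.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45))

res <- df2 %>%
  # Strip the "id" prefix so the key has the same type as df1$id.
  mutate(id = as.numeric(sub("^id", "", idstr))) %>%
  inner_join(df1, by = "id") %>%
  # Both limits inclusive, as in the question.
  filter(pos >= start, pos <= end) %>%
  # A position inside several ranges shows up once per range; keep it once.
  distinct(id, pos)
res
```

On the sample data this gives (1, 30), (1, 40) and (3, 39), matching the expected output in the question. Note that the join on id alone builds every id-matching pair before filtering, so it may be memory-hungry on very large data.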