[英]Filtering conditional on lag values in R
df
is a dataframe where each row is a pair of items (from item1
& item2
). df
是 dataframe ,其中每一行是一对项目(来自item1
和item2
)。
I want to keep the 1st row of the dataframe, and then keep only the 1st rows where the previous value of item2
is the current value of item1
.我想保留 dataframe 的第一行,然后只保留
item2
的前一个值是item1
的当前值的第一行。 So I except my data to look like output
.所以我除了我的数据看起来像
output
。
I would prefer a tidy
(or purrr
) way of doing so but open to any suggestions.我更喜欢这样做的
tidy
(或purrr
)方式,但愿意接受任何建议。
df <- structure(list(item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L),
item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L,
6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)), row.names = c(NA,
-24L), class = c("tbl_df", "tbl", "data.frame"))
df
#> item1 item2
#> 1 1 4
#> 2 1 5
#> 3 1 6
#> 4 1 7
#> 5 1 8
#> 6 2 4
#> 7 2 5
#> 8 2 6
#> 9 2 7
#> 10 2 8
#> 11 3 4
#> 12 3 5
#> 13 3 6
#> 14 3 7
#> 15 3 8
#> 16 4 5
#> 17 4 6
#> 18 4 7
#> 19 4 8
#> 20 5 7
#> 21 5 8
#> 22 6 7
#> 23 6 8
#> 24 7 8
output <- data.frame(item1 = c(1,4,5,7),
item2 = c(4,5,7,8))
output
#> item1 item2
#> 1 1 4
#> 2 4 5
#> 3 5 7
#> 4 7 8
Created on 2022-09-22 by the reprex package (v2.0.1)由代表 package (v2.0.1) 于 2022 年 9 月 22 日创建
Here's a solution using the tidyverse.这是使用 tidyverse 的解决方案。
Using a lag(..., default = 1)
ensures we also output the first row.使用
lag(..., default = 1)
确保我们还在第一行 output 。
library(tidyverse)
df <- tibble(
item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L),
item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)
)
df %>%
group_by(item1) %>%
summarize(item2 = first(item2)) %>%
filter(item1 == lag(item2, default = 1))
#> # A tibble: 4 × 2
#> item1 item2
#> <int> <int>
#> 1 1 4
#> 2 4 5
#> 3 5 7
#> 4 7 8
Created on 2022-09-22 by the reprex package (v2.0.1)由代表 package (v2.0.1) 于 2022 年 9 月 22 日创建
This is probably not what you were looking for (not a very tidy solution), but it yilds the desired output.这可能不是您要寻找的(不是一个非常整洁的解决方案),但它产生了所需的 output。
library(tidyverse)
df <- data.frame(
item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L),
item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L,
6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)
)
my_filter <- function(df_to_find, df_orig){
value_to_find <- tail(df_to_find, 1)$item2
df_found <- df_orig %>%
filter(item1 == value_to_find) %>%
head(1)
if(nrow(df_found) > 0){
# if something found, recall this function
# with the newly found data appended to the old results
return(Recall(bind_rows(df_to_find, df_found), df_orig))
} else{
# once you reach a state when nothing else is found return the results so far
# this is called recursion in programming
return(bind_rows(df_to_find))
}
}
Created on 2022-09-22 by the reprex package (v2.0.1)由代表 package (v2.0.1) 于 2022 年 9 月 22 日创建
This won't be directly vectorizable--I would do it with a simple for loop.这不会直接矢量化——我会用一个简单的 for 循环来做到这一点。 This will almost certainly be faster than a recursive solution for any sizable data.
对于任何大量数据,这几乎肯定会比递归解决方案更快。
keep = logical(length = nrow(df))
keep[1] = TRUE
target = df$item2[1]
for(i in 2:nrow(df)) {
if(df$item1[i] == target) {
keep[i] = TRUE
target = df$item2[i]
}
}
result = df[keep, ]
result
# # A tibble: 4 × 2
# item1 item2
# <int> <int>
# 1 1 4
# 2 4 5
# 3 5 7
# 4 7 8
Here is another untidy and recursive solution:这是另一个不整洁且递归的解决方案:
last2current = function (x) {
first = x[1, ]
first_match = with(x, match(item2[1], item1))
if (is.na(first_match)) return(first)
other = x[first_match:nrow(x), ]
rbind(first, last2current(other))
}
last2current(df)
item1 item2
1 1 4
16 4 5
20 5 7
24 7 8
This is a recursive function, this meaning that it calls itself .这是一个递归的 function,这意味着它调用了自己。 It stores the first row, then looks for the first match of
item2[1]
on item1
and stores the row number in first_match
.它存储第一行,然后在
item1
上item2[1]
的第一个匹配项,并将行号存储在first_match
中。 If there is no first_match
it means we are done, so return()
.如果没有
first_match
这意味着我们完成了,所以return()
。 If there is a match then it does the same procedure on the rows from the first match to the end of the data frame.如果有匹配,那么它对从第一个匹配到数据帧末尾的行执行相同的过程。 Finally it
cbind
s all the rows.最后它
cbind
所有行。
Note that this will fail if there is a row where item1 == item2
since item1[1]
is included in match
.请注意,如果存在
item1 == item2
的行,因为item1[1]
包含在match
,这将失败。
A base R recursion:一个基本的 R 递归:
relation <- function(df, row){
if(is.na(row)) head(row, -1)
else c(row, relation(df, match(df[row, 2], df[,1])))
}
# Starting at row 1
df[relation(df, 1), ]
item1 item2
1 1 4
16 4 5
20 5 7
24 7 8
# Starting at row 2
df[relation(df, 2), ]
item1 item2
2 1 5
20 5 7
24 7 8
# Starting at row 4
df[relation(df, 4), ]
item1 item2
4 1 7
24 7 8
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.