简体   繁体   English

根据 R 中的滞后值过滤条件

[英]Filtering conditional on lag values in R

df is a dataframe where each row is a pair of items (from item1 & item2 ). df是 dataframe ,其中每一行是一对项目(来自item1item2 )。

I want to keep the 1st row of the dataframe, and then keep only the 1st rows where the previous value of item2 is the current value of item1 .我想保留 dataframe 的第一行,然后只保留item2的前一个值是item1的当前值的第一行。 So I except my data to look like output .所以我除了我的数据看起来像output

I would prefer a tidy (or purrr ) way of doing so but open to any suggestions.我更喜欢这样做的tidy (或purrr )方式,但愿意接受任何建议。

df <- structure(list(item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L), 
  item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 
  6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)), row.names = c(NA, 
  -24L), class = c("tbl_df", "tbl", "data.frame"))
df
#>    item1 item2
#> 1      1     4
#> 2      1     5
#> 3      1     6
#> 4      1     7
#> 5      1     8
#> 6      2     4
#> 7      2     5
#> 8      2     6
#> 9      2     7
#> 10     2     8
#> 11     3     4
#> 12     3     5
#> 13     3     6
#> 14     3     7
#> 15     3     8
#> 16     4     5
#> 17     4     6
#> 18     4     7
#> 19     4     8
#> 20     5     7
#> 21     5     8
#> 22     6     7
#> 23     6     8
#> 24     7     8

output <- data.frame(item1 = c(1,4,5,7),
           item2 = c(4,5,7,8))
output
#>   item1 item2
#> 1     1     4
#> 2     4     5
#> 3     5     7
#> 4     7     8

Created on 2022-09-22 by the reprex package (v2.0.1)代表 package (v2.0.1) 于 2022 年 9 月 22 日创建

Here's a solution using the tidyverse.这是使用 tidyverse 的解决方案。

Using a lag(..., default = 1) ensures we also output the first row.使用lag(..., default = 1)确保我们还在第一行 output 。

library(tidyverse)

df <- tibble(
  item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L), 
  item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)
)

df %>%
  group_by(item1) %>%
  summarize(item2 = first(item2)) %>%
  filter(item1 == lag(item2, default = 1))
#> # A tibble: 4 × 2
#>   item1 item2
#>   <int> <int>
#> 1     1     4
#> 2     4     5
#> 3     5     7
#> 4     7     8

Created on 2022-09-22 by the reprex package (v2.0.1)代表 package (v2.0.1) 于 2022 年 9 月 22 日创建

This is probably not what you were looking for (not a very tidy solution), but it yilds the desired output.这可能不是您要寻找的(不是一个非常整洁的解决方案),但它产生了所需的 output。

library(tidyverse)

df <- data.frame(
  item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
            2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L), 
  item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 
            6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)
)

my_filter <- function(df_to_find, df_orig){
  value_to_find <- tail(df_to_find, 1)$item2
  df_found <- df_orig %>%
    filter(item1 == value_to_find) %>%
    head(1)
  
  if(nrow(df_found) > 0){
    # if something found, recall this function
    # with the newly found data appended to the old results
    return(Recall(bind_rows(df_to_find, df_found), df_orig))
  } else{
    # once you reach a state when nothing else is found return the results so far
    # this is called recursion in programming
    return(bind_rows(df_to_find))
  }
  
}

Created on 2022-09-22 by the reprex package (v2.0.1)代表 package (v2.0.1) 于 2022 年 9 月 22 日创建

This won't be directly vectorizable--I would do it with a simple for loop.这不会直接矢量化——我会用一个简单的 for 循环来做到这一点。 This will almost certainly be faster than a recursive solution for any sizable data.对于任何大量数据,这几乎肯定会比递归解决方案更快。

keep = logical(length = nrow(df)) 
keep[1] = TRUE
target = df$item2[1]
for(i in 2:nrow(df)) {
  if(df$item1[i] == target) {
    keep[i] = TRUE
    target = df$item2[i]
  }
}
result = df[keep, ]
result
# # A tibble: 4 × 2
#   item1 item2
#   <int> <int>
# 1     1     4
# 2     4     5
# 3     5     7
# 4     7     8

Here is another untidy and recursive solution:这是另一个不整洁且递归的解决方案:

last2current = function (x) {
  first = x[1, ]
  first_match = with(x, match(item2[1], item1))
  if (is.na(first_match)) return(first)
  other = x[first_match:nrow(x), ]
  rbind(first, last2current(other))
}

last2current(df)
   item1 item2
1      1     4
16     4     5
20     5     7
24     7     8

Explanation:解释:

This is a recursive function, this meaning that it calls itself .这是一个递归的 function,这意味着它调用了自己 It stores the first row, then looks for the first match of item2[1] on item1 and stores the row number in first_match .它存储第一行,然后在item1item2[1]的第一个匹配项,并将行号存储在first_match中。 If there is no first_match it means we are done, so return() .如果没有first_match这意味着我们完成了,所以return() If there is a match then it does the same procedure on the rows from the first match to the end of the data frame.如果有匹配,那么它对从第一个匹配到数据帧末尾的行执行相同的过程。 Finally it cbind s all the rows.最后它cbind所有行。

Note that this will fail if there is a row where item1 == item2 since item1[1] is included in match .请注意,如果存在item1 == item2的行,因为item1[1]包含在match ,这将失败。

A base R recursion:一个基本的 R 递归:

relation <- function(df, row){
  if(is.na(row)) head(row, -1)
  else c(row, relation(df, match(df[row, 2], df[,1]))) 
}

# Starting at row 1
df[relation(df, 1), ]

  item1 item2
1      1     4
16     4     5
20     5     7
24     7     8

# Starting at row 2
df[relation(df, 2), ]
   item1 item2
2      1     5
20     5     7
24     7     8

# Starting at row 4
df[relation(df, 4), ]
   item1 item2
4      1     7
24     7     8

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM