(Easy?) Calculating change across multiple years as -1, +1 or 0 in R

Question

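I want to measure whether each work task is (1) new, (2) displaced, or (3) present the whole time. Whether a task exists in a given year is binary (1 or 0). The output I need is a simple distance measure, which must look like: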

Task was always present (0)
Task was removed at any point in time (-1)
Task was newly added (+1)

task_id <- c('X001','X002','X003', 'X004')
year2016 <- c(1, 1, 0, 1)
year2017 <- c(1, 0, 0, 1)
year2018 <- c(1, 0, 1, 1)
year2019 <- c(0, 0, 1, 1)
output <- c(-1, -1, 1, 0)

df <- data.frame(task_id, year2016, year2017, year2018, year2019, output)

The output column must look like this:

  task_id year2016 year2017 year2018 year2019 output
1    X001        1        1        1        0     -1
2    X002        1        0        0        0     -1
3    X003        0        0        1        1      1
4    X004        1        1        1        1      0

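Any suggestions on how to code this? Minor addition: the actual year columns are in a standard date format (in case that affects the solution). Thanks so much!!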

Answer 1

The simplest version is the one where we can ignore cases where a row looks like 1, 0, 1, 0 or 0, 0, 0, 0. In that case, we can use:

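df <- data.frame(task_id, year2016, year2017, year2018, year2019)
df$output <- 0                       # default: present the whole time
df$output[df$year2016 == 0] <- 1     # absent at the start, so added later
df$output[df$year2019 == 0] <- -1    # absent at the end, so removed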

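The logic of the third line is that tasks which did not exist at the start must have been added at some point; we then check for tasks that were present at the start but not at the end, and mark them as removed.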

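The logic for the more complicated case is:

Create a new column (num_switches) that tracks how often a given row flips from 0 to 1 or vice versa; this is what rle() is for. As computed in the code, it is the number of runs of equal values, which is one more than the number of flips.
Automatically mark anything with num_switches > 2 as output = -2
For the cases with num_switches <= 2, mark them as above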
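As a small illustration of what rle() returns for a single row, here is a minimal sketch. The three vectors are made-up rows, not part of the question's data.

```r
# Illustrative rows only (hypothetical names, not from the question's data)
steady   <- c(1, 1, 1, 1)
removed  <- c(1, 1, 1, 0)
flipflop <- c(1, 0, 1, 0)

# rle() collapses consecutive equal values into runs
rle(removed)$lengths   # 3 1
rle(removed)$values    # 1 0

# Number of runs per row; a row with more than 2 runs flips more than once
n_runs <- sapply(list(steady, removed, flipflop),
                 function(x) length(rle(x)$lengths))
n_runs                 # 1 2 4
```

A row with one run never changes, a row with two runs changes exactly once, and anything above two is the "complicated" case.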

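The full code and an expanded toy dataset are below. Note that the 2:5 reference in the df subset should match your year columns; the more responsible thing to do here would probably be to create an external variable that keeps track of those columns and reference it here (e.g., in case you add more years later).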

task_id <- c('X001','X002','X003', 'X004', 'X005')
year2016 <- c(1, 1, 0, 1, 1)
year2017 <- c(1, 0, 0, 1, 0)
year2018 <- c(1, 0, 1, 1, 1)
year2019 <- c(0, 0, 1, 1, 0)
# output <- c(-1, -1, 1, 0)

df <- data.frame(task_id, year2016, year2017, year2018, year2019)
df$output <- 0
# number of runs of equal values per row (one more than the number of flips)
df$num_switches <- apply(df[, 2:5], 1, function(x) length(rle(x)$lengths))
df$output[df$num_switches > 2] <- -2
df$output[df$year2016 == 0 & df$num_switches <= 2] <- 1
df$output[df$year2019 == 0 & df$num_switches <= 2] <- -1
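The idea of tracking the year columns in a separate variable might look something like the sketch below. The names df2, year_cols, first_col and last_col are hypothetical, and it assumes every year column name starts with "year" and that the columns are already in chronological order.

```r
task_id  <- c("X001", "X002", "X003", "X004", "X005")
year2016 <- c(1, 1, 0, 1, 1)
year2017 <- c(1, 0, 0, 1, 0)
year2018 <- c(1, 0, 1, 1, 1)
year2019 <- c(0, 0, 1, 1, 0)
df2 <- data.frame(task_id, year2016, year2017, year2018, year2019)

# Hypothetical helper variable: all columns whose name starts with "year"
# (assumed to be in chronological order)
year_cols <- grep("^year", names(df2), value = TRUE)
first_col <- year_cols[1]
last_col  <- year_cols[length(year_cols)]

# Number of runs of equal values in each row
n_runs <- apply(df2[, year_cols], 1, function(x) length(rle(x)$lengths))

df2$output <- 0
df2$output[n_runs > 2] <- -2
df2$output[df2[[first_col]] == 0 & n_runs <= 2] <- 1
df2$output[df2[[last_col]] == 0 & n_runs <= 2] <- -1
df2$output   # -1 -1  1  0 -2
```

With this layout, adding a year2020 column only requires that it matches the naming pattern; the hard-coded 2:5, year2016 and year2019 references go away.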

Answer 2

A dplyr solution using case_when would be:

library(dplyr)
library(tidyr)

df %>% pivot_longer(cols = starts_with("year"), names_to = "year", values_to = "value") %>%
  group_by(task_id) %>%
  mutate(output2 = case_when(last(value) == 0  ~ -1,
                            last(value) == 1 & sum(value == 0) != 0 ~ 1,
                            sum(value == 0) == 0 ~ 0)) %>%
  pivot_wider(names_from = year, values_from = value)

# A tibble: 4 x 7
# Groups:   task_id [4]
  task_id output output2 year2016 year2017 year2018 year2019
  <fct>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 X001        -1      -1        1        1        1        0
2 X002        -1      -1        1        0        0        0
3 X003         1       1        0        0        1        1
4 X004         0       0        1        1        1        1

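Edit: a more detailed example

Just to complete this answer with the more detailed example described in @AaronMontgomery's very good answer, here is a solution using dplyr and case_when: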

library(dplyr)
library(tidyr)

df %>% pivot_longer(cols = starts_with("year"), names_to = "year", values_to = "value") %>%
  group_by(task_id) %>%
  mutate(output2 = case_when(last(value) == 0 & length(rle(value)$lengths) > 2 ~ -2,
                             last(value) == 0 & length(rle(value)$lengths) <= 2 ~ -1,
                             last(value) == 1 & sum(value == 0) != 0 ~ 1,
                             sum(value == 0) == 0 ~ 0)) %>%
  pivot_wider(names_from = year, values_from = value)

# A tibble: 5 x 6
# Groups:   task_id [5]
  task_id output2 year2016 year2017 year2018 year2019
  <fct>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 X001         -1        1        1        1        0
2 X002         -1        1        0        0        0
3 X003          1        0        0        1        1
4 X004          0        1        1        1        1
5 X005         -2        1        0        1        0
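To see how these case_when branches treat rows that are not in the toy data, the same conditions can be written as a plain base R function. This is only a sketch: classify_task is a hypothetical name, and it assumes the input is a 0/1 vector in chronological order.

```r
# Hypothetical helper mirroring the case_when branches above, in the same order
classify_task <- function(value) {
  n_runs <- length(rle(value)$lengths)
  last_value <- value[length(value)]
  if (last_value == 0 && n_runs > 2) {
    -2          # absent at the end after flipping more than once
  } else if (last_value == 0) {
    -1          # absent at the end
  } else if (any(value == 0)) {
    1           # present at the end, but absent in some earlier year
  } else {
    0           # present in every year
  }
}

classify_task(c(1, 0, 1, 0))   # -2
classify_task(c(0, 0, 0, 0))   # -1
classify_task(c(0, 1, 0, 1))   #  1
classify_task(c(1, 1, 1, 1))   #  0
```

Note that a row like 0, 1, 0, 1 comes out as 1 here, whereas the rule in Answer 1 would mark it as -2, because there the num_switches > 2 check applies regardless of the last value.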
