How to create a new column with the derivative of a set of time series values

I'm looking for help with R. I want to add three columns to existing data frames that contain time series data with many NA values. The data are test scores. The first new column should hold the first available test score, and the second the last available test score. The third should hold the derivative for each row: the last score minus the first score, divided by the number of tests that have passed. Some of those past tests have NA values, and I still want to count them in the divisor. NA values that come after the last available test score should not be counted.

Some explanation of my data: I have a couple of data frames that each hold test scores for different people. Each row is a person and each column is a test score. There are multiple scores per person for the same test in a data frame. Column T1 shows their first score, T2 the second score, which was gathered a week later, and so on. Some people started sooner than others and therefore have more test scores available. Some scores at the beginning and in the middle are also missing for various reasons. See the two example tables below, where the index column is the actual index of the data frame and not a separate column. Some numbers are missing from the index (like 3) because that person had only NA values in their row, which I removed. It is important to me that the index stays this way.

Example 1 (test A):

INDEX  T1  T2  T3  T4  T5  T6
1      NA  NA  NA   3   4   5
2      57  57  57  57  NA  NA
4      44  NA  NA  NA  NA  NA
5       9  11  11  17  12  NA

Example 2 (test B):

INDEX  T1  T2  T3  T4
1      NA  NA  NA  17
2      11  16  20  20
4       1  20  NA  NA
5      20  20  20  20

My goal now is to add the three columns mentioned above to these data frames. For example 1 this would look like:

INDEX  T1  T2  T3  T4  T5  T6  FirstScore  LastScore  Derivative
1      NA  NA  NA   3   4   5           3          5        0.33
2      57  57  57  57  NA  NA          57         57        0
4      44  NA  NA  NA  NA  NA          44         44        0
5       9  11  11  17  12  NA           9         12        0.6

And for example 2:

INDEX  T1  T2  T3  T4  FirstScore  LastScore  Derivative
1      NA  NA  NA  17          17         17        0
2      11  16  20  20          11         20        2.25
4       1  20  NA  NA           1         20        9.5
5      20  20  20  20          20         20        0
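
To make the divisor rule concrete, the first row of example 1 can be checked by hand in base R. This is only an illustrative sketch; the variable names are made up for the example and are not part of the question.

# Row 1 of example 1: scores for T1..T6
scores <- c(NA, NA, NA, 3, 4, 5)

# Positions that hold a score (1 = T1, 2 = T2, ...)
observed <- which(!is.na(scores))

first_score <- scores[min(observed)]
last_score  <- scores[max(observed)]

# The divisor is the position of the last available score, so leading and
# middle NAs are counted while trailing NAs are not.
derivative <- (last_score - first_score) / max(observed)

Here the last available score sits in T6, so the result is (5 - 3) / 6, which rounds to the 0.33 shown in the table. For row 5 the last score sits in T5, so the divisor is 5 and the trailing NA in T6 is ignored.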

I hope I have made myself clear and that someone can help me. Thanks in advance!

You could also do:

library(dplyr)

df1 %>%
   rowwise()%>%
   mutate(firstScore = first(na.omit(c_across(T1:T6))),
          lastScore = last(na.omit(c_across(T1:T6))),
          Derivative = (lastScore-firstScore)/max(which(!is.na(c_across(T1:T6)))))

# A tibble: 4 x 10
# Rowwise: 
  INDEX    T1    T2    T3    T4    T5    T6 firstScore lastScore Derivative
  <int> <int> <int> <int> <int> <int> <int>      <int>     <int>      <dbl>
1     1    NA    NA    NA     3     4     5          3         5      0.333
2     2    57    57    57    57    NA    NA         57        57      0    
3     4    44    NA    NA    NA    NA    NA         44        44      0    
4     5     9    11    11    17    12    NA          9        12      0.6  

Using a single pmap_* call:

library(dplyr)
library(purrr)

pmap_dfr(df1, ~{c(...) %>% t %>% as.data.frame() %>% 
    mutate(first_score = first(na.omit(c(...)[-1])),
           last_score = last(na.omit(c(...)[-1])),
           deriv = (last_score - first_score)/max(which(!is.na(c(...)[-1]))))})

  INDEX T1 T2 T3 T4 T5 T6 first_score last_score     deriv
1     1 NA NA NA  3  4  5           3          5 0.3333333
2     2 57 57 57 57 NA NA          57         57 0.0000000
3     4 44 NA NA NA NA NA          44         44 0.0000000
4     5  9 11 11 17 12 NA           9         12 0.6000000

In dplyr only, using cur_data() and avoiding rowwise(), which slows down the operations:

df1 %>% group_by(INDEX) %>%
  mutate(first_score = c_across(starts_with('T'))[min(which(!is.na(cur_data())))],
         last_score = c_across(starts_with('T'))[max(which(!is.na(cur_data()[1:6])))],
         deriv = (last_score - first_score)/max(which(!is.na(cur_data()[1:6]))))
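
Note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). A rough equivalent of the grouped approach above could look like the sketch below. It assumes dplyr 1.1.0 or later and one row per INDEX; the helper columns first_pos and last_pos are names introduced here for illustration only.

library(dplyr)

df1 <- data.frame(
  INDEX = c(1L, 2L, 4L, 5L),
  T1 = c(NA, 57L, 44L, 9L),
  T2 = c(NA, 57L, NA, 11L),
  T3 = c(NA, 57L, NA, 11L),
  T4 = c(3L, 57L, NA, 17L),
  T5 = c(4L, NA, NA, 12L),
  T6 = c(5L, NA, NA, NA)
)

res <- df1 %>%
  group_by(INDEX) %>%   # assumes each INDEX occurs once, so each group is one row
  mutate(
    # Position (1 = T1) of the first and last non-missing score in the row
    first_pos   = min(which(!is.na(unlist(pick(T1:T6))))),
    last_pos    = max(which(!is.na(unlist(pick(T1:T6))))),
    first_score = unlist(pick(T1:T6), use.names = FALSE)[first_pos],
    last_score  = unlist(pick(T1:T6), use.names = FALSE)[last_pos],
    # Trailing NAs are excluded because the divisor is the last observed position
    deriv       = (last_score - first_score) / last_pos
  ) %>%
  ungroup() %>%
  select(-first_pos, -last_pos)

Storing the positions once avoids the hardcoded [1:6] subsetting, because pick(T1:T6) never includes the newly created columns.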

I think you can use the following solution. It turned out to be surprisingly verbose and convoluted, but I think it is quite effective. I assumed that when the last available score is not in the last T column, I need to detect its index and divide the difference by that index, so that NA values after the last available score do not count. Otherwise the difference is divided by the total number of T columns.

library(dplyr)
library(purrr)

# Use the first sample data frame (df1, defined under "Sample data").
df1 %>%
  select(T1:T6) %>%
  # For each row, keep the first and last non-NA score.
  pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(c("First", "Last")) %>%
  bind_cols(df1) %>%
  relocate(First, Last, .after = last_col()) %>%
  rowwise() %>%
  mutate(Derivative = ifelse(!is.na(T6) & T6 == Last, (Last - First)/(length(df1)-1), 
                             (Last - First)/last(which(c_across(T1:T6) == Last))))


# First Sample Data
# A tibble: 4 x 10
# Rowwise: 
  INDEX    T1    T2    T3    T4    T5    T6 First  Last Derivative
  <int> <int> <int> <int> <int> <int> <int> <int> <int>      <dbl>
1     1    NA    NA    NA     3     4     5     3     5      0.333
2     2    57    57    57    57    NA    NA    57    57      0    
3     4    44    NA    NA    NA    NA    NA    44    44      0    
4     5     9    11    11    17    12    NA     9    12      0.6  

Second Sample Data

df2 %>%
  select(T1:T4) %>%
  pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(c("First", "Last")) %>%
  bind_cols(df2) %>%
  relocate(First, Last, .after = last_col()) %>%
  rowwise() %>%
  mutate(Derivative = ifelse(!is.na(T4) & T4 == Last, (Last - First)/(length(df2)-1), 
                             (Last - First)/last(which(c_across(T1:T4) == Last))))

# A tibble: 4 x 8
# Rowwise: 
  INDEX    T1    T2    T3    T4 First  Last Derivative
  <int> <int> <int> <int> <int> <int> <int>      <dbl>
1     1    NA    NA    NA    17    17    17       0   
2     2    11    16    20    20    11    20       2.25
3     4     1    20    NA    NA     1    20       9.5 
4     5    20    20    20    20    20    20       0  

Here's a tidyverse solution that does not hardcode the number of tests in the calculation. First I pivot to long format, then extract the statistics for each INDEX.

library(tidyverse)
df1 %>%
  pivot_longer(-INDEX, names_to = "time", names_prefix = "T", names_transform = list(time = as.integer)) %>%
  filter(!is.na(value)) %>%
  group_by(INDEX) %>%
  summarize(FirstScore = first(value), LastScore = last(value), divisor = max(time)) %>%
  mutate(Derivative = (LastScore - FirstScore) / divisor) %>%
  right_join(df1) %>%
  select(INDEX, T1:T6, FirstScore, LastScore, Derivative)

This gives the following output:

# A tibble: 4 x 10
  INDEX    T1    T2    T3    T4    T5    T6 FirstScore LastScore Derivative
  <int> <int> <int> <int> <int> <int> <int>      <int>     <int>      <dbl>
1     1    NA    NA    NA     3     4     5          3         5      0.333
2     2    57    57    57    57    NA    NA         57        57      0    
3     4    44    NA    NA    NA    NA    NA         44        44      0    
4     5     9    11    11    17    12    NA          9        12      0.6  

Output for the second data set. The calculation itself is unchanged, but df1 must be replaced with df2 in both places (the start of the pipe and the right_join) and T1:T6 with T1:T4 in select(); otherwise the scores from df2 end up joined to the T columns of df1:

# A tibble: 4 x 8
  INDEX    T1    T2    T3    T4 FirstScore LastScore Derivative
  <int> <int> <int> <int> <int>      <int>     <int>      <dbl>
1     1    NA    NA    NA    17         17        17       0   
2     2    11    16    20    20         11        20       2.25
3     4     1    20    NA    NA          1        20       9.5 
4     5    20    20    20    20         20        20       0   

Sample data

df1 <- data.frame(
       INDEX = c(1L, 2L, 4L, 5L),
          T1 = c(NA, 57L, 44L, 9L),
          T2 = c(NA, 57L, NA, 11L),
          T3 = c(NA, 57L, NA, 11L),
          T4 = c(3L, 57L, NA, 17L),
          T5 = c(4L, NA, NA, 12L),
          T6 = c(5L, NA, NA, NA)
)

df2 <- data.frame(
       INDEX = c(1L, 2L, 4L, 5L),
          T1 = c(NA, 11L, 1L, 20L),
          T2 = c(NA, 16L, 20L, 20L),
          T3 = c(NA, 20L, NA, 20L),
          T4 = c(17L, 20L, NA, 20L)
       )
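
For comparison, the same three columns can be computed in base R without hardcoding the column range. This is a minimal sketch, assuming every score column is named "T" followed by digits and that no row consists only of NA values (as stated in the question). The helper name add_score_trend is hypothetical and does not come from any of the answers.

# Hypothetical helper: appends FirstScore, LastScore and Derivative to df.
add_score_trend <- function(df, pattern = "^T[0-9]+$") {
  # Matrix of the score columns only, in their original order
  scores <- as.matrix(df[grep(pattern, names(df))])

  # Column position of the first and last non-NA score in each row
  first_pos <- unname(apply(scores, 1, function(x) min(which(!is.na(x)))))
  last_pos  <- unname(apply(scores, 1, function(x) max(which(!is.na(x)))))

  # Matrix indexing with (row, column) pairs picks one value per row
  rows <- seq_len(nrow(scores))
  df$FirstScore <- scores[cbind(rows, first_pos)]
  df$LastScore  <- scores[cbind(rows, last_pos)]

  # Dividing by the last observed position counts earlier NAs
  # but ignores NAs after the last available score.
  df$Derivative <- (df$LastScore - df$FirstScore) / last_pos
  df
}

# Second sample data set, repeated so the sketch runs on its own
df2 <- data.frame(
  INDEX = c(1L, 2L, 4L, 5L),
  T1 = c(NA, 11L, 1L, 20L),
  T2 = c(NA, 16L, 20L, 20L),
  T3 = c(NA, 20L, NA, 20L),
  T4 = c(17L, 20L, NA, 20L)
)

res2 <- add_score_trend(df2)

Because the function only appends columns, the rows and their row names stay exactly as they were, which matters when the index must be preserved. The same call works on df1 without any changes.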
