简体   繁体   中英

How to create a new column with the derivative of a set of time serie values

I'm looking for help with R. I want to add three columns to existing data frames that contain time series data and have a lot of NA values. The data is about test scores. The first column I want to add is the first test score available. In the second column, I want the last test score available. In the third column, I want to calculate the derivative for each row by dividing the difference between the first and last scores by the number of tests that have passed. Important is that some of these past tests have NA values but I still want to include these when dividing. However, NA values that come after the last available test score I don't want to count.

Some explanation of my data: A have a couple of data frames that all have test scores of different people. The different people are the rows and each column represents a test score. There are multiple test scores per person for the same test in the data frame. Column T1 shows their first score, T2 the second score, which was gathered a week later, and so on. Some people have started sooner than others and therefore have more test scores available. Also, some scores at the beginning and the middle are missing for various reasons. See the two example tables below where the index column is the actual index of the data frame and not a separate column. Some numbers are missing from the index (like 3) because this person had only NA values in their row, which I removed. It is important for me that the index stays this way.

Example 1 (test A):

INDEX T1 T2 T3 T4 T5 T6
1 NA NA NA 3 4 5
2 57 57 57 57 NA NA
4 44 NA NA NA NA NA
5 9 11 11 17 12 NA

Example 2 (test B):

INDEX T1 T2 T3 T4
1 NA NA NA 17
2 11 16 20 20
4 1 20 NA NA
5 20 20 20 20

My goal now is to add to these data frames the three columns mentioned before. For example 1 this would look like:

INDEX T1 T2 T3 T4 T5 T6 FirstScore LastScore Derivative
1 NA NA NA 3 4 5 3 5 0.33
2 57 57 57 57 NA NA 57 57 0
4 44 NA NA NA NA NA 44 44 0
5 9 11 11 17 12 NA 9 12 0.6

And for example 2:

INDEX T1 T2 T3 T4 FirstScore LastScore Derivative
1 NA NA NA 17 17 17 0
2 11 16 20 20 11 20 2.25
4 1 20 NA NA 1 20 9.5
5 20 20 20 20 20 20 0

I hope I have made myself clear and that someone can help me, thanks in advance!

You could also do:

df1 %>%
   rowwise()%>%
   mutate(firstScore = first(na.omit(c_across(T1:T6))),
          lastScore = last(na.omit(c_across(T1:T6))),
          Derivative = (lastScore-firstScore)/max(which(!is.na(c_across(T1:T6)))))

# A tibble: 4 x 10
# Rowwise: 
  INDEX    T1    T2    T3    T4    T5    T6 firstScore lastScore Derivative
  <int> <int> <int> <int> <int> <int> <int>      <int>     <int>      <dbl>
1     1    NA    NA    NA     3     4     5          3         5      0.333
2     2    57    57    57    57    NA    NA         57        57      0    
3     4    44    NA    NA    NA    NA    NA         44        44      0    
4     5     9    11    11    17    12    NA          9        12      0.6  

Using one pmap_*

pmap_dfr(df1, ~{c(...) %>% t %>% as.data.frame() %>% 
    mutate(first_score = first(na.omit(c(...)[-1])),
           last_score = last(na.omit(c(...)[-1])),
           deriv = (last_score - first_score)/max(which(!is.na(c(...)[-1]))))})

  INDEX T1 T2 T3 T4 T5 T6 first_score last_score     deriv
1     1 NA NA NA  3  4  5           3          5 0.3333333
2     2 57 57 57 57 NA NA          57         57 0.0000000
3     4 44 NA NA NA NA NA          44         44 0.0000000
4     5  9 11 11 17 12 NA           9         12 0.6000000

in dplyr only using cur_data without rowwise() which slows down the operations

df1 %>% group_by(INDEX) %>%
  mutate(first_score = c_across(starts_with('T'))[min(which(!is.na(cur_data())))],
         last_score = c_across(starts_with('T'))[max(which(!is.na(cur_data()[1:6])))],
         deriv = (last_score - first_score)/max(which(!is.na(cur_data()[1:6]))))

I think you can use the following solution. It surprisingly turned out to be a little verbose and convoluted but I think it is quite effective. I assumed that if the Last available score is not actually the last T , so I need to detect its index and divide the difference by it meaning NA values after the last one do not count. Otherwise it is divided by the number of all T s available.

library(dplyr)
library(purrr)

df %>%
  select(T1:T6) %>%
  pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(c("First", "Last")) %>%
  bind_cols(df) %>%
  relocate(First, Last, .after = last_col()) %>%
  rowwise() %>%
  mutate(Derivative = ifelse(!is.na(T6) & T6 == Last, (Last - First)/(length(df)-1), 
                             (Last - First)/last(which(c_across(T1:T6) == Last))))


# First Sample Data
# A tibble: 4 x 10
# Rowwise: 
  INDEX    T1    T2    T3    T4    T5    T6 First  Last Derivative
  <int> <int> <int> <int> <int> <int> <int> <int> <int>      <dbl>
1     1    NA    NA    NA     3     4     5     3     5      0.333
2     2    57    57    57    57    NA    NA    57    57      0    
3     4    44    NA    NA    NA    NA    NA    44    44      0    
4     5     9    11    11    17    12    NA     9    12      0.6  

Second Sample Data

df2 %>%
  select(T1:T4) %>%
  pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(c("First", "Last")) %>%
  bind_cols(df2) %>%
  relocate(First, Last, .after = last_col()) %>%
  rowwise() %>%
  mutate(Derivative = ifelse(!is.na(T4) & T4 == Last, (Last - First)/(length(df2)-1), 
                             (Last - First)/last(which(c_across(T1:T4) == Last))))

# A tibble: 4 x 8
# Rowwise: 
  INDEX    T1    T2    T3    T4 First  Last Derivative
  <int> <int> <int> <int> <int> <int> <int>      <dbl>
1     1    NA    NA    NA    17    17    17       0   
2     2    11    16    20    20    11    20       2.25
3     4     1    20    NA    NA     1    20       9.5 
4     5    20    20    20    20    20    20       0  

Here's a tidyverse solution with no hardcoding. First I pivot longer, then extract the stats for each INDEX.

library(tidyverse)
df1 %>%
  pivot_longer(-INDEX, names_to = "time", names_prefix = "T", names_transform = list(time = as.integer)) %>%
  filter(!is.na(value)) %>%
  group_by(INDEX) %>%
  summarize(FirstScore = first(value), LastScore = last(value), divisor = max(time)) %>%
  mutate(Derivative = (LastScore - FirstScore) / divisor) %>%
  right_join(df1) %>%
  select(INDEX, T1:T6, FirstScore, LastScore, Derivative)

for this output:

# A tibble: 4 x 10
  INDEX    T1    T2    T3    T4    T5    T6 FirstScore LastScore Derivative
  <int> <int> <int> <int> <int> <int> <int>      <int>     <int>      <dbl>
1     1    NA    NA    NA     3     4     5          3         5      0.333
2     2    57    57    57    57    NA    NA         57        57      0    
3     4    44    NA    NA    NA    NA    NA         44        44      0    
4     5     9    11    11    17    12    NA          9        12      0.6  

Output for 2nd data, with no changes to the code:

# A tibble: 4 x 10
  INDEX    T1    T2    T3    T4    T5    T6 FirstScore LastScore Derivative
  <int> <int> <int> <int> <int> <int> <int>      <int>     <int>      <dbl>
1     1    NA    NA    NA     3     4     5         17        17       0   
2     2    57    57    57    57    NA    NA         11        20       2.25
3     4    44    NA    NA    NA    NA    NA          1        20       9.5 
4     5     9    11    11    17    12    NA         20        20       0   

Sample data

df1 <- data.frame(
       INDEX = c(1L, 2L, 4L, 5L),
          T1 = c(NA, 57L, 44L, 9L),
          T2 = c(NA, 57L, NA, 11L),
          T3 = c(NA, 57L, NA, 11L),
          T4 = c(3L, 57L, NA, 17L),
          T5 = c(4L, NA, NA, 12L),
          T6 = c(5L, NA, NA, NA)
)

df2 <- data.frame(
       INDEX = c(1L, 2L, 4L, 5L),
          T1 = c(NA, 11L, 1L, 20L),
          T2 = c(NA, 16L, 20L, 20L),
          T3 = c(NA, 20L, NA, 20L),
          T4 = c(17L, 20L, NA, 20L)
       )

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM