Using diff() in R ignoring NA

I have an R data frame df with the following content:

Serial N         year         current
   B              10            14
   B              10            16
   B              11            10
   B              11            NA
   B              11            15
   C              12            11
   C              12             9
   C              12            13
   C              12            17
   .              .              .

I would like to find the difference between each consecutive pair of current values within the same Serial N. This is the code I wrote, but I am getting some strange results:

library(data.table)
setDT(df)[,mydiff:=diff(df$current),by=Serial N]   
    print(length(df$current))

The output I get for that column is quite strange:

2 6  NA NA NA 2 6  NA NA NA 

What I would actually like to have is:

Serial N         year         current      mydiff
   B              10            14         
   B              10            16         16-14=2
   B              11            10         10-16=-6
   B              11            NA            NA
   B              11            15         15-10=5
   C              12            11
   C              12             9         9-11=-2    
   C              12           -13        -13-9=-22
   C              12            17         17-(-13)=30
   .              .              .

Is diff the right tool for this? If not, how can I tackle this (especially without using loops)?

By applying

aggregate(current ~ Serial.N ,df1, diff)

one obtains

  Serial.N current.1 current.2 current.3
1        B         2        -6         5
2        C        -2         4         4

which corresponds to

B:    16 - 14 =  2
      10 - 16 = -6
      15 - 10 =  5
C:     9 - 11 = -2
      13 -  9 =  4
      17 - 13 =  4

So the output of diff() combined with aggregate() seems to make sense to me. I may not have understood exactly why you expect the output that you describe.
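
As a side note on why there are three differences per group: diff() returns one element fewer than its input, and the formula interface of aggregate() drops rows containing NA by default (na.action = na.omit). A minimal sketch using the group B values from the question (the variable names b, with_na and without_na are only for illustration):

```r
# Values of `current` for Serial N "B", taken from the question
b <- c(14L, 16L, 10L, NA, 15L)

# diff() yields length(x) - 1 values; an NA affects both neighbouring differences
with_na <- diff(b)
with_na
# 2 -6 NA NA

# Dropping the NA first gives the three differences reported by aggregate()
without_na <- diff(b[!is.na(b)])
without_na
# 2 -6 5
```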


Edit

If the third entry of current in Serial N C is -13 and not 13 (the data in the OP is contradictory), the result is

aggregate(current ~ Serial.N ,df1, diff)
#   Serial.N current.1 current.2 current.3
# 1        B         2        -6         5
# 2        C        -2       -22        30

which seems to be closer to the desired output.


Edit 2

To add a column mydiff to the data.frame that takes the difference between consecutive values of the same Serial N while ignoring NA values we could use

df1$mydiff <- with(df1, ave(current, Serial.N, 
                   FUN = function(x) c(NA, diff(na.omit(x)))))

This will lead to a warning ("...not a multiple of replacement length"), but the result will be close to the expected output:

#  Serial.N year current mydiff
#1        B   10      14     NA
#2        B   10      16      2
#3        B   11      10     -6
#4        B   11      NA      5
#5        B   11      15     NA
#6        C   12      11     NA
#7        C   12       9     -2
#8        C   12     -13    -22
#9        C   12      17     30

The values in the mydiff column are correct, but the NA that belongs in row 4 is missing: the 5 lands in row 4 instead of row 5, and row 5 is filled with a recycled NA. That is because na.omit() shortens the vector before diff() is applied, so with this approach we cannot ignore the NAs and at the same time preserve their positions.
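
If the NA should stay in its own row, one possible workaround is to compute the differences on the non-NA values only and write them back to the non-NA positions. This is only a sketch: the helper name diff_skip_na is made up for this illustration, and df1 is rebuilt here with the -13 variant of the data (Serial.N as character rather than factor).

```r
# Data from the question, using the -13 variant of the third C value
df1 <- data.frame(
  Serial.N = c("B", "B", "B", "B", "B", "C", "C", "C", "C"),
  year     = c(10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 12L),
  current  = c(14L, 16L, 10L, NA, 15L, 11L, 9L, -13L, 17L)
)

# Hypothetical helper: difference to the previous non-NA value.
# The result is NA where the input is NA and at the first non-NA value.
diff_skip_na <- function(x) {
  out <- rep(NA, length(x))        # same length as the group
  ok <- !is.na(x)                  # positions that hold a value
  out[ok] <- c(NA, diff(x[ok]))    # fill only those positions
  out
}

df1$mydiff <- with(df1, ave(current, Serial.N, FUN = diff_skip_na))
df1$mydiff
# NA 2 -6 NA 5 NA -2 -22 30
```

Because the helper always returns a vector of the same length as its group, ave() has nothing to recycle and the warning does not appear.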

Hope this helps.


data

df1 <- structure(list(Serial.N = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 
        2L, 2L, 2L), .Label = c("B", "C"), class = "factor"), year = c(10L, 
        10L, 11L, 11L, 11L, 12L, 12L, 12L, 12L), current = c(14L, 16L, 
        10L, NA, 15L, 11L, 9L, -13L, 17L)), .Names = c("Serial.N", "year", 
        "current"), class = "data.frame", row.names = c(NA, -9L))
