
Adding a base year index to R dataframe with multiple groups

I have a yearly time series dataframe with a few grouping variables, and I need to add an index column that is based on a particular year.

df <- data.frame(YEAR = c(2000,2001,2002,2000,2001,2002), 
                 GRP = c("A","A","A","B","B","B"),
                 VAL = sample(6))

I want to make a simple index of the variable VAL: each value divided by the value in the base year, say 2000:

df$VAL.IND <- df$VAL/df$VAL[df$YEAR == 2000]

This is not right, as it does not respect the grouping variable GRP. I tried with plyr but could not make it work.

In my actual problem I have several grouping variables with time series of varying length, so I'm looking for a fairly general solution.

We can create 'VAL.IND' by doing the calculation within each level of the grouping variable ('GRP'). There are several ways to do this.

One option is data.table. We convert the 'data.frame' to a 'data.table' with setDT(df) and then, grouped by 'GRP', divide 'VAL' by the 'VAL' that corresponds to the 'YEAR' value of 2000.

 library(data.table)
 setDT(df)[, VAL.IND := VAL/VAL[YEAR==2000], by = GRP]

NOTE: It is not entirely clear what the base YEAR should be. In the example, both the 'A' and 'B' groups include 'YEAR' 2000. If the OP instead meant the earliest YEAR in each group (assuming it is a numeric column), VAL/VAL[YEAR==2000] in the code above can be replaced with VAL/VAL[which.min(YEAR)].
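The question mentions several grouping variables. The same grouped division should carry over by listing every grouping column in by. The sketch below assumes a second grouping column named REGION and a variable base_year; both are made up for illustration and are not part of the original data.

```r
library(data.table)

# Hypothetical data: REGION is an assumed second grouping column.
dt <- data.table(
  YEAR   = rep(c(2000, 2001, 2002), times = 4),
  GRP    = rep(c("A", "A", "B", "B"), each = 3),
  REGION = rep(c("N", "S", "N", "S"), each = 3),
  VAL    = c(2, 4, 6, 5, 10, 15, 1, 3, 2, 8, 4, 2)
)

base_year <- 2000  # assumed base year, kept in one place

# Divide each VAL by the base-year VAL of its own GRP/REGION combination.
dt[, VAL.IND := VAL / VAL[YEAR == base_year], by = .(GRP, REGION)]
```

Each GRP/REGION combination gets its own denominator, so the index is 1 in the base year of every combination.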


Or you can use similar code with dplyr. We group by 'GRP' and use mutate to create 'VAL.IND'.

 library(dplyr)
 df %>%
    group_by(GRP) %>%
    mutate(VAL.IND = VAL/VAL[YEAR==2000])

Here too, if needed, replace VAL/VAL[YEAR==2000] with VAL/VAL[which.min(YEAR)].
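To illustrate when the which.min(YEAR) variant matters, here is a small sketch with made-up data in which group 'B' starts in 2001 instead of 2000. The data frame df2 and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data: group B has no row for 2000, its series starts in 2001.
df2 <- data.frame(
  YEAR = c(2000, 2001, 2002, 2001, 2002, 2003),
  GRP  = c("A", "A", "A", "B", "B", "B"),
  VAL  = c(2, 4, 6, 5, 10, 20)
)

res <- df2 %>%
  group_by(GRP) %>%
  # Use the earliest year of each group as that group's base year.
  mutate(VAL.IND = VAL / VAL[which.min(YEAR)]) %>%
  ungroup()
```

With this variant every group is indexed to its own first observation, so the index equals 1 in 2000 for 'A' and in 2001 for 'B'.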


A base R option uses split/unsplit. We split the dataset by the 'GRP' column to get a list of data.frames, loop through that list with lapply, create the new column with transform (or within), and then combine the list back into a single data.frame with unsplit.

  unsplit(lapply(split(df, df$GRP), function(x) 
          transform(x, VAL.IND= VAL/VAL[YEAR==2000])), df$GRP)

Note that we can also use do.call(rbind, ...) instead of unsplit, but I prefer unsplit because it returns the rows in the same order as the original dataset.
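The difference in row order only shows up when the rows are not already sorted by group. A minimal sketch with deliberately shuffled, made-up data (the helper name add_index is an assumption for illustration):

```r
# Hypothetical data with the rows of groups A and B interleaved.
df <- data.frame(
  YEAR = c(2001, 2000, 2000, 2002, 2001, 2002),
  GRP  = c("A", "B", "A", "B", "B", "A"),
  VAL  = c(4, 5, 2, 15, 10, 6)
)

# Add the index column to one group's data.frame.
add_index <- function(x) transform(x, VAL.IND = VAL / VAL[YEAR == 2000])

pieces <- lapply(split(df, df$GRP), add_index)

via_unsplit <- unsplit(pieces, df$GRP)  # rows return in the original order
via_rbind   <- do.call(rbind, pieces)   # rows return grouped by GRP
```

Here via_unsplit lines up row by row with df, whereas via_rbind lists all 'A' rows first and then all 'B' rows.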

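Here's another base R approach built around by(). Note that the per-group results are concatenated in group order, so this relies on the rows already being sorted by GRP, as they are in the example: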

df$VAL.IND <- do.call(c, by(df, df$GRP, function(g) g$VAL / g$VAL[which.min(g$YEAR)]))
df
##   YEAR GRP VAL   VAL.IND
## 1 2000   A   3 1.0000000
## 2 2001   A   1 0.3333333
## 3 2002   A   2 0.6666667
## 4 2000   B   6 1.0000000
## 5 2001   B   5 0.8333333
## 6 2002   B   4 0.6666667
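If the rows are not sorted by group, a lookup with match() is one base R alternative that does not depend on row order. The sketch below uses made-up, shuffled data and an assumed base_year variable; group 'C' deliberately has no base-year row, so its index comes out as NA.

```r
# Hypothetical data: shuffled rows, and group C has no row for 2000.
df <- data.frame(
  YEAR = c(2001, 2000, 2000, 2002, 2001, 2001),
  GRP  = c("A", "B", "A", "B", "B", "C"),
  VAL  = c(4, 5, 2, 15, 10, 7)
)

base_year <- 2000  # assumed base year

# One row per group holding that group's base-year value.
base <- df[df$YEAR == base_year, c("GRP", "VAL")]

# Look up each row's base value by group; groups without one get NA.
df$VAL.IND <- df$VAL / base$VAL[match(df$GRP, base$GRP)]
```

With several grouping variables, the same idea should work by matching on a combined key, for example one built with paste() or interaction() over the grouping columns.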


 