简体   繁体   中英

How to create a new variable that takes the first observation of a different variable?

Suppose I have a data frame with three variables as below. How do I create a new variable that for each group takes the first observation of x ?

group  year   x     
1      2000   3    
1      2001   4    
2      2000   1    
2      2001   3    
3      2000   5     
3      2001   2    

I want to to create something like this:

group  year   x    y 
1      2000   3    3
1      2001   4    3
2      2000   1    1
2      2001   3    1
3      2000   5    5 
3      2001   2    5

Set up data for example:

dd <- data.frame(group=rep(1:3,each=2),
                year=rep(2000:2001,3),
                x=c(3,4,1,3,5,2))

In base R, use ave() . By default this finds the group average (rather than the first value), but we can use the FUN argument to ask it to select the first value instead.

dd$y <- ave(dd$x, dd$group, FUN=function(x) x[1])
## or
dd <- transform(dd,y=ave(x, group, FUN=function(x) x[1])

(alternatively could use FUN=function(x) head(x,1) )

In tidyverse,

library(dplyr)
dd <- dd %>%
        group_by(group) %>%
        mutate(y=first(x))

@lmo points out another alternative in comments:

library(data.table)
setDT(dd)[, y := first(x), by=group]

You can find nearly endless discussion of the relative merits of these three major approaches (base R vs tidyverse vs data.table) elsewhere (on StackOverflow and on the interwebs generally).

Using package plyr :

df <- data.frame(group=c(1,1,2,2,3,3), 
                 year=c(2000,2001,2000,2001,2000,2001), 
                 x=c(3,4,1,3,5,2))

library(plyr)
ddply(df, .(group), transform, y=x[1])

A simple version in base R

### Your data
df = read.table(text="group  year   x     
1      2000   3    
1      2001   4    
2      2000   1    
2      2001   3    
3      2000   5     
3      2001   2",
header=TRUE)

df$y = aggregate(as.numeric(row.names(df)), list(df$group), min)$x[df$group]
df
  group year x y
1     1 2000 3 1
2     1 2001 4 1
3     2 2000 1 3
4     2 2001 3 3
5     3 2000 5 5
6     3 2001 2 5

Here's yet another way, using base R:

dd <- data.frame(group = rep(1:3, each = 2),
                 year = rep(2000:2001, 3),
                 x = c(3, 4, 1, 3, 5, 2))

transform(dd, y = unsplit(tapply(x, group, function(x) x[1]), group))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM