
Creating a binary variable based on the maximum of another variable by group using R

I want to create a new binary column ( choice ) that equals 1 in the row where U is at its maximum within each id_choice group, and 0 in all other rows.

Take this sample data for example:

 sample_df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), altern = c(1L,2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), time = c(0.60622924522549, 0.685763204423431,1.04445466206904, 2.0823687526597, 0.470385492467578, 0.278410094130233,4.3933007737356, 1.30150082775573, 0.164433239189492), cost = c(0.775815897061855,3.65632847698275, 0.853480119066832, 4.18372276257574, 0.386247047617908,0.0499751011513356, 0.50605264042165, 0.309115653465334, 1.63340498409165), id_choice = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), U = c(-0.384172837567259,0.912405259429594, -0.00977885942620305, -1.82630532041359, -0.228713211633138,1.77768082832823, -1.7172001044961, -0.0197827158096625, 0.3408726361911)), row.names = c(NA, 9L), class = "data.frame") 

id altern      time      cost id_choice            U
1  1      1 0.6062292 0.7758159         1 -0.384172838
2  1      2 0.6857632 3.6563285         1  0.912405259
3  1      3 1.0444547 0.8534801         1 -0.009778859
4  2      1 2.0823688 4.1837228         2 -1.826305320
5  2      2 0.4703855 0.3862470         2 -0.228713212
6  2      3 0.2784101 0.0499751         2  1.777680828
7  3      1 4.3933008 0.5060526         3 -1.717200104
8  3      2 1.3015008 0.3091157         3 -0.019782716
9  3      3 0.1644332 1.6334050         3  0.340872636

So far, I have done the following:

  1. First, I loop over the id_choice groups (this is the slow part) to find which alternative has the maximum value of U in each group.
  2. Second, I generate the binary variable using ifelse to flag the selected alternative.

# First: getting the position of the maximum utility (U) within each group
for (i in 1:max(sample_df$id_choice)) {
  sample_df$choice[sample_df$id_choice==i]<-which.max(sample_df$U[sample_df$id_choice==i])
}

# Second: Generating the binary output for the choice decision
sample_df$choice<-ifelse(sample_df$altern==sample_df$choice,1,0)

As a result, the first individual (the first three observations) gets a 1 in choice in the row where U equals 0.912405259 , the second individual gets a 1 where U equals 1.777680828 , and so on.

id altern      time      cost id_choice            U choice
1  1      1 0.6062292 0.7758159         1 -0.384172838      0
2  1      2 0.6857632 3.6563285         1  0.912405259      1
3  1      3 1.0444547 0.8534801         1 -0.009778859      0
4  2      1 2.0823688 4.1837228         2 -1.826305320      0
5  2      2 0.4703855 0.3862470         2 -0.228713212      0
6  2      3 0.2784101 0.0499751         2  1.777680828      1
7  3      1 4.3933008 0.5060526         3 -1.717200104      0
8  3      2 1.3015008 0.3091157         3 -0.019782716      0
9  3      3 0.1644332 1.6334050         3  0.340872636      1

As a side note, I am generating data to run simulations for estimating a multinomial logit (or conditional logit) model. The code above is really time-consuming because it uses a loop, which I know is strongly discouraged in R. Could someone suggest a vectorized way to perform this operation? Many thanks in advance!

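You could try the following. Note that it assumes the rows are already sorted by id_choice (as in the sample data), because unlist() returns the values in group order, and that tied maxima within a group would each receive a 1: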

  id_choice_split <- split(sample_df$U,sample_df$id_choice)
  sample_df$choice <- unlist(lapply(id_choice_split, function(uValues) as.numeric(uValues == max(uValues))))
  sample_df
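
As a variation on the same idea, here is a minimal base R sketch using ave(), which is not part of the answer above. It rebuilds only the three columns needed from the question's sample data, and it assumes that exactly one alternative should be flagged per group, so which.max() is used to pick the first maximum if there are ties.

```r
# Rebuild only the columns needed for the illustration (values from the question).
sample_df <- data.frame(
  id_choice = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  altern    = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  U = c(-0.384172837567259, 0.912405259429594, -0.00977885942620305,
        -1.82630532041359, -0.228713211633138, 1.77768082832823,
        -1.7172001044961, -0.0197827158096625, 0.3408726361911)
)

# Within each id_choice group, flag the first row at which U is largest.
# ave() writes the group results back in the original row order,
# so the data frame does not need to be sorted by id_choice.
sample_df$choice <- ave(
  sample_df$U, sample_df$id_choice,
  FUN = function(u) as.numeric(seq_along(u) == which.max(u))
)

sample_df$choice
# [1] 0 1 0 0 0 1 0 0 1
```

If tied maxima should all be flagged instead, the function passed to FUN could be replaced with function(u) as.numeric(u == max(u)).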

