简体   繁体   中英

Creating a new variable based on the order of values in another variable in R

Let's say I have data on two firms that are each expanding to new cities. I have the dates of each additional store added in each city:

df <- data.frame(
    firm = c(rep(1,4), rep(2,4)),
    date = as.Date(c('2017-01-01', '2017-03-01', '2017-05-01',
                     '2017-06-01', '2017-02-01', '2017-04-01',
                     '2017-05-01', '2017-06-01')),
    city = c('New York', 'DC', 'New York', 'Atlanta', 'DC', 'DC',
             'Chicago', 'Atlanta'),
    numStores = c(1, 1, 2, 1, 1, 2, 1, 1))

I want to create a column of integers that tells me the order that each firm entered each city based on the date variable. The result should be this:

df$cityOrder = c(1,2,1,3,1,1,2,3)

Using dplyr ...

library(dplyr)
df %>% left_join(df %>% filter(numStores==1) %>% 
                        group_by(firm) %>% 
                        mutate(cityOrder=order(date)) %>% 
                        select(-numStores,-date))

  firm       date     city numStores cityOrder
1    1 2017-01-01 New York         1         1
2    1 2017-03-01       DC         1         2
3    1 2017-05-01 New York         2         1
4    1 2017-06-01  Atlanta         1         3
5    2 2017-02-01       DC         1         1
6    2 2017-04-01       DC         2         1
7    2 2017-05-01  Chicago         1         2
8    2 2017-06-01  Atlanta         1         3

One way:

library(dplyr)
first_entry <- df %>% 
      group_by(firm, city) %>% 
      summarize(first = min(date)) %>% 
      group_by(firm) %>% 
      mutate(order = order(first)) %>% 
      select(-first)

left_join(df, first_entry)

  firm       date     city numStores order
1    1 2017-01-01 New York         1     1
2    1 2017-03-01       DC         1     2
3    1 2017-05-01 New York         2     1
4    1 2017-06-01  Atlanta         1     3
5    2 2017-02-01       DC         1     1
6    2 2017-04-01       DC         2     1
7    2 2017-05-01  Chicago         1     2
8    2 2017-06-01  Atlanta         1     3

You can us factors with tapply in base R:

within(df,cityOrder <- unlist(tapply(city, firm, function(x) 
  as.numeric(factor(x,unique(x))))))

#   firm       date     city numStores cityOrder
# 1    1 2017-01-01 New York         1         1
# 2    1 2017-03-01       DC         1         2
# 3    1 2017-05-01 New York         2         1
# 4    1 2017-06-01  Atlanta         1         3
# 5    2 2017-02-01       DC         1         1
# 6    2 2017-04-01       DC         2         1
# 7    2 2017-05-01  Chicago         1         2
# 8    2 2017-06-01  Atlanta         1         3

(IF your dates are sorted for each firm as they are in this example)

A dplyr solution.

library(dplyr)

# create a summary by firm by city 
df_summary <- df %>%
  # sequence by earliest date by firm and city
  group_by(firm, city) %>%
  summarize(min_date = min(date)) %>%
  # number (using row_number)  *within* the firm group, arranged by date
  arrange(firm, min_date) %>%
  group_by(firm) %>%
  mutate(cityOrder = row_number()) %>%
  # drop the date column
  select(firm, city, cityOrder)

print(df_summary)
# A tibble: 6 x 3
# Groups:   firm [2]
# firm city     cityOrder
# <dbl> <fct>        <int>
# 1    1. New York         1
# 2    1. DC               2
# 3    1. Atlanta          3
# 4    2. DC               1
# 5    2. Chicago          2
# 6    2. Atlanta          3

## join the summary city order to the original
df_new <- df %>%
  left_join(df_summary, by = c(firm = "firm", city = "city")) 
print (df_new$cityOrder) 
# [1] 1 2 1 3 1 1 2 3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM