
Loop over data.frame columns to generate dummy variable in R

I'm struggling with generating a variable for my current project. I'm using R version 4.0.1 on Windows.

Data description

I have unbalanced panel data in a data.table containing 243 variables (before running the commands) and 8,278 observations. The data is uniquely identified by ID and period. Columns 69:135 hold region dummies (2 = yes, the company operates in the region; 1 = no, the company does not operate in the region), and columns 178:244 hold lagged versions of those same variables, with the lag taken within each ID. Here is a small example of the data:

library(data.table)

dat <-
data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
region1 = as.factor(c(NA, NA, 2, 1, NA, 1, 2, 2, 1, NA, 1, rep(NA, 7))),
region2 = as.factor(c(1, 2, 1, 1, NA, NA, 2, 1, 2, 1, 1, rep(NA, 7))),
industry = as.factor(c(rep("Finance", 3), "Culture", rep("Nutrition", 5), rep("Finance", 2), rep("Medicine", 7))),
number_employees = as.numeric(c(10, 10, 12, 2, 2, 4, 4, 4, 4, 18, 25, 100, 110, 108, 108, 120, 120, 120)),
lag_region1 = as.factor(c(rep(NA, 6), 1, 2, 2, rep(NA, 9))),
lag_region2 = as.factor(c(NA, 1, 2, rep(NA, 4), 2, 1, NA, 1, rep(NA, 7))))


#this gives (last 8 rows are not printed):
#      id period region1 region2  industry number_employees lag_region1 lag_region2
# 1: C001      1    <NA>       1   Finance               10        <NA>        <NA>
# 2: C001      2    <NA>       2   Finance               10        <NA>           1
# 3: C001      3       2       1   Finance               12        <NA>           2
# 4: C002      2       1       1   Culture                2        <NA>        <NA>
# 5: C003      1    <NA>    <NA> Nutrition                2        <NA>        <NA>
# 6: C003      4       1    <NA> Nutrition                4        <NA>        <NA>
# 7: C003      5       2       2 Nutrition                4           1        <NA>
# 8: C003      6       2       1 Nutrition                4           2           2
# 9: C003     10       1       2 Nutrition                4           2           1
#10: C004      3    <NA>       1   Finance               18        <NA>        <NA>

Desired outcome

I want to generate a new dummy variable left_region which equals "yes" when a company has left at least one region in the respective period. My approach is to compare column 69 to column 178, 70 to 179, 71 to 180, and so on. left_region should be set to "yes" if, e.g., dt[, 69] == 1 & dt[, 178] == 2 (that is, the company no longer operates in a region it was operating in before). The desired result looks like this:

# desired result (last 8 rows are not printed):
#      id period region1 region2  industry number_employees lag_region1 lag_region2 left_region
# 1: C001      1    <NA>       1   Finance               10        <NA>        <NA>          no
# 2: C001      2    <NA>       2   Finance               10        <NA>           1          no
# 3: C001      3       2       1   Finance               12        <NA>           2         yes
# 4: C002      2       1       1   Culture                2        <NA>        <NA>          no
# 5: C003      1    <NA>    <NA> Nutrition                2        <NA>        <NA>          no
# 6: C003      4       1    <NA> Nutrition                4        <NA>        <NA>          no
# 7: C003      5       2       2 Nutrition                4           1        <NA>          no
# 8: C003      6       2       1 Nutrition                4           2           2         yes
# 9: C003     10       1       2 Nutrition                4           2           1         yes
#10: C004      3    <NA>       1   Finance               18        <NA>        <NA>          no

Problem description

I'm struggling to get this running for all observations at once though. I tried it using ifelse() in a for loop. For this to work I had to make my data.table a data.frame first.

# generate empty cells
df <- data.frame(matrix(NA, nrow = 8278, ncol = 67))
# combine prior data.table and new data.frame in large data.frame (with data.table the following loop does not work)
dt <- as.data.frame(cbind(dt, df))

# loop through 67 columns comparing 69 to 178, 70 to 179, etc.
for (i in 69:135) {
  dt[, i + 176] <- ifelse(
    is.na(dt[, i]) & is.na(dt[, i + 109]), NA,
    ifelse(dt[, i] == 1 & dt[, i + 109] == 2, "yes", "no")
  )
}

# generate final dummy variable left_region --> there is some error here
dt$left_region <-
  ifelse(any(dt[, c(245:311)] == "yes"), "yes", "no")

Running the last ifelse() in combination with any(), however, leads to left_region containing only "yes" for every one of the 8,278 observations.

I tested how that ifelse() command behaves when applied to a single observation.

#take out one observation
one_row <- dt[7, ]

library(dplyr)
# generate left_region for one observation only
new <- 
  one_row %>%
  mutate(left_region = ifelse(any(one_row[, c(245:311)] == "yes"), "yes", "no"))

The picked observation should generate left_region == "no" but it does the opposite in this case. It seems that somehow the last ifelse() argument "no" is not registered by R.

Putting the combination of ifelse() and any() into a for() loop is not a "pretty" solution, and it does not solve the issue either. In this case left_region takes on "yes" in only 270 cases, but still never "no".

for (i in 1:nrow(dt)) {
  dt$left_region[i] <-
    ifelse(any(dt[i, c(245:311)] == "yes"), "yes", "no")
}

Does anyone know why this happens? What do I need to do in order to receive my desired result? Any idea is highly appreciated!

I very much hope that I managed to explain everything in an easily understandable manner. Thanks very much in advance!
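On the "why": any() collapses everything it is given into a single value, so ifelse(any(...), "yes", "no") yields one result that is then recycled over every row. In the row-by-row loop, a row that has no "yes" but contains an NA most likely makes any() return NA, and ifelse() then returns NA rather than "no". The sketch below reproduces both symptoms on a made-up three-row data frame; flags, a, b, whole_table and per_row are hypothetical names, not part of the original data.

```r
# Hypothetical toy data: three rows, two "yes"/"no" flag columns
flags <- data.frame(a = c("no", "yes", NA),
                    b = c("no", "no", "no"),
                    stringsAsFactors = FALSE)

# any() looks at the whole table at once and returns ONE logical value,
# so ifelse() returns one string, which is recycled over all rows
flags$whole_table <- ifelse(any(flags == "yes"), "yes", "no")

# Row by row: any() over FALSE and NA is NA, and ifelse(NA, ...) is NA,
# so a row with a missing flag and no "yes" never becomes "no"
flags$per_row <- NA_character_
for (i in seq_len(nrow(flags))) {
  flags$per_row[i] <- ifelse(any(flags[i, c("a", "b")] == "yes"), "yes", "no")
}

print(flags)
```

Passing na.rm = TRUE to any() would turn the NA case into "no", but a vectorised row-wise test avoids the loop altogether.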

dt[, 69:135] == 1 returns a logical matrix that is TRUE wherever a value in columns 69:135 is 1 and FALSE otherwise (NA where the value is missing).

dt[, 178:244] == 2 returns a logical matrix that is TRUE wherever a value in columns 178:244 is 2 and FALSE otherwise (NA where the value is missing).

You can combine the two with an AND ( & ) operation, which compares them elementwise: dt[, 69] == 1 & dt[, 178] == 2, dt[, 70] == 1 & dt[, 179] == 2, and so on. Take the row-wise sum of the result and mark the row 'yes' if at least one TRUE is found in that row.

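# na.rm = TRUE: comparisons involving NA are ignored, so such rows get 'no' unless another pair is TRUE
dt$left_region <- ifelse(rowSums(dt[, 69:135] == 1 & dt[, 178:244] == 2, na.rm = TRUE) > 0, 'yes', 'no')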
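As a check, here is a minimal sketch of the same idea applied to the first ten rows of the question's example data, using a plain data.frame and selecting columns by name instead of position. The helper names region_cols, lag_cols and left are assumptions made for illustration. na.rm = TRUE is passed to rowSums() so that comparisons involving NA do not turn the whole row into NA, which is what the desired output in the question requires.

```r
# First ten rows of the question's example data (region columns only)
dat <- data.frame(
  id          = c(rep("C001", 3), "C002", rep("C003", 5), "C004"),
  region1     = factor(c(NA, NA, 2, 1, NA, 1, 2, 2, 1, NA)),
  region2     = factor(c(1, 2, 1, 1, NA, NA, 2, 1, 2, 1)),
  lag_region1 = factor(c(rep(NA, 6), 1, 2, 2, NA)),
  lag_region2 = factor(c(NA, 1, 2, rep(NA, 4), 2, 1, NA))
)

# Hypothetical helper names: current columns and their lagged counterparts,
# listed in the same order so that they pair up elementwise
region_cols <- c("region1", "region2")
lag_cols    <- paste0("lag_", region_cols)

# TRUE where the company is not in the region now (1) but was before (2)
left <- dat[, region_cols] == 1 & dat[, lag_cols] == 2

# A row is "yes" if at least one region pair is TRUE; NAs are ignored
dat$left_region <- ifelse(rowSums(left, na.rm = TRUE) > 0, "yes", "no")

print(dat)
```

Selecting by name keeps the pairing explicit; with the full data, the positional ranges 69:135 and 178:244 play the same role as long as both ranges list the regions in the same order.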
