简体   繁体   中英

Loop to create dummy variables

I'm currently working on a large data set (ca. 30k rows), And I'm working on creating a hedonic regression. The next step would be to create weekly dummy variables.

Now my data has assigned weekly number depending on the day the data was measured. There are 50 different weeks (1-52, 2 missing unaccounted). These weekly numbers are repeated until the change after about 10 rows, however they are also recurring, as new product categories are measures. There are 132 are available in the dataset and one category contains between 100 - 300 rows .

This is an example of the dataset

UPC         Weeks
1111112016  1
1111112016  1
1111112016  2
1111112016  2
1111112016  3
1111112016  3
1111112440  1
1111112440  1
1111112440  2
1111112440  2
1111112440  3
1111112440  3

Now to create dummy variables, I created 50 columns, each having about 30k rows to represent the dataset. I would like to assign 1 to the row of the dummy week whenever dummy week (hence the column name) and real week (row of the orig. dataset) are equal.

Example Dummy (DW = Dummy Week):

DW1 DW2 
NA  NA
NA  NA
NA  NA

I tried the following:

for (i in 1:seq(Soap$WEEK)){
if Soap$WEEK[i] == seq(from=1, by=1, to=52){
for (j in names(x)){
x$DW[[j]] = 1
else {
  x$DW[[j]] = 0
}}}}

I know it is wrong, however I'm unable to resolve my problem. I would appreciate any help in this matter.

We can use model.matrix() from the stats package to dummify your data. First, we'll need to convert Weeks to a factor column.

df$Weeks <- as.factor(df$Weeks)

Now we can run model.matrix() :

model.matrix(~ Weeks + UPC + 0, data = df)
#   Weeks1 Weeks2 Weeks3        UPC
#1       1      0      0 1111112016
#2       1      0      0 1111112016
#3       0      1      0 1111112016
#4       0      1      0 1111112016
#5       0      0      1 1111112016
#6       0      0      1 1111112016
#7       1      0      0 1111112440
#8       1      0      0 1111112440
#9       0      1      0 1111112440
#10      0      1      0 1111112440
#11      0      0      1 1111112440
#12      0      0      1 1111112440

You can also just use model.matrix(~ . + 0 , data = df) , as numeric columns will be automatically passed over. The + 0 in the formula avoids replacing the first level by the Intercept . To see the difference try to run it without 0 .

Alternatively, you can also use dummyVars from the caret package. Here, no Intercept is the default behaviour:

library(caret)

dm <- dummyVars(" ~ .", data = df)
data.frame(predict(dm, newdata = df))
#          UPC Weeks.1 Weeks.2 Weeks.3
#1  1111112016       1       0       0
#2  1111112016       1       0       0
#3  1111112016       0       1       0
#4  1111112016       0       1       0
#5  1111112016       0       0       1
#6  1111112016       0       0       1
#7  1111112440       1       0       0
#8  1111112440       1       0       0
#9  1111112440       0       1       0
#10 1111112440       0       1       0
#11 1111112440       0       0       1
#12 1111112440       0       0       1

You can solve this by using sapply and comparing the values of the Weeks column with th numeric part of the dummy column names which you can extract with substr .

On your example dataset:

# create the dummy columns and fill them with NA's
dat[, paste0('DW', 1:3)] <- NA

# compare the values in 'Weeks' with the numeric part of the column names
dat[, 3:5] <- sapply(names(dat)[3:5], function(x) as.integer(substr(x,3,3) == dat$Weeks))

the result:

> dat
          UPC Weeks DW1 DW2 DW3
1  1111112016     1   1   0   0
2  1111112016     1   1   0   0
3  1111112016     2   0   1   0
4  1111112016     2   0   1   0
5  1111112016     3   0   0   1
6  1111112016     3   0   0   1
7  1111112440     1   1   0   0
8  1111112440     1   1   0   0
9  1111112440     2   0   1   0
10 1111112440     2   0   1   0
11 1111112440     3   0   0   1
12 1111112440     3   0   0   1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM