
Loop to create dummy variables

I'm currently working on a large data set (ca. 30k rows) and am building a hedonic regression. The next step is to create weekly dummy variables.

Each row of my data has a week number assigned according to the day the data was measured. There are 50 different weeks (numbered 1-52, with 2 missing). A week number repeats for about 10 rows before it changes, and the sequence starts again whenever a new product category is measured. There are 132 categories in the dataset, and one category contains between 100 and 300 rows.

This is an example of the dataset:

UPC         Weeks
1111112016  1
1111112016  1
1111112016  2
1111112016  2
1111112016  3
1111112016  3
1111112440  1
1111112440  1
1111112440  2
1111112440  2
1111112440  3
1111112440  3

To create the dummy variables, I created 50 columns, each with about 30k rows to match the dataset. I would like to assign 1 to a row of a dummy-week column whenever the dummy week (given by the column name) and the real week (in that row of the original dataset) are equal.

Example dummy columns (DW = Dummy Week):

DW1 DW2 
NA  NA
NA  NA
NA  NA

I tried the following:

for (i in 1:seq(Soap$WEEK)){
if Soap$WEEK[i] == seq(from=1, by=1, to=52){
for (j in names(x)){
x$DW[[j]] = 1
else {
  x$DW[[j]] = 0
}}}}

I know it is wrong, but I'm unable to resolve the problem. I would appreciate any help in this matter.

We can use model.matrix() from the stats package to dummify your data. First, we'll need to convert Weeks to a factor column.

df$Weeks <- as.factor(df$Weeks)

Now we can run model.matrix():

model.matrix(~ Weeks + UPC + 0, data = df)
#   Weeks1 Weeks2 Weeks3        UPC
#1       1      0      0 1111112016
#2       1      0      0 1111112016
#3       0      1      0 1111112016
#4       0      1      0 1111112016
#5       0      0      1 1111112016
#6       0      0      1 1111112016
#7       1      0      0 1111112440
#8       1      0      0 1111112440
#9       0      1      0 1111112440
#10      0      1      0 1111112440
#11      0      0      1 1111112440
#12      0      0      1 1111112440
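To make the output above reproducible, here is a minimal sketch that rebuilds the example data and attaches the week dummies to the original data frame. The names df_dummies and week_cols are only illustrative, and the data is assumed to be a plain data.frame with a numeric UPC column.

```r
# Rebuild the example data from the question (assumed to be a plain data.frame).
df <- data.frame(
  UPC   = rep(c(1111112016, 1111112440), each = 6),
  Weeks = rep(rep(1:3, each = 2), times = 2)
)
df$Weeks <- as.factor(df$Weeks)

# One indicator column per week; "+ 0" drops the intercept.
mm <- model.matrix(~ Weeks + UPC + 0, data = df)

# Keep only the week dummies and bind them to the original columns.
week_cols  <- grep("^Weeks", colnames(mm), value = TRUE)
df_dummies <- cbind(df, mm[, week_cols])
```

Each row should have exactly one week dummy set to 1, which can be checked with rowSums(mm[, week_cols]).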

You can also just use model.matrix(~ . + 0, data = df), as numeric columns are passed through unchanged. The + 0 in the formula prevents the first level from being replaced by the Intercept. To see the difference, try running it without the + 0.
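The effect of the + 0 can be seen by comparing the column names of the two design matrices. This is a small sketch on a three-week factor; the object names are illustrative only.

```r
# A factor with three weeks, two rows each.
df <- data.frame(Weeks = factor(rep(1:3, each = 2)))

# Default formula: the first level is absorbed into the intercept.
with_intercept <- model.matrix(~ Weeks, data = df)

# "+ 0" removes the intercept, so every level gets its own column.
without_intercept <- model.matrix(~ Weeks + 0, data = df)

colnames(with_intercept)     # "(Intercept)" "Weeks2" "Weeks3"
colnames(without_intercept)  # "Weeks1" "Weeks2" "Weeks3"
```

If the dummies go into a regression that already has an intercept, one week usually has to be left out as the baseline, which is what the default coding does.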

Alternatively, you can use dummyVars from the caret package. Here, having no Intercept is the default behaviour:

library(caret)

dm <- dummyVars(" ~ .", data = df)
data.frame(predict(dm, newdata = df))
#          UPC Weeks.1 Weeks.2 Weeks.3
#1  1111112016       1       0       0
#2  1111112016       1       0       0
#3  1111112016       0       1       0
#4  1111112016       0       1       0
#5  1111112016       0       0       1
#6  1111112016       0       0       1
#7  1111112440       1       0       0
#8  1111112440       1       0       0
#9  1111112440       0       1       0
#10 1111112440       0       1       0
#11 1111112440       0       0       1
#12 1111112440       0       0       1

You can solve this by using sapply and comparing the values of the Weeks column with the numeric part of the dummy column names, which you can extract with substr.

On your example dataset:

# create the dummy columns and fill them with NA's
dat[, paste0('DW', 1:3)] <- NA

# compare the values in 'Weeks' with the numeric part of the column names
dat[, 3:5] <- sapply(names(dat)[3:5], function(x) as.integer(substr(x,3,3) == dat$Weeks))

The result:

> dat
          UPC Weeks DW1 DW2 DW3
1  1111112016     1   1   0   0
2  1111112016     1   1   0   0
3  1111112016     2   0   1   0
4  1111112016     2   0   1   0
5  1111112016     3   0   0   1
6  1111112016     3   0   0   1
7  1111112440     1   1   0   0
8  1111112440     1   1   0   0
9  1111112440     2   0   1   0
10 1111112440     2   0   1   0
11 1111112440     3   0   0   1
12 1111112440     3   0   0   1
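Note that substr(x, 3, 3) reads only a single character, so a column such as DW10 would be compared as "1". For the full dataset with week numbers up to 52, a variant that iterates over the week values themselves avoids parsing the column names. This is a sketch under the assumption that Weeks is numeric; the sample week values below are made up for illustration.

```r
# Small illustrative data with one- and two-digit week numbers (made-up values).
dat <- data.frame(
  UPC   = rep(c(1111112016, 1111112440), each = 3),
  Weeks = c(1, 10, 52, 1, 10, 52)
)

# Weeks actually present in the data; week numbers that never occur get no column.
weeks <- sort(unique(dat$Weeks))

# One 0/1 column per week, named DW<week>.
dummies <- sapply(weeks, function(w) as.integer(dat$Weeks == w))
colnames(dummies) <- paste0("DW", weeks)

dat <- cbind(dat, dummies)
```

Because the columns are generated from unique(dat$Weeks), the two weeks missing from the real data would simply not get a dummy column.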
