Generating a dummy in a panel dataframe in R

I really need your help with this. I have a panel dataframe which looks something like this:

     Name            A                  B      

   1 Marco          01/09/2014         NA    
   2 Marco          NA                 01/01/2015    
   3 Marco          02/01/2015         NA    
   4 Luca           01/01/2015         NA    
   5 Luca           NA                 31/01/2015                        
   6 Silvia         NA                 15/01/2015  

and I want to create a dummy variable that takes the value 1 if (condition 1) the observation does not show a 2014 date in column A, OR (condition 2) the observation shows a 2015 date in column B AND, at the same time, there is at least one other observation for that individual and none of that individual's observations has a 2014 date in column A. In other words, I do not know how to impose a condition for the dummy that checks all the other observations related to the same individual (identified in the column "Name"). The result I want is something like this:

         Name            A                  B                     dummy

      1  Marco          01/09/2014         NA                     0    
      2  Marco          NA                 01/01/2015             0     
      3  Marco          02/01/2015         NA                     1    
      4  Luca           01/01/2015         NA                     1     
      5  Luca           NA                 31/01/2015             1                        
      6  Silvia         NA                 15/01/2015             0    

In the example above, the value of the dummy at the first observation is 0 because of the 2014 date in column A (condition 1 not satisfied). At the second observation, the dummy takes the value 0 because, despite the 2015 date in column B, the same individual (Marco) has a 2014 date in column A in at least one of his other observations (observation 1 in this case). Observation 4 instead has the dummy equal to 1 since the date in column A is in 2015. Observation 5 has the dummy equal to 1 since, in addition to the 2015 date in column B, the same individual (Luca) has another observation and none of his observations has a 2014 date in column A (observation 4 has a 2015 date). Finally, the dummy associated with Silvia must be 0 since, despite the 2015 date in column B, there is no other observation for Silvia in the dataframe.

I hope it is not too twisted and that I have expressed my idea clearly. Let me know if anything is unclear. Beyond the conditions themselves, even just showing me how to impose conditions across different observations related to the same individual would already help a lot.

Thank you all! Marco

  structure(list(Name = c("Marco", "Marco", "Marco", "Luca", "Luca", "Silvia"), A = structure(c(1409529600, NA, 1420156800, 1420070400, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), B = structure(c(NA, 1420070400, NA, NA, 1422662400, 1421280000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")) 

You can use the year function from the lubridate library to extract the year from a date. Also note that an NA inside a condition yields NA, which is why it is better to convert NA to some placeholder value before using it in conditional statements. Example code:

    library(dplyr)      # provides %>%, group_by, summarise, left_join, mutate
    library(lubridate)  # provides year()

    Marco <- read.csv("Marcoset.csv", stringsAsFactors = FALSE)
    # Replace NA with a placeholder date so comparisons never return NA
    Marco$A[is.na(Marco$A)] <- "01/01/0001"
    Marco$B[is.na(Marco$B)] <- "01/01/0001"
    Marco$A <- as.Date(Marco$A, "%d/%m/%Y")
    Marco$B <- as.Date(Marco$B, "%d/%m/%Y")

    # Per individual: number of observations, and whether any A date is in 2014
    Obs <- Marco %>%
      group_by(Name) %>%
      summarise(obs = n(), i2014 = as.integer(any(year(A) == 2014)))

    Marco <- Marco %>%
      left_join(Obs, by = "Name") %>%
      # year() returns a number; the placeholder date has year 1
      mutate(dummy = ifelse((year(A) != 2014 & year(A) != 1) |
                              (year(B) == 2015 & obs >= 2 & i2014 == 0), 1, 0)) %>%
      select(-obs, -i2014)
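To illustrate why NA needs special handling in conditions, here is a minimal base R sketch. The vector `yr` and the variable names are hypothetical, made up for illustration; the placeholder year 1 mirrors the 01/01/0001 placeholder used above.

```r
# Hypothetical vector of years, with NA standing in for a missing date
yr <- c(2014, NA, 2015)

# A comparison against NA is NA, and ifelse() passes that NA through
flag_raw <- ifelse(yr != 2014, 1, 0)

# Option 1: replace NA with a placeholder (year 1) before comparing
yr_filled <- ifelse(is.na(yr), 1, yr)
flag_filled <- ifelse(yr_filled != 2014 & yr_filled != 1, 1, 0)

# Option 2: test for NA explicitly; FALSE & NA evaluates to FALSE
flag_explicit <- ifelse(!is.na(yr) & yr != 2014, 1, 0)

print(flag_raw)
print(flag_filled)
print(flag_explicit)
```

Both options give 0 for the missing value instead of NA; which one reads better is largely a matter of taste.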

The NAs make it a little tricky, but here's a direct method, adding the implied condition "A is not NA" to the first case. Using %in% instead of == helps with other NA issues because 1 %in% NA is FALSE, but 1 == NA is NA.

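    library(dplyr)

    # dd is the data frame from the question (the structure(...) output above)
    dd %>% group_by(Name) %>%
      mutate(dummy = as.integer((
          !format(A, "%Y") %in% "2014" & !is.na(A)   # condition 1: A present and not in 2014
        ) | (
          format(B, "%Y") %in% "2015"                # condition 2: B in 2015 ...
          & n() > 1                                  # ... more than one observation in the group ...
          & !any(format(A, "%Y") %in% "2014")        # ... and no 2014 date in A within the group
        )
      ))
    # # A tibble: 6 x 4
    # # Groups:   Name [3]
    #   Name   A                   B                   dummy
    #   <chr>  <dttm>              <dttm>              <int>
    # 1 Marco  2014-09-01 00:00:00 NA                      0
    # 2 Marco  NA                  2015-01-01 00:00:00     0
    # 3 Marco  2015-01-02 00:00:00 NA                      1
    # 4 Luca   2015-01-01 00:00:00 NA                      1
    # 5 Luca   NA                  2015-01-31 00:00:00     1
    # 6 Silvia NA                  2015-01-15 00:00:00     0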
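For readers who want to avoid the dplyr dependency, the same group-wise logic can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned to the rows. This is only an illustrative sketch: the data frame is rebuilt by hand with Date columns instead of POSIXct (an assumption made for brevity), and the helper names `yearA`, `yearB`, `any2014` and `n_obs` are made up for the example.

```r
# NA handling: == propagates NA, %in% does not
1 == NA        # NA
1 %in% NA      # FALSE

# Example data from the question, rebuilt with Date columns (assumption)
dd <- data.frame(
  Name = c("Marco", "Marco", "Marco", "Luca", "Luca", "Silvia"),
  A = as.Date(c("2014-09-01", NA, "2015-01-02", "2015-01-01", NA, NA)),
  B = as.Date(c(NA, "2015-01-01", NA, NA, "2015-01-31", "2015-01-15"))
)

yearA <- format(dd$A, "%Y")  # NA where the date is missing
yearB <- format(dd$B, "%Y")

# Per-row copies of group-level facts, computed within each Name
any2014 <- ave(yearA %in% "2014", dd$Name, FUN = any)
n_obs <- ave(seq_along(dd$Name), dd$Name, FUN = length)

dd$dummy <- as.integer(
  (!is.na(dd$A) & !yearA %in% "2014") |
    (yearB %in% "2015" & n_obs > 1 & !any2014)
)
print(dd)
```

The key idea is the same as in the grouped mutate: compute a group-level quantity (any 2014 date in A, number of observations) and make it available on every row of that group before combining the conditions.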


 