
R - collapsing a data.frame by unique ID & generating multiple dummy variables

I'm facing a problem I've been trying to solve for a few days now, and I just can't wrap my head around it. Maybe y'all know of a good solution.

I have a data frame with approx. 3,000,000 rows. There is one crucial ID variable with approx. 200,000 unique values. I want to collapse the data.frame into a new data.frame that has only 1 row for each unique ID value.

Furthermore, there are a bunch of variables whose values repeat whenever ID repeats. Here's an example:

ID    NAME   CAR
42    Bob    Ford
42    Bob    Ford
42    Bob    Ford

However, there are also some variables which vary within an ID and denote specific events or actions taken. Here's an example:

ID    NAME   CAR     ACTION    ACTION_ID
42    Bob    Ford    REFILL    4201
42    Bob    Ford    DELIVER   4202
42    Bob    Ford    REFILL    4203

What I want is for this to be flattened to 1 row, but with new dummy variables. Let's assume that ACTION has 5 values of interest (REFILL, DELIVER, PARK, PICKUP, PATROL) in the ENTIRE original data.frame. Furthermore, the ACTION_ID variable is only relevant within its overall ID, and for every given ID value there is a maximum of 5 unique ACTION_ID values.

What I'd like to have is dummy variables for every possible combination of ACTION and ACTION_ID, which would look something like this:

ID    NAME   CAR     REFILL_01    REFILL_02    REFILL_03    REFILL_04    REFILL_05
42    Bob    Ford    TRUE         FALSE        TRUE         NA           NA

DELIVER_01    DELIVER_02    DELIVER_03    DELIVER_04    DELIVER_05
FALSE         TRUE          FALSE         NA            NA

with further dummy variables for PARK_n, PICKUP_n and PATROL_n, where n = 1:5.

I've tried to achieve this with a number of loops in which I subset the big data.frame by unique ID, then try to generate the new variables and append them to a new data frame. But this never works consistently. I'd be so, so grateful if anyone had any idea how to make this work!

All the best, Nik

I was able to make this work. You will need to write the additional code by hand, but this will solve it for you. I am assuming your data frame is named "df".

library(dplyr)
# ACTION values are upper case in the data; the 4th character of ACTION_ID is the slot number.
new <- df %>%
  group_by(ID, NAME) %>%
  mutate(REFILL_01 = ACTION == "REFILL" & substr(ACTION_ID, 4, 4) == "1",
         REFILL_02 = ACTION == "REFILL" & substr(ACTION_ID, 4, 4) == "2")

This takes the data and groups it by ID and then NAME. Then we start making the dummy variables. I will walk you through the first one: REFILL_01 is TRUE if ACTION is "REFILL" and the fourth (here, last) character of ACTION_ID is 1; otherwise it is FALSE. Let me know if this makes sense or if you need any more clarification. You just need to add the other dummy variables you want now. I did REFILL_01 and REFILL_02 for you.
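Note that mutate() keeps one row per input row, so the result above is not yet collapsed to one row per ID. As a rough sketch of how all 25 dummies could be generated and collapsed in one go, here is a base R variant (not the dplyr approach above). It rests on a few assumptions that are not confirmed by the question: the last digit of ACTION_ID is the slot number 1 to 5, NAME and CAR are constant within an ID, and a slot that never occurs for an ID should come out as NA. The names SLOT, actions, slots, has_slot and out are just illustrative.

```r
# Toy data copied from the example in the question.
df <- data.frame(
  ID = c(42, 42, 42),
  NAME = "Bob",
  CAR = "Ford",
  ACTION = c("REFILL", "DELIVER", "REFILL"),
  ACTION_ID = c(4201, 4202, 4203),
  stringsAsFactors = FALSE
)

actions <- c("REFILL", "DELIVER", "PARK", "PICKUP", "PATROL")
slots <- 1:5

# Assumption: the last digit of ACTION_ID is the slot number within an ID.
df$SLOT <- as.integer(df$ACTION_ID) %% 10L

# One row per ID (assumes NAME and CAR never vary within an ID).
ids <- unique(df[c("ID", "NAME", "CAR")])
id_f <- factor(df$ID, levels = ids$ID)
slot_f <- factor(df$SLOT, levels = slots)

# ID x slot matrix: TRUE where that slot occurs at all for that ID.
has_slot <- unclass(table(id_f, slot_f)) > 0

out <- ids
for (a in actions) {
  is_a <- which(df$ACTION == a)
  # ID x slot matrix: TRUE where action `a` was recorded in that slot.
  hit <- unclass(table(id_f[is_a], slot_f[is_a])) > 0
  hit[!has_slot] <- NA  # slot does not exist for this ID
  colnames(hit) <- sprintf("%s_%02d", a, slots)
  out <- cbind(out, as.data.frame(hit))
}
rownames(out) <- NULL
```

For the toy data, out has a single row for ID 42 with 25 dummy columns (REFILL_01 through PATROL_05). Because the counting is done with table() over the whole data frame rather than by looping over IDs, the number of passes depends on the number of ACTION values and not on the number of IDs.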
