R - collapsing a data.frame by unique ID & generating multiple dummy variables
I'm facing a problem I've been trying to solve for a few days now, and I just can't wrap my head around it. Maybe y'all know of a good solution.
I have a data frame with approx. 3,000,000 rows. There is one crucial ID variable, with approx. 200,000 unique values. I want to collapse the data.frame to a new data.frame which has only 1 row for each unique value of the ID variable.
Furthermore, there are a bunch of variables which are also duplicates whenever ID is a duplicate. Here's an example:
ID NAME CAR
42 Bob Ford
42 Bob Ford
42 Bob Ford
However, there are also some variables which vary for a subset of the data frame, which denote specific events or actions taken. Here's an example:
ID NAME CAR ACTION ACTION_ID
42 Bob Ford REFILL 4201
42 Bob Ford DELIVER 4202
42 Bob Ford REFILL 4203
What I want is for this to be flattened to 1 row, but with new dummy variables. Let's assume that ACTION has 5 values of interest (REFILL, DELIVER, PARK, PICKUP, PATROL) in the ENTIRE original data.frame. Furthermore, the ACTION_ID variable is only relevant to the overall ID, and for every given ID value there is a maximum of 5 unique ACTION_ID values.
What I'd like to have is dummy variables for every possible combination of ACTION and ACTION_ID, which would look something like this:
ID NAME CAR REFILL_01 REFILL_02 REFILL_03 REFILL_04 REFILL_05
42 Bob Ford TRUE FALSE TRUE NA NA
DELIVER_01 DELIVER_02 DELIVER_03 DELIVER_04 DELIVER_05
FALSE TRUE FALSE NA NA
with further dummy variables for PARK_n, PICKUP_n and PATROL_n, whereby n = 1:5.
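Spelled out, that is 5 actions times 5 positions, so 25 dummy columns. A minimal sketch of generating those column names in R follows; the vector names `actions` and `dummy_names` are made up for this illustration and are not part of the data.

```r
# The five ACTION values of interest, as listed above
actions <- c("REFILL", "DELIVER", "PARK", "PICKUP", "PATROL")

# Zero-padded position suffixes "01" .. "05"
suffixes <- sprintf("%02d", 1:5)

# Repeat each action five times and recycle the suffixes alongside it:
# REFILL_01 .. REFILL_05, DELIVER_01 .. DELIVER_05, and so on
dummy_names <- paste(rep(actions, each = length(suffixes)), suffixes, sep = "_")
```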
I've tried to achieve this with a number of loops whereby I subset the big data.frame by unique ID and then try to generate the new variables and append them to a new data frame. But this never works consistently. I'd be so grateful if someone had any kind of idea as to how to make this work!
All the best, Nik
I was able to make this work. You will need to write the additional code by hand, but this will solve it for you. I am assuming your data frame is named "df".
library(dplyr)
# Flag each row; ACTION is upper case in the data, and the dummies are logical, not strings.
new <- df %>%
  group_by(ID, NAME) %>%
  mutate(
    REFILL_01 = ACTION == "REFILL" & substr(ACTION_ID, 4, 4) == "1",
    REFILL_02 = ACTION == "REFILL" & substr(ACTION_ID, 4, 4) == "2"
  )
This takes the data and groups it by ID and then NAME. Then we start making the dummy variables. I will walk you through the first one. REFILL_01 is TRUE if both the ACTION is a refill and the last digit of ACTION_ID is 1; otherwise it is FALSE. Let me know if this makes sense or if you need any more clarification. You just need to add the other dummy variables you want now. I did REFILL_01 and REFILL_02 for you.
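Note that mutate() adds columns without dropping rows, so the result above still has one row per action rather than one row per ID. A minimal base R sketch (no dplyr needed) of collapsing to one row per ID and generating all 25 dummies in a loop follows. It assumes, as the example suggests but the question does not state outright, that the last two digits of ACTION_ID give the position n within the ID. The names `actions`, `act_at` and `out` are made up for this illustration, and `df` is rebuilt here from the example rows.

```r
# Example rows from the question (df is the assumed name of the big data frame)
df <- data.frame(
  ID        = c(42, 42, 42),
  NAME      = c("Bob", "Bob", "Bob"),
  CAR       = c("Ford", "Ford", "Ford"),
  ACTION    = c("REFILL", "DELIVER", "REFILL"),
  ACTION_ID = c(4201, 4202, 4203),
  stringsAsFactors = FALSE
)

actions <- c("REFILL", "DELIVER", "PARK", "PICKUP", "PATROL")

# ASSUMPTION: the last two digits of ACTION_ID are the position within the ID
df$n <- df$ACTION_ID %% 100

# One row per ID, keeping the columns that are constant within an ID
out <- unique(df[c("ID", "NAME", "CAR")])

# For each position k, look up the action each ID recorded at that position.
# match() returns NA for IDs with no k-th action, so the lookup yields NA there.
# If an ID has several rows at the same position, the first one is used.
act_at <- lapply(1:5, function(k) {
  at_k <- df[df$n == k, ]
  at_k$ACTION[match(out$ID, at_k$ID)]
})

# Comparing with NA gives NA, so missing positions stay NA in every dummy
for (a in actions) {
  for (k in 1:5) {
    out[[sprintf("%s_%02d", a, k)]] <- act_at[[k]] == a
  }
}
```

For the example rows this gives REFILL_01 = TRUE, REFILL_02 = FALSE, REFILL_03 = TRUE and NA for positions 4 and 5, which matches the layout asked for in the question. The loop runs only 25 vectorised comparisons regardless of the number of IDs, so it avoids subsetting the data frame once per ID.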