I am converting unstructured data into a long format and need to create an ID (grouping) variable. I want to assign an ID variable based on sets of values contained in another variable. More specifically, consider the following data set.
set.seed(1234); x.1 <- rep(letters[1:5], 10)
x.2 <- sample(c(0:10), 50, replace=TRUE)
x.3 <- rep(NA, 50); df <- data.frame(x.1, x.2, x.3)
df <- df[-c(2, 19),]
A unique case can be identified from the x.1 variable -- it starts with a
and ends with e
. This is always the case. x.3 will hold the ID (grouping) variable.
> head(df, 9)
x.1 x.2 x.3
a 1 NA
c 6 NA
d 6 NA
e 9 NA
a 7 NA
b 0 NA
c 2 NA
d 7 NA
e 5 NA
The number of records between a
and e
for a given case can vary considerably (in the real data file). Thus, I cannot assign a unique ID by simply dividing the variable by a fixed number of records. I figured out how to make the proper assignment by using a for loop:
START <- which(df$x.1== "a")
END <- which(df$x.1 == "e")
for(i in 1:length(START)){df$x.3[START[i]:END[i]] <- i}
head(df, 9)
x.1 x.2 x.3
a 1 1
c 6 1
d 6 1
e 9 1
a 7 2
b 0 2
c 2 2
d 7 2
e 5 2
The obvious problem with this approach is that it is much too slow for a data set with over one million records. It seems that lapply
could be an alternative, but I can't seem to figure out how to specify when a case ends and a new one begins as it traverses down through the data file. And, feel free to point me to an existing answer if one exists -- I didn't fine one!
Thanks in advance.
If there are no gaps between groups, ie after each "e" follows an "a" for the next group, you can use cumsum
easily:
df$x.3 <- cumsum(df$x.1 == "a")
df
# x.1 x.2 x.3
#1 a 1 1
#3 c 6 1
#4 d 6 1
#5 e 9 1
#6 a 7 2
#7 b 0 2
#8 c 2 2
#9 d 7 2
#10 e 5 2
#11 a 7 3
#12 b 5 3
#13 c 3 3
#...
And if your data was enormously large you could use data.table to update the data by reference:
library(data.table)
setDT(df)[, x.3 := cumsum(x.1 == "a")]
As correctly noted by @nicola in the comments, this assumes that a
s only appear at beginngs of groups, not in the middle of them . Based on the sample data, this seems like a valid assumption.
How it works:
Let's take a subset of column "x.1":
x <- df$x.1[1:15]
x
# [1] a c d e a b c d e a b c d e a
#Levels: a b c d e
You can now check if x is equal to "a" which will create a logical vector:
x == "a"
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
Now, what cumsum
does: it adds up cumulatively all the TRUE values (which are 1s essentially):
cumsum(x == "a")
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4
So you can use logical vectors like numeric vectors and do mathematical calculations with them like a vector of 1s and 0s.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.