Creating groups using data.table

Question

The working dataset looks like:

library('data.table')
df <- data.table(Name = c("a","a","b","b","c","c","d","d","e","e","f","f"),
                 Y = sample(1:30,12),
                 X = sample(1:30,12))

df
    Name  Y  X
 1:    a 14 23
 2:    a 19 18
 3:    b 10 16
 4:    b 23 11
 5:    c  2 12
 6:    c 12 24
 7:    d  8 14
 8:    d 26  2
 9:    e 16 26
10:    e  6  4
11:    f 29 28
12:    f 28 30

What I eventually want is to plot the data by group (based on Name) for comparison:

library(ggplot2)
ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ Name)

The actual dataset contains many more observations and groups (grp > 300), so the ggplot takes too long to render and the final graph is unreadable. I am planning to re-group the data with a limited number of observations and graph the groups separately (for example, 10 groups at a time).

So the final dataset should look like:

    Name  Y  X grp level
 1:    a 14 23   1     1
 2:    a 19 18   1     1
 3:    b 10 16   2     1
 4:    b 23 11   2     1
 5:    c  2 12   3     1
 6:    c 12 24   3     1
 7:    d  8 14   4     2
 8:    d 26  2   4     2
 9:    e 16 26   5     2
10:    e  6  4   5     2
11:    f 29 28   6     2
12:    f 28 30   6     2

Then I can plot based on the new grouping variable level:

ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

In the above illustration, I created the grp simply by:

df[, grp := .GRP, by = Name]

The question now is how to create level automatically from grp. (I have to create grp rather than use Name directly as the basis because, in the original dataset, there is no pattern in Name.)

I tried something like:

setkey(df, grp)
i <- 1
j <- 1
while(i < 4 ) {
  df[levels(factor(grp)) == (i:i+2), level := j]
  i <- i + 2
  j <- j + 1
}

It does not work the way I need. Could anyone advise me on how to address this problem? I am really stuck here. I guess there could be a simple way to do this; maybe I don't even need to create level and can produce the separate graphs directly by other means?

Answer 1

If there are only a few groups, the fct_collapse() function from the forcats package can be used. It collapses factor levels into manually defined groups.

This way, the new variable level can be created directly, without a detour via group numbers and cut(), and the levels can be given meaningful labels.

library('data.table')
df <- data.table(Name = rep(letters[1:6], each = 2),
                 Y = sample(1:30,12),
                 X = sample(1:30,12))
# Collapse the six names into two manually labelled levels
df[, level := forcats::fct_collapse(Name, "a-c" = letters[1:3], "d-f" = letters[4:6])]
df
#    Name  Y  X level
# 1:    a 11 13   a-c
# 2:    a 29 12   a-c
# 3:    b 16  5   a-c
# 4:    b 12  6   a-c
# 5:    c 25 28   a-c
# 6:    c 27 11   a-c
# 7:    d  5  9   d-f
# 8:    d 23 20   d-f
# 9:    e 13 26   d-f
#10:    e 17 19   d-f
#11:    f 19  8   d-f
#12:    f 22  3   d-f

However, the OP mentioned that there are many groups (df[, uniqueN(Name)] > 300) and that the data should be re-grouped so that each plot shows a limited number of observations. Using cut() in the way proposed in a comment may lead to unsatisfactory results.

To demonstrate this, we need a larger sample data set of 100 rows:
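For reference, the fixed-size chunking described in the question (a set number of names per level) can be sketched with integer division on grp. This is only an illustration on toy data, not part of the original answer; groups_per_level is an assumed parameter name, set to 3 here to match the question's desired output (the question mentions 10 for the real data).

```r
library(data.table)

# Toy data: 6 names with 2 rows each, same shape as the question's example
df <- data.table(Name = rep(letters[1:6], each = 2),
                 Y = 1:12,
                 X = 12:1)

groups_per_level <- 3L  # assumed chunk size, hypothetical name

# .GRP numbers the groups 1, 2, 3, ... in order of first appearance
df[, grp := .GRP, by = Name]

# Integer division maps grp 1..3 to level 1, grp 4..6 to level 2, and so on
df[, level := (grp - 1L) %/% groups_per_level + 1L]
```

Like cut() on the group numbers, this balances the number of names per level, not the number of observations, so some facets can still end up much fuller than others.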

N <- 100
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3*N), N),
                 X = sample(seq.int(3*N), N))
df

Note that set.seed() is used to make the data reproducible.

Now, the unique values of Name (which correspond to the OP's grp) are split into 6 levels and plotted in facets (following that comment):

library(ggplot2)
n_lvls <- 6
# Number the names alphabetically, then cut that range into n_lvls equal-width bins
df[, level := as.numeric(cut(as.integer(factor(Name)), breaks = n_lvls))]
ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

Here, facet 3 contains only a few data points while other facets appear quite crowded.
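The imbalance can also be checked numerically instead of by eye. The following sketch rebuilds the sample data from above and tabulates rows and distinct names per level; the column names n_points and n_names are made up for this illustration.

```r
library(data.table)

# Same sample data as above
N <- 100
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3 * N), N),
                 X = sample(seq.int(3 * N), N))

# Equal-width cut over the alphabetical numbering of Name
n_lvls <- 6
df[, level := as.numeric(cut(as.integer(factor(Name)), breaks = n_lvls))]

# Rows and distinct names per level; keyby sorts the result by level
counts <- df[, .(n_points = .N, n_names = uniqueN(Name)), keyby = level]
counts
```

Each level holds a similar number of names, but because the names occur with different frequencies, n_points can differ considerably between levels.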

To overcome this, the levels can be arranged to contain approximately the same number of data points instead of the same number of factor levels:

lvls <- df[, .N, by = Name][order(-N), level := cut(cumsum(N), n_lvls, labels = FALSE)]
df <- lvls[df, on = "Name"]

ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

Now, the observations are more evenly distributed among the facets.

The code counts the number of observations per Name, sorts in descending order of N, and uses cut() on the cumulative sum of observations to create a data.table lvls of the new levels. Finally, the new levels are right-joined with the original data set df.
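To make these steps concrete, here is a small worked sketch with hand-picked counts (the table lvls below is hypothetical toy data, not the sample data set from above). Note that cut() with a single number as breaks splits the range of the cumulative sums into equal-width intervals, so the resulting levels are only approximately balanced.

```r
library(data.table)

# Hypothetical per-name counts, as df[, .N, by = Name] would return them
lvls <- data.table(Name = letters[1:6],
                   N = c(1L, 5L, 2L, 4L, 1L, 3L))

# Sorted by descending N the names are b, d, f, c, a, e
# with cumulative sums         5, 9, 12, 14, 15, 16.
# cut() splits the range 5..16 into 3 equal-width bins (cut points near
# 8.67 and 12.33), giving levels 1, 2, 2, 3, 3, 3 in that sorted order.
lvls[order(-N), level := cut(cumsum(N), 3, labels = FALSE)]

# Observations per level: 5, 7 and 4
per_level <- lvls[, .(n_points = sum(N)), keyby = level]
per_level
```

The largest name fills level 1 on its own, while the rare names are pooled into level 3.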

Answer 1: 1 ACCEPTED 2017-04-07 10:05:09
