Creating groups using data.table

The working dataset looks like:

library('data.table')
df <- data.table(Name = c("a","a","b","b","c","c","d","d","e","e","f","f"),
                 Y = sample(1:30,12),
                 X = sample(1:30,12))

df
    Name  Y  X
 1:    a 14 23
 2:    a 19 18
 3:    b 10 16
 4:    b 23 11
 5:    c  2 12
 6:    c 12 24
 7:    d  8 14
 8:    d 26  2
 9:    e 16 26
10:    e  6  4
11:    f 29 28
12:    f 28 30

What I eventually want is to plot the data by group (based on Name) for comparison:

library(ggplot2)
ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ Name)

Since the actual dataset contains many more observations and groups, the ggplot I am creating takes too long to render and the final graph is unreadable (grp > 300). I am planning to re-group the data so that each part holds a limited number of observations, and to graph the parts separately (for example, 10 groups at a time).

So the final dataset should look like:

    Name  Y  X grp level
 1:    a 14 23   1     1
 2:    a 19 18   1     1
 3:    b 10 16   2     1
 4:    b 23 11   2     1
 5:    c  2 12   3     1
 6:    c 12 24   3     1
 7:    d  8 14   4     2
 8:    d 26  2   4     2
 9:    e 16 26   5     2
10:    e  6  4   5     2
11:    f 29 28   6     2
12:    f 28 30   6     2

and then I can plot based on the new grouping variable level:

ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

In the above illustration, I created the grp simply by:

df[, grp := .GRP, by = Name]

The question now is how to create level automatically based on grp. (I have to create grp rather than use Name directly as the basis because, in the original dataset, there is no pattern in Name.)

I tried something like:

setkey(df, grp)
i <- 1
j <- 1
while(i < 4 ) {
  df[levels(factor(grp)) == (i:i+2), level := j]
  i <- i + 2
  j <- j + 1
}

This does not work as I need. Could anyone advise me on how to address this problem? I am really stuck here. I guess there could be a simple way to do this; maybe I don't even need to create level and can produce the separate graphs directly by other means?
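One possible reading of the loop above is that every three consecutive grp values should share one level. Under that assumption, the expected output shown earlier can be sketched with integer division instead of a loop. The name per_level is a hypothetical parameter introduced here for illustration, and the Y and X values are placeholders rather than the random values from the question.

```r
library(data.table)

# Same shape as the question's data: six names, two rows each
# (Y and X are placeholder values, not the question's random sample)
df <- data.table(Name = rep(letters[1:6], each = 2),
                 Y = 1:12,
                 X = 12:1)

per_level <- 3L  # hypothetical parameter: number of Name groups per level

# .GRP numbers the groups in order of first appearance
df[, grp := .GRP, by = Name]

# Integer division maps grp 1..3 to level 1, grp 4..6 to level 2, and so on
df[, level := (grp - 1L) %/% per_level + 1L]

df[, .(Name, grp, level)]
```

Each level could then be plotted on its own by subsetting first, for example df[level == 1], and faceting by Name within that subset.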

If there are only a few groups, the fct_collapse() function from the forcats package can be used. It collapses factor levels into manually defined groups.

This way, the new variable level can be created directly, without a detour over group numbers and cut(), and the levels can be given meaningful labels.

library('data.table')
df <- data.table(Name = rep(letters[1:6], each = 2),
                 Y = sample(1:30,12),
                 X = sample(1:30,12))
df[, level := forcats::fct_collapse(Name, "a-c" = letters[1:3], "d-f" = letters[4:6])]
df
#    Name  Y  X level
# 1:    a 11 13   a-c
# 2:    a 29 12   a-c
# 3:    b 16  5   a-c
# 4:    b 12  6   a-c
# 5:    c 25 28   a-c
# 6:    c 27 11   a-c
# 7:    d  5  9   d-f
# 8:    d 23 20   d-f
# 9:    e 13 26   d-f
#10:    e 17 19   d-f
#11:    f 19  8   d-f
#12:    f 22  3   d-f

However, the OP mentioned that there are many groups (df[, uniqueN(Name)] > 300) and that they want to re-group the data with a limited number of observations. Using cut() in the way proposed in this comment may lead to unsatisfactory results.

To demonstrate this, we need a larger sample data set of 100 rows:

N <- 100
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3*N), N),
                 X = sample(seq.int(3*N), N))
df

Note that set.seed() is used to make the data reproducible.

Now, the unique values of Name (which correspond to OP's grp) are split into 6 levels and plotted in facets (following this comment):

n_lvls <- 6
df[, level := as.numeric(cut(as.integer(factor(Name)), breaks = n_lvls))] 
ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

[Image: scatter plot of Y against X, faceted by level]

Here, facet 3 contains only a few data points while other facets appear quite crowded.
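The imbalance can also be checked numerically rather than by eye. The following is a minimal sketch that rebuilds the sample data from above and tabulates rows and distinct names per level; the column names rows and names and the variable counts are introduced here for illustration.

```r
library(data.table)

N <- 100
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3 * N), N),
                 X = sample(seq.int(3 * N), N))

n_lvls <- 6
df[, level := as.numeric(cut(as.integer(factor(Name)), breaks = n_lvls))]

# Rows and distinct names per level; equal-width bins of the factor codes
# do not guarantee equal row counts
counts <- df[, .(rows = .N, names = uniqueN(Name)), by = level][order(level)]
counts
```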


To overcome this, the levels can be arranged to contain approximately the same number of data points instead of the same number of factor levels:

lvls <- df[, .N, by = Name][order(-N), level := cut(cumsum(N), n_lvls, labels = FALSE)]
df <- lvls[df, on = "Name"]

ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

[Image: scatter plot of Y against X, faceted by the rebalanced level]

Now, the observations are more evenly distributed among the facets.

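The code counts the number of observations per Name, sorts in descending order of N, and applies cut() to the cumulative sum of observations, which yields a data.table lvls of the new levels. Finally, lvls is right joined with the original data set df. Because df still carries the level column from the previous step, the join result keeps that older column as i.level; the plot uses the new level.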
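The same steps can be written out one at a time, which makes the intermediate table easier to inspect. This is a sketch rather than the code from the answer: it swaps in setorder() for the chained order(), and an update join for the right join, so that level is written into df by reference and no duplicate column appears.

```r
library(data.table)

N <- 100
n_lvls <- 6
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3 * N), N),
                 X = sample(seq.int(3 * N), N))

# Step 1: count observations per Name (creates a column called N)
lvls <- df[, .N, by = Name]

# Step 2: sort by descending count
setorder(lvls, -N)

# Step 3: bin the running total of counts into n_lvls equal-width intervals
lvls[, level := cut(cumsum(N), n_lvls, labels = FALSE)]

# Step 4: update join, adds (or overwrites) level in df by reference
df[lvls, on = "Name", level := i.level]

# Rows per level after rebalancing
df[, .N, by = level][order(level)]
```

Since cut() works on equal-width intervals of the running total, the levels are only approximately balanced; a Name with many rows still lands in a single level.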
