Creating groups using data.table

The working dataset looks like this:

library('data.table')
df <- data.table(Name = c("a","a","b","b","c","c","d","d","e","e","f","f"),
                 Y = sample(1:30,12),
                 X = sample(1:30,12))

df
    Name  Y  X
 1:    a 14 23
 2:    a 19 18
 3:    b 10 16
 4:    b 23 11
 5:    c  2 12
 6:    c 12 24
 7:    d  8 14
 8:    d 26  2
 9:    e 16 26
10:    e  6  4
11:    f 29 28
12:    f 28 30

What I eventually want is to make a graph by group (based on Name) for comparison:

library(ggplot2)
ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ Name)

The actual dataset contains many more observations and values of grp. The ggplot I am creating takes too much time to process, and the final graph is unreadable (grp > 300). I am planning to re-group the data with a limited number of observations and graph the groups separately (for example, graph 10 groups each time).

So the final dataset should look like this:

    Name  Y  X grp level
 1:    a 14 23   1     1
 2:    a 19 18   1     1
 3:    b 10 16   2     1
 4:    b 23 11   2     1
 5:    c  2 12   3     1
 6:    c 12 24   3     1
 7:    d  8 14   4     2
 8:    d 26  2   4     2
 9:    e 16 26   5     2
10:    e  6  4   5     2
11:    f 29 28   6     2
12:    f 28 30   6     2

and then I can do the graphing based on the new group level:

ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)

In the illustration above, I created grp simply with:

df[, grp := .GRP, by = Name]

The question now is how to create the level group automatically based on grp (I have to create grp rather than use Name directly as the basis because, in the original dataset, there is no pattern in Name).

I tried something like:

setkey(df, grp)
i <- 1
j <- 1
while(i < 4 ) {
  df[levels(factor(grp)) == (i:i+2), level := j]
  i <- i + 2
  j <- j + 1
}

It does not work the way I need. Could anyone give me some advice on how to address this problem? I am really stuck here. I guess there could be a simple way to do this; maybe I don't even need to create the level group and can create the separate graphs directly by other means?
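One likely reason the loop above misbehaves is operator precedence: in R, `:` binds tighter than `+`, so `i:i+2` evaluates to `(i:i) + 2`. In addition, `levels(factor(grp))` has one entry per group rather than one per row. The sketch below illustrates the precedence issue and then derives level from grp with integer division. It assumes a fixed number of groups per level is acceptable; the name `grps_per_level` and the toy X and Y values are made up for illustration.

```r
library(data.table)

# Toy data with the same shape as in the question (X and Y values are arbitrary)
df <- data.table(Name = rep(letters[1:6], each = 2),
                 Y = 1:12,
                 X = 12:1)
df[, grp := .GRP, by = Name]

# `:` binds tighter than `+`, so i:i+2 is (i:i) + 2, not i:(i + 2)
i <- 1
i:i+2      # a single number: 3
i:(i + 2)  # the intended sequence: 1 2 3

# Integer division maps grp 1..3 to level 1, grp 4..6 to level 2, and so on
grps_per_level <- 3L   # hypothetical chunk size; the question suggests 10 for the real data
df[, level := (grp - 1L) %/% grps_per_level + 1L]
df
```

With the real data, setting `grps_per_level` to 10 would put ten values of Name into each level, regardless of how many rows each Name has.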

If there are only a few groups, the fct_collapse() function from the forcats package can be used. It makes it easy to collapse factor levels into manually defined groups.

This way, the new variable level can be created directly without taking a detour via group numbers and cut(). The levels can also be given meaningful labels.

library('data.table')
df <- data.table(Name = rep(letters[1:6], each = 2),
                 Y = sample(1:30,12),
                 X = sample(1:30,12))
df[, level := forcats::fct_collapse(Name, "a-c" = letters[1:3], "d-f" = letters[4:6])]
df
#    Name  Y  X level
# 1:    a 11 13   a-c
# 2:    a 29 12   a-c
# 3:    b 16  5   a-c
# 4:    b 12  6   a-c
# 5:    c 25 28   a-c
# 6:    c 27 11   a-c
# 7:    d  5  9   d-f
# 8:    d 23 20   d-f
# 9:    e 13 26   d-f
#10:    e 17 19   d-f
#11:    f 19  8   d-f
#12:    f 22  3   d-f

However, the OP mentioned that there are many groups (df[, uniqueN(Name)] > 300) and that he wants to re-group the data with a limited number of observations. Using cut() in the way proposed in this comment may lead to unsatisfactory results.

To demonstrate this, we need to create a larger sample data set of 100 rows:

N <- 100
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3*N), N),
                 X = sample(seq.int(3*N), N))
df

Note that set.seed() is used to make the data reproducible.

Now, the unique values of Name (which correspond to the OP's grp) are split into 6 levels and plotted in facets (following this comment):

library(ggplot2)
n_lvls <- 6
df[, level := as.numeric(cut(as.integer(factor(Name)), breaks = n_lvls))]
ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)


Here, facet 3 contains only a few data points while the other facets appear quite crowded.


To overcome this, the levels can be arranged to contain approximately the same number of data points instead of the same number of factor levels:

lvls <- df[, .N, by = Name][order(-N), level := cut(cumsum(N), n_lvls, labels = FALSE)]
df <- lvls[df, on = "Name"]

ggplot(df, aes(X, Y)) + geom_point() + facet_grid(. ~ level)


Now, the observations are more evenly distributed among the facets.
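The difference between the two binning approaches can also be checked numerically rather than by eye. The following is a minimal sketch that rebuilds the 100-row sample and tabulates the number of observations per facet for each approach. The column names `level_by_name` and `level_by_count` are invented here so that both results can sit side by side; the actual counts depend on the random sample and are not shown.

```r
library(data.table)

N <- 100
n_lvls <- 6
set.seed(1234)
df <- data.table(Name = sample(letters, N, replace = TRUE),
                 Y = sample(seq.int(3*N), N),
                 X = sample(seq.int(3*N), N))

# Approach 1: equal-width bins over the factor codes of Name
df[, level_by_name := cut(as.integer(factor(Name)), breaks = n_lvls, labels = FALSE)]

# Approach 2: bins over the cumulative number of observations,
# with names processed from most to fewest rows
lvls <- df[, .N, by = Name][order(-N), level_by_count := cut(cumsum(N), n_lvls, labels = FALSE)]
df[lvls, on = "Name", level_by_count := i.level_by_count]

# Observations per facet under each approach
counts_by_name  <- df[, .N, keyby = level_by_name]
counts_by_count <- df[, .N, keyby = level_by_count]
counts_by_name
counts_by_count
```

Comparing the spread of N in the two tables (for example with `range(counts_by_name$N)` and `range(counts_by_count$N)`) gives a quick indication of how balanced the facets are.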

The code counts the number of observations per Name, sorts in descending order of N, and uses cut() on the cumulative sum of observations to create a data.table lvls of the new levels. Finally, the new levels are right joined with the original data set df.
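The steps described above can be followed one at a time on a small table. This is only a sketch on made-up data: the table `toy` and the helper column `cum_n` are not part of the original answer and exist solely to make the intermediate result visible.

```r
library(data.table)

# Hypothetical toy data: names b and d have 4 rows each, the others 2 rows each
toy <- data.table(Name = rep(c("a", "b", "c", "d", "e", "f"),
                             times = c(2, 4, 2, 4, 2, 2)))
n_lvls <- 2

# Step 1: count observations per Name
lvls <- toy[, .N, by = Name]

# Step 2: walk through the names from most to fewest rows,
# keep a running total, and bin that running total with cut()
lvls[order(-N), cum_n := cumsum(N)]
lvls[, level := cut(cum_n, n_lvls, labels = FALSE)]
lvls[order(-N)]

# Step 3: right join, so that every row of the original data gets its level
toy <- lvls[toy, on = "Name"]
toy
```

Printing `lvls[order(-N)]` shows the running total next to the assigned level, which makes it easier to see where cut() places the boundaries.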
