
Group by all columns in a data.table

I'm working with the iris dataset as a data.table in R.

As a reminder of what it looks like, here are the first six rows:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa

I would like to count the number of rows, grouped by all columns. Of course, we could list every variable in by, like this:

iris[, .(Freq = .N), by = .(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)]



   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
1:          5.1         3.5          1.4         0.2  setosa    1
2:          4.9         3.0          1.4         0.2  setosa    1
3:          4.7         3.2          1.3         0.2  setosa    1
4:          4.6         3.1          1.5         0.2  setosa    1
5:          5.0         3.6          1.4         0.2  setosa    1
6:          5.4         3.9          1.7         0.4  setosa    1

However, is there a way to group by all variables without typing out every column name?

In case you are looking for duplicates, uniqueN defaults to using all columns:

uniqueN(as.data.table(iris))
# [1] 149

This doesn't answer your question directly, but it might be a more direct way of accomplishing what you were trying to do in the first place.
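If the goal is only to count how many rows are exact duplicates, a minimal sketch along these lines may help. It assumes the data.table package is installed; dt, n_dup and n_species are illustrative names that do not appear in the question.

```r
library(data.table)

# Work on a data.table copy of the built-in iris data
dt <- as.data.table(iris)

# uniqueN() considers all columns by default, so the difference from
# nrow() is the number of rows that repeat an earlier row
n_dup <- nrow(dt) - uniqueN(dt)

# The by argument restricts the comparison to selected columns
n_species <- uniqueN(dt, by = "Species")
```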

Similarly, if you want to know which rows are duplicated, you can use the data.table method of duplicated, which likewise defaults to using all columns:

iris[duplicated(iris)]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1:          5.8         2.7          5.1         1.9 virginica
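Note that duplicated flags only the later copy of each repeated row. To see every copy, a sketch like the following might work; it assumes data.table is installed, and dt and all_copies are illustrative names.

```r
library(data.table)

dt <- as.data.table(iris)

# duplicated() marks only later copies; fromLast = TRUE marks the
# earlier ones instead, so the union of both selects every copy
all_copies <- dt[duplicated(dt) | duplicated(dt, fromLast = TRUE)]
```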

Here is an approach in base R:

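# Build one key string per row and tabulate how often each key occurs
Freq <- table(apply(iris, 1, paste0, collapse = " "))
# For each row, look up the count stored under that row's key
iris$Freq <- apply(iris, 1, function(x) Freq[names(Freq) %in% paste0(x, collapse = " ")])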

Output:

> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Freq
...          ...         ...          ...         ...  ...        ...
140          6.9         3.1          5.4         2.1  virginica    1
141          6.7         3.1          5.6         2.4  virginica    1
142          6.9         3.1          5.1         2.3  virginica    1
143          5.8         2.7          5.1         1.9  virginica    2
144          6.8         3.2          5.9         2.3  virginica    1
145          6.7         3.3          5.7         2.5  virginica    1
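The same idea can be sketched without the row-wise apply calls, which may be easier to read. This is only an illustration under the assumption that pasting the column values together gives a usable key; df, key and counts are names introduced here.

```r
# Work on a copy so the built-in dataset is left untouched
df <- iris

# One key string per row, built column-wise instead of with apply()
key <- do.call(paste, c(df, sep = " "))

# Count each key once, then look the count up for every row
counts <- table(key)
df$Freq <- as.vector(counts[key])
```

Rows that share all values receive the same Freq, so filtering on df$Freq > 1 lists every copy of a repeated row.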

We can pass all the column names to by:

library(data.table)
out1 <- as.data.table(iris)[, .N, by = names(iris)]

Checking against the OP's approach:

out2 <-  as.data.table(iris)[,  .N, by = .(Sepal.Length, 
      Sepal.Width, Petal.Length, Petal.Width, Species)]
identical(out1, out2)
#[1] TRUE
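If this is needed for several tables, the call could be wrapped in a small helper. The function count_all below is hypothetical (it is not part of data.table), and the sketch assumes no column is itself called dt, names or Freq.

```r
library(data.table)

# Hypothetical helper: count rows per unique combination of all columns
count_all <- function(dt) {
  dt <- as.data.table(dt)
  # names(dt) supplies every column name to by, as in the answer above
  dt[, .(Freq = .N), by = names(dt)]
}

res <- count_all(iris)

# Keep only the combinations that occur more than once
repeated <- res[Freq > 1]
```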
