library(data.table)
foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))
Suppose I have the foo data table (as created above), which prints as:
            x   y   z1   z2
1: 0.06666667 0.4 0.10 0.10
2: 0.13333333 0.3 0.15 0.15
3: 0.20000000 0.2 0.20 0.20
4: 0.26666667 0.1 0.25 0.25
5: 0.33333333 0.0 0.30 0.30
How can I efficiently count the unique columns? In this case there are only 3, since z1 and z2 are identical.
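A quick check confirming the duplicate (my illustration, not part of the original question):

identical(foo$z1, foo$z2)
# [1] TRUE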
Please assume that in general foo is always a data table and not a matrix, though the columns are always numeric. In reality foo is big: nrow > 20k and ncol > 100. Is it possible to do this without making extra copies of the data?
My current approach is to apply over the columns with paste to get a single value for each column, and then do length(unique(.)) on the result...
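A sketch of that approach, as I read the description (the exact code is not in the original question):

length(unique(apply(foo, 2, paste, collapse = ",")))
# [1] 3

Note that apply() coerces the data.table to a matrix first, which already makes a full copy.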
Is there any magic with data.table::transpose(), data.table::uniqueN, and maybe some other friends?
Another possibility:
length(unique(as.list(foo)))
which gives the expected result:

> length(unique(as.list(foo)))
[1] 3
NOTE: the use of length(unique()) is necessary, as uniqueN() will return an error here.
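For instance (my illustration; the exact behaviour may vary by data.table version):

uniqueN(as.list(foo))
# errors: uniqueN() expects an atomic vector or a data.frame/data.table, not a plain list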
Per the comment of @Ryan, you can also do:
length(unique.default(foo))
With regard to speed, both methods are comparable (when measured on a larger dataset of 5M rows):
> fooLarge <- foo[rep(1:nrow(foo), 1e6)]
> microbenchmark(length(unique.default(fooLarge)),
+                length(unique(as.list(fooLarge))))
Unit: milliseconds
                              expr     min       lq     mean   median       uq       max neval cld
  length(unique.default(fooLarge)) 94.0433 94.56920 95.24076 95.01492 95.67131 103.15433   100   a
 length(unique(as.list(fooLarge))) 94.0254 94.68187 95.17648 95.02672 95.49857  99.19411   100   a
If you want to retain only the unique columns, you could use:
# option 1
cols <- !duplicated(as.list(foo))
foo[, ..cols]
# option 2 (doesn't retain the column names)
as.data.table(unique.default(foo))
which gives (output of option 1 shown):

            x   y   z1
1: 0.06666667 0.4 0.10
2: 0.13333333 0.3 0.15
3: 0.20000000 0.2 0.20
4: 0.26666667 0.1 0.25
5: 0.33333333 0.0 0.30
Transpose and check for non-duplicates:

ncol(foo[, which(!duplicated(t(foo))), with = FALSE])
# [1] 3
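The combination hinted at in the question also works in principle (my addition, not part of this answer): data.table::transpose() turns columns into rows, so uniqueN() can count the unique rows of the result. Like t(), though, it materialises a full transposed copy, so it does not avoid extra allocations:

uniqueN(transpose(foo))
# [1] 3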
Another method which may be faster if you expect a large number of duplicates:
n_unique_cols <- function(foo) {
  # K holds the indices of columns still considered unique
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    # skip columns already identified as duplicates
    if (j %in% K) {
      foo_j <- .subset2(foo, j)
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          # cheap first-element check before the full identical() comparison
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            # column k duplicates column j: drop it from the candidate set
            K <- K[K != k]
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}
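Applied to the example table (my illustration):

n_unique_cols(foo)
# [1] 3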
Timings:
library(data.table)
create_foo <- function(row, col) {
  foo <- data.table(x = rnorm(row),
                    y = seq_len(row) - 2L)
  set.seed(1)
  # add col/2 shifted copies each of x and y; the sampled shifts repeat,
  # so many of the new columns are duplicates of each other
  for (k in seq_len(col %/% 2L)) {
    foo[, (paste0('x', k)) := x + sample(-4:4, size = 1)]
    foo[, (paste0('y', k)) := y + sample(-2:2, size = 1)]
  }
  foo
}
library(bench)
res <-
press(rows = c(1e5, 1e6, 1e7),
cols = c(10, 50, 100),
{
foorc <- create_foo(rows, cols)
bench::mark(n_unique_cols(foorc),
length(unique(as.list(foorc))))
})
plot(res)
For this family of data, this function is about twice as fast, but its memory consumption grows faster than that of unique(as.list(.)).