Counting unique values based on two columns with repeated rows, R data frame
I have an R data frame with the following format:
column1 column2
NA NA
1 A
1 A
1 A
NA NA
NA NA
2 B
2 B
NA NA
NA NA
3 A
3 A
3 A
df = structure(list(column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA,
NA, 3L, 3L, 3L), column2 = c(NA, "A", "A", "A", NA, NA, "B",
"B", NA, NA, "A", "A", "A")), .Names = c("column1", "column2"
), row.names = c(NA, -13L), class = "data.frame")
If a row has an NA in one column, it also has an NA in the other column. The numerical value in column1 identifies a unique group, e.g. rows 2-4 belong to group 1. The column column2 describes the identity of that group. In this data frame, the identity is one of A, B, C, or D.
My goal is to tally the number of identities by group within the entire data frame: how many A groups there are, how many B groups, etc.
The correct output for this data (so far) is 2 A groups and 1 B group.
How would I calculate this?
At the moment, I would try something like this:
length(df[df$column2 == "B"]) ## outputs 2
but this is incorrect. If I combined column1 and column2 and took only the unique values 1A, 2B, 3A, I guess I could count how many times each label from column2 occurs?
(If it's easier, I'm happy to use data.table for this task.)
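To make the distinction in the question concrete, here is a minimal base R sketch. It assumes the attempt above was meant to count the rows labelled "B", and it rebuilds the same data with data.frame() rather than structure(). The names rows_b and groups_b are only illustrative.

```r
# Same data as in the question, rebuilt with data.frame()
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Number of ROWS labelled "B"; na.rm = TRUE skips the NA comparisons
rows_b <- sum(df$column2 == "B", na.rm = TRUE)

# Number of distinct GROUPS labelled "B"; which() drops the NA comparisons
groups_b <- length(unique(df$column1[which(df$column2 == "B")]))

rows_b    # 2
groups_b  # 1
```

Counting rows gives 2 for B, while counting distinct group numbers gives the desired 1.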
You can use rle for runs and table for tabulation:
table(rle(df$column2)$values)
# A B
# 2 1
See ?rle and ?table for details.
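As a rough illustration of why this works on the sample data, the sketch below shows the intermediate result of rle. The vector x is just column2 copied out by hand. Note that rle treats every NA as its own run, and table drops NA values by default, so only the labelled runs are counted.

```r
# column2 from the question, written out as a plain character vector
x <- c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A")

runs <- rle(x)
runs$values   # NA "A" NA NA "B" NA NA "A"
runs$lengths  # 1 3 1 1 2 1 1 3

# table() ignores NA by default, leaving one entry per labelled run
counts <- table(runs$values)
counts
```

This relies on consecutive groups being separated by NA rows or by a change of label, as they are in the sample data.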
Or, if you want to take advantage of column1 (which is derived from column2):
table(unique(df)$column2)
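The two approaches can differ when two groups with the same identity sit directly next to each other. The sketch below uses a hypothetical data frame df2 (not from the question) in which groups 1 and 2 are both labelled A with no NA rows between them.

```r
# Hypothetical data: two adjacent groups, both labelled "A"
df2 <- data.frame(
  column1 = c(1L, 1L, 2L, 2L),
  column2 = c("A", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# rle() sees one uninterrupted run of "A", so it reports a single group
by_runs <- table(rle(df2$column2)$values)

# unique() keeps one row per (column1, column2) pair, so it reports two groups
by_pairs <- table(unique(df2)$column2)

by_runs[["A"]]   # 1
by_pairs[["A"]]  # 2
```

If the real data always has NA rows between groups, both approaches should agree; otherwise the version based on unique is likely the safer choice.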
The dplyr package has simple functions for this:
library(dplyr)
df %>%
filter(complete.cases(.) & !duplicated(.)) %>%
group_by(column2) %>%
summarize(count = n())
If you want to use data.table:
library(data.table)
setDT(df)
d <- df[!is.na(column1), list(n=.N), by=list(column2,column1)]
d <- d[, list(n=.N), by=list(column2)]
d
#    column2 n
# 1:       A 2
# 2:       B 1
Or more concisely as a one-liner:
setDT(df)[!is.na(column1), .N, by = .(column2, column1)][, .N, by = column2]
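For comparison, the same "count distinct groups per label" idea can be sketched in base R with tapply, without any extra packages. This is only an illustrative alternative; it rebuilds the question's data with data.frame(), and the name counts is arbitrary.

```r
# Same data as in the question, rebuilt with data.frame()
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

# For each label in column2, count the distinct group numbers in column1.
# Rows whose label is NA are dropped because NA is not a factor level.
counts <- tapply(df$column1, df$column2, function(g) length(unique(g)))
counts
```

On the sample data this gives 2 for A and 1 for B, matching the data.table result.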