Counting unique values in a two-column R data frame with repeated rows

Question

I have an R data frame with the following format:

column1    column2
NA         NA
1          A
1          A
1          A
NA         NA
NA         NA
2          B
2          B
NA         NA
NA         NA
3          A
3          A
3          A

df = structure(list(column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, 
NA, 3L, 3L, 3L), column2 = c(NA, "A", "A", "A", NA, NA, "B", 
"B", NA, NA, "A", "A", "A")), .Names = c("column1", "column2"
), row.names = c(NA, -13L), class = "data.frame")

If a row has NA in one column, it also has NA in the other. The numerical value in column1 describes a unique group, e.g. rows 2-4 belong to group 1. column2 describes the identity of that group. In this data frame, the identity is one of A, B, C, or D.

My goal is to tally the number of groups per identity across the entire data frame: how many A groups there are, how many B groups, etc.

The correct output for the data shown so far is 2 A groups and 1 B group.

How would I calculate this?

At the moment, I would try something like this:

length(df[df$column2 == "B"]) ## outputs 2

but this is incorrect. If I combined column1 and column2 and took only the unique values 1A, 2B, 3A, I guess I could count how many times each label from column2 occurs?

(If it's easier, I'm happy to use data.table for this task.)

Answer 1

You can use rle to find runs and table to tabulate them:

table(rle(df$column2)$values)

# A B 
# 2 1

See ?rle and ?table for details.
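
To see why this works, it may help to look at the intermediate result of rle on the question's data. The sketch below rebuilds the data frame with data.frame() so it is self-contained; the variable names runs and counts are only illustrative.

```r
# Same data as in the question
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

runs <- rle(df$column2)

# rle() never merges NAs, so every NA becomes its own run of length 1
runs$values   # NA "A" NA NA "B" NA NA "A"
runs$lengths  # 1 3 1 1 2 1 1 3

# table() ignores NA by default, leaving one entry per run of "A" or "B"
counts <- table(runs$values)
counts
```

Each run of identical non-NA labels counts once, which matches the number of groups only as long as consecutive groups are separated by NA rows or carry different labels.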

Or, if you want to take advantage of column1 (which is derived from column2):

table(unique(df)$column2)
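
Both approaches agree on the question's data, but they could differ if two groups with the same identity ever sit next to each other with no NA row in between. The question does not say whether that can happen, so the data frame df2 below is purely hypothetical and only illustrates the difference.

```r
# Hypothetical data: groups 1 and 2 are both "A" and are adjacent
df2 <- data.frame(
  column1 = c(1L, 1L, 2L, 2L, NA, 3L),
  column2 = c("A", "A", "A", "A", NA, "B"),
  stringsAsFactors = FALSE
)

# rle() sees a single long run of "A", so it reports one A group
by_runs <- table(rle(df2$column2)$values)

# unique() keeps the rows (1, A) and (2, A) apart, so it reports two A groups
by_unique <- table(unique(df2)$column2)

by_runs
by_unique
```

If group ids in column1 are reliable, the unique-based version is probably the safer choice for data shaped like this.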

Answer 2

The dplyr package has simple functions for this:

library(dplyr)

df %>%
  filter(complete.cases(.) & !duplicated(.)) %>% 
  group_by(column2) %>%
  summarize(count = n())

Filter out rows with NA
Filter out duplicated rows; these represent individuals in the same group
Group by the identity variable (column2)
Count the number of unique groups (column1)
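
For readers without dplyr, the same four steps can be sketched in base R. This is an equivalent under the same assumptions, not part of the answer's original code; keep, groups and counts are illustrative names.

```r
# Same data as in the question
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: drop rows containing NA and drop repeated rows
keep <- complete.cases(df) & !duplicated(df)
groups <- df[keep, ]   # one row per group: (1, A), (2, B), (3, A)

# Steps 3 and 4: count how many groups carry each identity
counts <- table(groups$column2)
counts
```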

Answer 3

If you want to use data.table:

library(data.table)
setDT(df)

d <- df[!is.na(column1), list(n=.N), by=list(column2,column1)]
d <- d[, list(n=.N), by=list(column2)]
d
#    column2 n
# 1:       A 2
# 2:       B 1

Or, more concisely, as a one-liner:

setDT(df)[!is.na(column1), .N, by = .(column2, column1)][, .N, by = column2]
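
A possible alternative, sketched here rather than taken from the answer, is to count distinct group ids per identity in a single step with data.table's uniqueN(). The names dt and res are illustrative.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A")
)

# Drop the NA rows, then count distinct group ids within each identity
res <- dt[!is.na(column1), .(n = uniqueN(column1)), by = column2]
res
```

This avoids the intermediate grouping by both columns, at the cost of relying on column1 to identify groups correctly.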

3 answers

Answer 1: 4, accepted, 2017-04-11 19:56:40
Answer 2: 4, 2017-04-11 19:58:53
Answer 3: 4, 2017-04-11 20:03:48
