
Counting unique values based on two columns with repeated rows, R data frame

I have an R data frame with the following format:

column1    column2
NA         NA
1          A
1          A
1          A
NA         NA
NA         NA
2          B
2          B
NA         NA
NA         NA
3          A
3          A
3          A

df = structure(list(column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, 
NA, 3L, 3L, 3L), column2 = c(NA, "A", "A", "A", NA, NA, "B", 
"B", NA, NA, "A", "A", "A")), .Names = c("column1", "column2"
), row.names = c(NA, -13L), class = "data.frame")

If a row has an NA in one column, it also has an NA in the other column. The numerical value in column1 describes a unique group, e.g. rows 2-4 belong to group 1. The column column2 describes the identity of this grouping. In this data frame, the identity is either A, B, C, or D.

My goal is to tally the number of identities by group within the entire data frame: how many A groups there are, how many B groups, etc.

The correct output for this file (so far) is that there are 2 A groups and 1 B group.

How would I calculate this?

At the moment, I would try something like this:

length(df[df$column2 == "B"]) ## outputs 2 

but this is incorrect. If I combined column1 and column2 and took only the unique values 1A, 2B, 3A, I guess I could count how many times each label from column2 occurs?

(If it's easier, I'm happy to use data.table for this task.)

You can use rle to find the runs and table to tabulate them:

table(rle(df$column2)$values)

# A B 
# 2 1 

See ?rle and ?table for details.
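To see why this works, it can help to look at the intermediate result. The following is a minimal sketch that rebuilds the question's data with data.frame() (the names runs and run_counts are only for illustration) and inspects what rle returns before tabulating:

```r
# Same data as in the question, rebuilt so the snippet runs on its own.
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

# rle() collapses each run of consecutive equal values into a single entry.
runs <- rle(df$column2)
runs$values   # one entry per run; every NA forms a run of its own
runs$lengths  # how many rows each run spans

# table() drops NA by default, so only the labelled runs are counted.
run_counts <- table(runs$values)
run_counts
```

Note that this counts runs in column2, not group numbers. If two different groups with the same identity ever sat directly next to each other, with no NA row in between, they would be merged into one run, so this approach assumes groups are always separated as in the example.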

Or, if you want to take advantage of column1 (which is derived from column2):

table(unique(df)$column2)
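The one-liner relies on table dropping the single all-NA row that unique(df) keeps. A slightly more explicit sketch of the same idea, assuming you prefer to remove the NA rows first (groups and id_counts are illustrative names), might look like this:

```r
# Same data as in the question, rebuilt so the snippet runs on its own.
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Drop the NA rows explicitly, then keep one row per (column1, column2) pair.
groups <- unique(df[!is.na(df$column1), ])
groups

# Tally how many distinct groups carry each identity.
id_counts <- table(groups$column2)
id_counts
```

This is the "combine the columns, keep unique pairs, then count" idea from the question, written out step by step.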

The 'dplyr' package has simple functions for this:

library(dplyr)

df %>%
  filter(complete.cases(.) & !duplicated(.)) %>% 
  group_by(column2) %>%
  summarize(count = n())

The pipeline does the following:
  1. Filter out rows with NA
  2. Filter out duplicated rows; these represent individuals in the same group
  3. Group by the identity variable (column2)
  4. Count the rows per identity; each remaining row is one unique group (column1)
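
The same steps can also be sketched with dplyr's distinct() and count() verbs. This is only an alternative spelling of the pipeline, assuming a dplyr version recent enough to support the name argument of count(); group_counts is an illustrative name:

```r
library(dplyr)

# Same data as in the question, rebuilt so the snippet runs on its own.
df <- data.frame(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A"),
  stringsAsFactors = FALSE
)

group_counts <- df %>%
  filter(!is.na(column1)) %>%      # step 1: drop the NA rows
  distinct(column1, column2) %>%   # step 2: keep one row per group
  count(column2, name = "count")   # steps 3-4: tally groups per identity

group_counts
```

Here distinct() makes the deduplication key explicit, which may be easier to read if the data frame later gains extra columns.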

If you want to use data.table:

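library(data.table)
setDT(df)  # convert df to a data.table by reference

# One row per (column2, column1) pair, skipping the NA rows
d <- df[!is.na(column1), list(n=.N), by=list(column2,column1)]
# Count how many groups each identity has
d <- d[, list(n=.N), by=list(column2)]
d
#    column2 n
# 1:       A 2
# 2:       B 1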

Or more concisely as a one-liner:

setDT(df)[!is.na(column1), .N, by = .(column2, column1)][, .N, by = column2]
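
Another possible variant, sketched here under the assumption that counting distinct group numbers per identity is what is wanted, uses data.table's uniqueN() in a single grouping step (dt and group_counts are illustrative names):

```r
library(data.table)

# Same data as in the question, built directly as a data.table.
dt <- data.table(
  column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, NA, 3L, 3L, 3L),
  column2 = c(NA, "A", "A", "A", NA, NA, "B", "B", NA, NA, "A", "A", "A")
)

# Skip the NA rows, then count the distinct group numbers within each identity.
group_counts <- dt[!is.na(column1), .(n = uniqueN(column1)), by = column2]
group_counts
```

This avoids the intermediate table of (column2, column1) pairs, at the cost of being a little less explicit about the two-stage logic.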
