How to make a new column in a data.frame that numbers the distinct dates in that data.frame?

I have a huge data.frame like this.

First, how can I add a new column "date1" to this data.frame that gives each UNIQUE day (year, month, day combination) its own number, with the numbers assigned in ascending date order?

Second, how can I add another column "date2" to this data.frame that counts the total number of different ids in a day?

    year  month day id
    2011    1   5   31
    2011    1   14  22
    2011    2   6   28
    2011    2   17  41
    2011    3   9   55
    2011    1   5   34
    2011    1   14  25
    2011    2   6   36
    2011    2   17  11
    2011    3   12  10

The result I expect looks like this. Please help!

    year month day  id date1 date2
    2011    1   5   31  1     2
    2011    1   14  22  2     2
    2011    2   6   28  3     2
    2011    2   17  41  4     2
    2011    3   9   55  5     1
    2011    1   5   34  1     2
    2011    1   14  25  2     2
    2011    2   6   36  3     2
    2011    2   17  11  4     2
    2011    3   12  10  6     1

We can first combine year, month and day into one column using unite and give a unique number to each distinct combination, then group_by the same combination and count the unique id values in each group using n_distinct. Note that the numbers follow the order in which each date first appears, which matches ascending date order for this sample data.

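library(dplyr)
library(tidyr)

df1 %>%
  # build a temporary "year-month-day" key, keeping the original columns
  unite(date, year, month, day, sep = "-", remove = FALSE) %>%
  # number each key in order of first appearance
  mutate(date1 = as.integer(factor(date, levels = unique(date)))) %>%
  group_by(date) %>%
  # count the distinct ids within each date
  mutate(date2 = n_distinct(id)) %>%
  ungroup() %>%
  select(-date)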


#    year month   day    id date1 date2
#   <int> <int> <int> <int> <int> <int>
# 1  2011     1     5    31     1     2
# 2  2011     1    14    22     2     2
# 3  2011     2     6    28     3     2
# 4  2011     2    17    41     4     2
# 5  2011     3     9    55     5     1
# 6  2011     1     5    34     1     2
# 7  2011     1    14    25     2     2
# 8  2011     2     6    36     3     2
# 9  2011     2    17    11     4     2
#10  2011     3    12    10     6     1
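
Numbering by first appearance only equals ascending date order when the earliest dates happen to come first, as they do in the sample data. The following is a minimal base R sketch of the difference, assuming the rows may arrive in any order; the names shuffled, key, by_appearance and by_date are illustrative and not part of the original answer.

```r
# Sample data from the question.
df1 <- data.frame(
  year  = rep(2011L, 10),
  month = c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 3L),
  day   = c(5L, 14L, 6L, 17L, 9L, 5L, 14L, 6L, 17L, 12L),
  id    = c(31L, 22L, 28L, 41L, 55L, 34L, 25L, 36L, 11L, 10L)
)

# Hypothetical reordering so that the first row is no longer the earliest date.
shuffled <- df1[c(10, 3, 1, 7, 5, 2, 9, 4, 6, 8), ]

# Numbering by first appearance, as factor(levels = unique(...)) does.
key <- paste(shuffled$year, shuffled$month, shuffled$day, sep = "-")
by_appearance <- as.integer(factor(key, levels = unique(key)))

# Numbering by ascending date: convert to Date so the order is chronological,
# then look up each date's position among the sorted unique dates.
d <- as.Date(sprintf("%d-%02d-%02d", shuffled$year, shuffled$month, shuffled$day))
by_date <- match(d, sort(unique(d)))

print(cbind(shuffled, by_appearance, by_date))
```

If ascending order matters for your real data, the match() on sorted unique dates is the safer of the two.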

We can do this more compactly in tidyverse by getting the group_indices of 'year', 'month', 'day' inside the group_by and then creating 'date2' as the number of distinct elements of 'id' (n_distinct). Because grouping sorts the keys, the indices come out in ascending date order.

library(tidyverse)
df1 %>%
  # note: passing columns to group_indices() is deprecated as of dplyr 1.0.0
  group_by(date1 = group_indices(., year, month, day)) %>%
  mutate(date2 = n_distinct(id))
# A tibble: 10 x 6
# Groups:   date1 [6]
#    year month   day    id date1 date2
#   <int> <int> <int> <int> <int> <int>
# 1  2011     1     5    31     1     2
# 2  2011     1    14    22     2     2
# 3  2011     2     6    28     3     2
# 4  2011     2    17    41     4     2
# 5  2011     3     9    55     5     1
# 6  2011     1     5    34     1     2
# 7  2011     1    14    25     2     2
# 8  2011     2     6    36     3     2
# 9  2011     2    17    11     4     2
#10  2011     3    12    10     6     1
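
On dplyr 1.0.0 or later, the same idea can be sketched without group_indices by grouping first and calling cur_group_id() inside mutate. This is an alternative to the answer's code rather than a replacement for it, and the name res is only illustrative.

```r
library(dplyr)

# Sample data from the question.
df1 <- data.frame(
  year  = rep(2011L, 10),
  month = c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 3L),
  day   = c(5L, 14L, 6L, 17L, 9L, 5L, 14L, 6L, 17L, 12L),
  id    = c(31L, 22L, 28L, 41L, 55L, 34L, 25L, 36L, 11L, 10L)
)

res <- df1 %>%
  group_by(year, month, day) %>%
  mutate(
    date1 = cur_group_id(),   # index of the current (year, month, day) group
    date2 = n_distinct(id)    # number of distinct ids on that date
  ) %>%
  ungroup()

print(res)
```

Unlike the version above, this one ends with ungroup(), so the result is a plain tibble with no grouping attached.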

Or another compact option with data.table, using similar logic. Here .GRP numbers the groups in the order they first appear, which coincides with ascending date order for this data.

library(data.table)
setDT(df1)[, date1 := .GRP, .(year, month, day)][, date2 := uniqueN(id), date1][]
#     year month day id date1 date2
# 1: 2011     1   5 31     1     2
# 2: 2011     1  14 22     2     2
# 3: 2011     2   6 28     3     2
# 4: 2011     2  17 41     4     2
# 5: 2011     3   9 55     5     1
# 6: 2011     1   5 34     1     2
# 7: 2011     1  14 25     2     2
# 8: 2011     2   6 36     3     2
# 9: 2011     2  17 11     4     2
#10: 2011     3  12 10     6     1

Or this can be done with interaction and ave from base R

df1$date1 <- with(df1, as.integer(interaction(year, month, day, 
         drop = TRUE, lex.order = TRUE)))
df1$date2 <- with(df1, ave(id, date1, FUN = function(x) length(unique(x))))
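
The base R version is shown without output, so here is a small self-contained check, assuming the sample data from the question, that it reproduces the expected date1 and date2 columns. The vectors expected_date1 and expected_date2 are copied from the desired result in the question.

```r
# Sample data from the question.
df1 <- data.frame(
  year  = rep(2011L, 10),
  month = c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 3L),
  day   = c(5L, 14L, 6L, 17L, 9L, 5L, 14L, 6L, 17L, 12L),
  id    = c(31L, 22L, 28L, 41L, 55L, 34L, 25L, 36L, 11L, 10L)
)

# lex.order = TRUE makes year vary slowest and day fastest, so the factor
# levels (and therefore the integer codes) follow ascending date order;
# drop = TRUE discards combinations that never occur in the data.
df1$date1 <- with(df1, as.integer(interaction(year, month, day,
                                              drop = TRUE, lex.order = TRUE)))

# ave() applies the function per date1 group and recycles the result
# back onto every row of that group.
df1$date2 <- with(df1, ave(id, date1, FUN = function(x) length(unique(x))))

# Columns from the expected result in the question.
expected_date1 <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 6)
expected_date2 <- c(2, 2, 2, 2, 1, 2, 2, 2, 2, 1)
stopifnot(all(df1$date1 == expected_date1),
          all(df1$date2 == expected_date2))
```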

data

df1 <- structure(list(year = c(2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
2011L, 2011L, 2011L, 2011L), month = c(1L, 1L, 2L, 2L, 3L, 1L, 
1L, 2L, 2L, 3L), day = c(5L, 14L, 6L, 17L, 9L, 5L, 14L, 6L, 17L, 
12L), id = c(31L, 22L, 28L, 41L, 55L, 34L, 25L, 36L, 11L, 10L
)), class = "data.frame", row.names = c(NA, -10L))
