How to make a new column in a data.frame that counts the number of different rows in that data.frame?
I have a huge data.frame like the one below.
First, how can I add a new column "date1" to this data.frame that numbers each unique date (year, month, day), with the numbers assigned in ascending order?
Second, how can I add another column "date2" that counts the total number of different ids on each date?
year month day id
2011 1 5 31
2011 1 14 22
2011 2 6 28
2011 2 17 41
2011 3 9 55
2011 1 5 34
2011 1 14 25
2011 2 6 36
2011 2 17 11
2011 3 12 10
The result I expect looks like this. Please help!
year month day id date1 date2
2011 1 5 31 1 2
2011 1 14 22 2 2
2011 2 6 28 3 2
2011 2 17 41 4 2
2011 3 9 55 5 1
2011 1 5 34 1 2
2011 1 14 25 2 2
2011 2 6 36 3 2
2011 2 17 11 4 2
2011 3 12 10 6 1
We can first combine year, month and day into one column using unite and give a unique number to each group of that combination, then group_by the same combination and count the unique id values for each combination using n_distinct.
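library(dplyr)
library(tidyr)

df1 %>%
  # build a temporary "date" key, keeping the original columns
  unite(date, year, month, day, sep = "-", remove = FALSE) %>%
  # number each date in order of first appearance
  mutate(date1 = as.integer(factor(date, levels = unique(date)))) %>%
  group_by(date) %>%
  # count the distinct ids within each date
  mutate(date2 = n_distinct(id)) %>%
  ungroup() %>%
  select(-date)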
# year month day id date1 date2
# <int> <int> <int> <int> <int> <int>
# 1 2011 1 5 31 1 2
# 2 2011 1 14 22 2 2
# 3 2011 2 6 28 3 2
# 4 2011 2 17 41 4 2
# 5 2011 3 9 55 5 1
# 6 2011 1 5 34 1 2
# 7 2011 1 14 25 2 2
# 8 2011 2 6 36 3 2
# 9 2011 2 17 11 4 2
#10 2011 3 12 10 6 1
We can do this more compactly in the tidyverse by getting the group_indices of 'year', 'month', 'day' inside group_by, and then creating 'date2' as the number of distinct elements of 'id' (n_distinct).
library(tidyverse)
df1 %>%
  group_by(date1 = group_indices(., year, month, day)) %>%
  mutate(date2 = n_distinct(id))
# A tibble: 10 x 6
# Groups: date1 [6]
# year month day id date1 date2
# <int> <int> <int> <int> <int> <int>
# 1 2011 1 5 31 1 2
# 2 2011 1 14 22 2 2
# 3 2011 2 6 28 3 2
# 4 2011 2 17 41 4 2
# 5 2011 3 9 55 5 1
# 6 2011 1 5 34 1 2
# 7 2011 1 14 25 2 2
# 8 2011 2 6 36 3 2
# 9 2011 2 17 11 4 2
#10 2011 3 12 10 6 1
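In more recent dplyr releases (1.0.0 onward, as far as I know), passing grouping columns to group_indices() is deprecated in favour of cur_group_id(). The following is only a minimal sketch of the same idea, assuming dplyr >= 1.0.0 is installed; the data is the sample from this thread and the name res is just for illustration.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  year  = rep(2011L, 10),
  month = c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 3L),
  day   = c(5L, 14L, 6L, 17L, 9L, 5L, 14L, 6L, 17L, 12L),
  id    = c(31L, 22L, 28L, 41L, 55L, 34L, 25L, 36L, 11L, 10L)
)

res <- df1 %>%
  group_by(year, month, day) %>%
  mutate(
    date1 = cur_group_id(),   # index of the current (year, month, day) group
    date2 = n_distinct(id)    # number of distinct ids in that group
  ) %>%
  ungroup()

res
```

Groups are numbered in the sorted order of the grouping keys, so for this sample the result should match the expected output in the question.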
Or another compact option with data.table (using the same logic):
library(data.table)
setDT(df1)[, date1 := .GRP, .(year, month, day)][, date2 := uniqueN(id), date1][]
# year month day id date1 date2
# 1: 2011 1 5 31 1 2
# 2: 2011 1 14 22 2 2
# 3: 2011 2 6 28 3 2
# 4: 2011 2 17 41 4 2
# 5: 2011 3 9 55 5 1
# 6: 2011 1 5 34 1 2
# 7: 2011 1 14 25 2 2
# 8: 2011 2 6 36 3 2
# 9: 2011 2 17 11 4 2
#10: 2011 3 12 10 6 1
Or this can be done with interaction and ave from base R:
df1$date1 <- with(df1, as.integer(interaction(year, month, day,
drop = TRUE, lex.order = TRUE)))
df1$date2 <- with(df1, ave(id, date1, FUN = function(x) length(unique(x))))
Data used in the examples:
df1 <- structure(list(year = c(2011L, 2011L, 2011L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L), month = c(1L, 1L, 2L, 2L, 3L, 1L,
1L, 2L, 2L, 3L), day = c(5L, 14L, 6L, 17L, 9L, 5L, 14L, 6L, 17L,
12L), id = c(31L, 22L, 28L, 41L, 55L, 34L, 25L, 36L, 11L, 10L
)), class = "data.frame", row.names = c(NA, -10L))
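One detail worth checking on real data: the answers number the dates in two slightly different ways. factor(date, levels = unique(date)) and .GRP number dates in order of first appearance, while interaction(..., lex.order = TRUE) numbers them in sorted order. On the sample data both agree, because the dates first appear in ascending order. The sketch below uses a small made-up data frame (toy is hypothetical, not from the question) to show where they diverge.

```r
# Hypothetical toy data: the later date (March 9) appears before the earlier one
toy <- data.frame(
  year  = c(2011L, 2011L, 2011L),
  month = c(3L, 1L, 3L),
  day   = c(9L, 5L, 9L)
)

# Numbering by order of first appearance
key <- with(toy, paste(year, month, day, sep = "-"))
first_seen <- match(key, unique(key))
first_seen
# 1 2 1

# Numbering by sorted (year, month, day) order
sorted <- with(toy, as.integer(interaction(year, month, day,
                                           drop = TRUE, lex.order = TRUE)))
sorted
# 2 1 2
```

If date1 must be ascending by calendar date regardless of row order, the sorted variant is the safer choice; alternatively, sort the data by year, month and day first.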