How to calculate sum on unique values in R
So here's the data:
DF1
ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday
I would like to join it with the following lookup table.
DF2
ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54
My goal is to get the total count of entries for each ID, as well as the total hours worked. But if an ID/day combination appears in the list twice, its hours must not be counted twice. (That's the hard part.)
Here's my attempt, in the following R code:
library(dplyr)
df3 <- df1 %>%
  left_join(df2, by = c("DOW", "ID"))
df3 %>%
  group_by(ID) %>%
  summarize(count = n(),
            sum = sum(Hours)) %>%
  mutate(injRate = count / sum)
This does not work: it correctly counts the total number of entries for each ID, but it also adds Hours once per joined row, so the hours for an ID/day pair that is entered multiple times are summed multiple times.
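To make the double counting concrete, here is a minimal base R sketch on the sample data. It is only an illustration: `merge()` stands in for the `left_join()` above (every row matches here, so the result is the same), and the names `joined`, `naive`, and `deduped` are made up for this example.

```r
# Sample data from the question (df1 has a repeated ID 1 / Monday row).
df1 <- data.frame(
  ID  = c(1, 1, 1, 2, 2, 3, 3, 3),
  DOW = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday",
          "Friday", "Monday", "Tuesday")
)
df2 <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 3),
  DOW   = c("Monday", "Tuesday", "Tuesday", "Wednesday",
            "Friday", "Monday", "Tuesday"),
  Hours = c(20, 21, 30, 25, 24, 42, 54)
)

# The join keeps every df1 row, so ID 1 / Monday carries its 20 hours twice.
joined <- merge(df1, df2, by = c("ID", "DOW"))

# Naive per-ID sum over all joined rows (double counts the repeated row).
naive <- aggregate(Hours ~ ID, joined, sum)

# Per-ID sum over distinct ID/DOW/Hours rows only.
deduped <- aggregate(Hours ~ ID, unique(joined), sum)

print(naive)    # ID 1 sums to 61
print(deduped)  # ID 1 sums to 41
```

Only ID 1 differs between the two results, because it is the only ID with a repeated ID/day row.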
The end result should be:
ID count sum
1 3 41
2 2 55
3 3 120
Again: count every entry, but sum the hours without double counting.
Another approach is to summarize the tables prior to joining them.
textFile1 <- "ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday"
textFile2 <- "ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54"
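library(dplyr)
df1 <- read.table(text = textFile1, header = TRUE)
df2 <- read.table(text = textFile2, header = TRUE)
# Count the entries per ID in the table that contains the duplicates
counts <- df1 %>%
  group_by(ID) %>%
  summarise(count = n())
# Sum the hours per ID in the lookup table, where each ID/day appears once
df2 %>%
  group_by(ID) %>%
  summarize(sum = sum(Hours)) %>%
  left_join(counts, by = "ID") %>%
  mutate(injRate = count / sum)
...and the output: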
# A tibble: 3 x 4
ID sum count injRate
<int> <int> <int> <dbl>
1 1 41 3 0.0732
2 2 55 2 0.0364
3 3 120 3 0.025
Here is a base R option using merge + aggregate
u <- merge(df1, df2, by = c("ID", "DOW"))
res <- setNames(
merge(aggregate(DOW ~ ID, u, length),
aggregate(Hours ~ ID, unique(u), sum),
by = "ID"
),
c("ID", "Count", "Sum")
)
which gives
> res
ID Count Sum
1 1 3 41
2 2 2 55
3 3 3 120
An option with data.table
library(data.table)
setDT(df1)[df2, .(Count = .N, Hours), on = .(ID), by = .EACHI][,
.(Sum = sum(Hours)), .(ID, Count)]
# ID Count Sum
#1: 1 3 41
#2: 2 2 55
#3: 3 3 120
Try this solution, where you compute the count first and then filter out the duplicated days before the final summary:
library(tidyverse)
#Data
df3 <- df1 %>%
left_join(df2, by = c("DOW" ,"ID"))
#Code
df3 %>%
group_by(ID) %>%
mutate(count=n()) %>%
filter(!duplicated(DOW)) %>%
summarise(count=unique(count),Sum=sum(Hours))
Output:
# A tibble: 3 x 3
ID count Sum
<int> <int> <int>
1 1 3 41
2 2 2 55
3 3 3 120
This worked for me: a combination of dplyr and base R.
df %>%
summarise(total = sum(value[!duplicated(value)]))
Here, I'm indexing value by a vector of TRUEs and FALSEs. Try this first:
!duplicated(value)
You'll see that it produces a vector of TRUEs and FALSEs. And this:
value[!duplicated(value)]
Picks only non-duplicated values. So this:
sum(value[!duplicated(value)])
Simply sums them up.
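As a small worked example of those steps, here is a sketch on a toy vector. The vectors `value` and `other` are made-up illustrations, not the asker's real data; `value` mirrors the hours for ID 1 after the join.

```r
# Hours for ID 1 after the join: the Monday entry (20) appears twice.
value <- c(20, 20, 21)

keep <- !duplicated(value)  # TRUE FALSE TRUE: only first occurrences are TRUE
print(value[keep])          # 20 21
total <- sum(value[keep])   # 41
print(total)

# Caveat: this deduplicates on the hours themselves, so two different days
# that happen to have the same hours would collapse into one.
other <- c(30, 30)          # hypothetical: two distinct days, 30 hours each
collapsed <- sum(other[!duplicated(other)])
print(collapsed)            # 30, although the true total would be 60
```

Because of that caveat, deduplicating on the key columns (ID and DOW) rather than on the hours value is probably the safer choice when different days can share the same number of hours.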