How to calculate sum on unique values in R

So here's the data:

DF1

ID  DOW     
1   Monday
1   Monday
1   Tuesday
2   Tuesday
2   Wednesday
3   Friday
3   Monday
3   Tuesday

I would like to join it with the following dictionary (lookup table).

DF2

ID DOW        Hours
1  Monday     20
1  Tuesday    21
2  Tuesday    30
2  Wednesday  25
3  Friday     24
3  Monday     42
3  Tuesday    54

My goal is to get, for each ID, the total count of entries as well as the total hours worked on that ID's days. But if a day appears twice for an ID, its hours must not be counted twice. (That's the hard part.)

Here's my attempt in R:

library(dplyr)

df3 <- df1 %>% 
  left_join(df2, by = c("DOW", "ID"))

df3 %>% 
  group_by(ID) %>% 
  summarize(count = n(),
            sum = sum(Hours)) %>% 
  mutate(injRate = count / sum)

This does not work: it correctly counts the number of entries for each ID, but it also adds Hours once per joined row, so a day that is entered more than once has its hours summed more than once.
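To see the double counting concretely, here is a minimal sketch that rebuilds the two tables from the question and runs the same join. The names `joined` and `by_id` are illustrative only, and the hours column is assumed to be called `Hours` as in DF2.

```r
library(dplyr)

df1 <- data.frame(
  ID  = c(1, 1, 1, 2, 2, 3, 3, 3),
  DOW = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday",
          "Friday", "Monday", "Tuesday")
)
df2 <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 3),
  DOW   = c("Monday", "Tuesday", "Tuesday", "Wednesday",
            "Friday", "Monday", "Tuesday"),
  Hours = c(20, 21, 30, 25, 24, 42, 54)
)

joined <- left_join(df1, df2, by = c("ID", "DOW"))

# ID 1 has Monday twice in df1, so Monday's 20 hours appear twice after the join
filter(joined, ID == 1)

# Summing the joined column therefore gives 61 for ID 1 instead of the wanted 41
by_id <- joined %>%
  group_by(ID) %>%
  summarise(count = n(), sum = sum(Hours))
by_id
```

The counts are already right at this point; only the sum needs the repeated ID/DOW rows removed before it is computed.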

End product should be:

ID count    sum
1      3     41
2      2     55
3      3    120

Again: count every entry, but sum the hours without double counting.

Another approach is to summarize the tables prior to joining them.

textFile1 <- "ID  DOW     
1   Monday
1   Monday
1   Tuesday
2   Tuesday
2   Wednesday
3   Friday
3   Monday
3   Tuesday"

textFile2 <- "ID DOW        Hours
1  Monday     20
1  Tuesday    21
2  Tuesday    30
2  Wednesday  25
3  Friday     24
3  Monday     42
3  Tuesday    54"
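
library(dplyr)

df1 <- read.table(text = textFile1, header = TRUE)
df2 <- read.table(text = textFile2, header = TRUE)

# Count entries per ID in the raw table, repeats included
counts <- df1 %>%
  group_by(ID) %>%
  summarise(count = n())

# Sum hours per ID in the lookup table, then attach the counts
df2 %>%
  group_by(ID) %>%
  summarise(sum = sum(Hours)) %>%
  left_join(counts, by = "ID") %>%
  mutate(injRate = count / sum)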

...and the output:

# A tibble: 3 x 4
     ID   sum count injRate
  <int> <int> <int>   <dbl>
1     1    41     3  0.0732
2     2    55     2  0.0364
3     3   120     3  0.025 
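Summarising before joining relies on df2 holding exactly one row per ID and DOW pair; if the dictionary itself contained repeats, `sum(Hours)` would double count again. A quick check, sketched here in base R with the same data (the name `hours_by_id` is made up for the example):

```r
df2 <- read.table(text = "ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54", header = TRUE)

# 0 means no (ID, DOW) pair occurs more than once
anyDuplicated(df2[c("ID", "DOW")])

# If repeated rows were possible, drop them before summing
hours_by_id <- aggregate(Hours ~ ID, unique(df2), sum)
hours_by_id
```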

Here is a base R option using merge + aggregate:

u <- merge(df1, df2, by = c("ID", "DOW"))
res <- setNames(
  merge(aggregate(DOW ~ ID, u, length),
    aggregate(Hours ~ ID, unique(u), sum),
    by = "ID"
  ),
  c("ID", "Count", "Sum")
)

which gives

> res
  ID Count Sum
1  1     3  41
2  2     2  55
3  3     3 120

An option with data.table:

library(data.table)
setDT(df1)[df2, .(Count = .N, Hours), on = .(ID), by = .EACHI][,
    .(Sum = sum(Hours)), .(ID, Count)]
#   ID Count Sum
#1:  1     3  41
#2:  2     2  55
#3:  3     3 120

Try this solution, which computes the count per ID first and then filters out the repeated days before the final summary:

library(tidyverse)
#Data
df3 <- df1 %>% 
  left_join(df2, by = c("DOW" ,"ID"))
#Code
df3 %>% 
  group_by(ID) %>% 
  mutate(count=n()) %>%
  filter(!duplicated(DOW)) %>%
  summarise(count=unique(count),Sum=sum(Hours))

Output:

# A tibble: 3 x 3
     ID count   Sum
  <int> <int> <int>
1     1     3    41
2     2     2    55
3     3     3   120

This worked for me: a combination of dplyr and base R.

df %>%
summarise(total = sum(value[!duplicated(value)]))

Here, I'm indexing value by a logical vector of TRUEs and FALSEs. Try this first:

!duplicated(value)

You'll see that it produces a vector of TRUEs and FALSEs. And this:

value[!duplicated(value)]

picks only the first occurrence of each value. So this:

sum(value[!duplicated(value)])

simply sums them up.
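As a small worked example of those steps in base R, using ID 1's joined hours from the question. The vectors `hours` and `day` in the second half are hypothetical and only there to illustrate a caveat: `duplicated()` compares values, so two different days that happen to have the same hours would be collapsed too. De-duplicating on the key (the day) avoids that.

```r
value <- c(20, 20, 21)        # Monday, Monday, Tuesday for ID 1

keep <- !duplicated(value)    # TRUE FALSE TRUE
value[keep]                   # 20 21
total <- sum(value[keep])     # 41

# Caveat: equal hours on different days (hypothetical data)
hours <- c(20, 20, 20)
day   <- c("Monday", "Monday", "Tuesday")

by_value <- sum(hours[!duplicated(hours)])  # 20: Tuesday's hours are lost
by_key   <- sum(hours[!duplicated(day)])    # 40: only the repeated Monday is dropped
```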
