Remove duplicates but keep values in R

I have a dataframe with duplicate store/product combinations. I want to remove the duplicate rows, but keep the costs of these products for each year.

Example dataframe:

store    product    year1  year2  year3 
H&M      shirt      20.00  29.95  NA
Mango    trousers   49.95  NA     NA
H&M      trousers   39.95  NA     39.95
Mango    trousers   NA     NA     44.95

How I want the dataset to look:

store    product    year1  year2  year3 
H&M      shirt      20.00  29.95  NA
H&M      trousers   39.95  NA     39.95
Mango    trousers   49.95  NA     44.95

I've used dplyr, but this only seemed to remove the duplicates instead of keeping all the cost values. Any help is appreciated!

Reproducible code:

df <- data.frame(store= c("H&M", "Mango", "H&M", "Mango"), product=c("shirt", "trousers", "trousers", "trousers"), 
                 year1=c(20.95, 49.95, 39.95, NA), year2=c(29.95, NA, NA, NA), year3=c(NA,NA,39.95, 44.95))

You can use the dplyr package.

library(dplyr)

# Collapse each store/product pair to one row, summing the yearly costs
dfn <- df %>%
  group_by(store, product) %>%
  summarise(year1 = sum(year1, na.rm = TRUE),
            year2 = sum(year2, na.rm = TRUE),
            year3 = sum(year3, na.rm = TRUE))

When you print out dfn, you get

   store  product year1 year2 year3
  <fctr>   <fctr> <dbl> <dbl> <dbl>
1    H&M    shirt 20.95 29.95  0.00
2    H&M trousers 39.95  0.00 39.95
3  Mango trousers 49.95  0.00 44.95

You want to group by two variables, so group_by() is the best fit. I know that you want NAs where the 0s are; you can replace them in a subsequent line:

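# Turn the 0s produced by sum(na.rm = TRUE) back into NA (year columns only; needs dplyr >= 1.0)
dfn <- dfn %>% mutate(across(starts_with("year"), ~ na_if(.x, 0)))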
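Replacing 0 with NA after the fact also turns a genuine total of 0 into NA. One way around that, sketched here under the assumption that dplyr 1.0 or later is available, is to decide inside the summary whether a group has any data at all. The helper sum_or_na() is a hypothetical name introduced for this illustration, not a dplyr function.

```r
library(dplyr)

df <- data.frame(
  store   = c("H&M", "Mango", "H&M", "Mango"),
  product = c("shirt", "trousers", "trousers", "trousers"),
  year1   = c(20.95, 49.95, 39.95, NA),
  year2   = c(29.95, NA, NA, NA),
  year3   = c(NA, NA, 39.95, 44.95)
)

# Hypothetical helper (not part of dplyr): sum the non-missing values,
# but return NA when every value in the group is missing.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

dfn <- df %>%
  group_by(store, product) %>%
  # Apply the helper to every column whose name starts with "year"
  summarise(across(starts_with("year"), sum_or_na), .groups = "drop")

print(dfn)
```

With this variant, no separate replacement step is needed, and adding a year4 column would not require touching the summarise() call.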

Indeed, dplyr is the way to go, together with tidyr, which provides gather() and spread(). First you gather() the data, then you group_by() and summarize(), and finally you spread() it back, filling with NAs where values are missing, i.e.:

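library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr

df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))
new.df <- df %>%
  # na.rm = TRUE drops missing costs, so a known value is never summed with NA
  gather(year, value, -store, -product, na.rm = TRUE) %>%
  group_by(year, store, product) %>%
  summarize(sum.value = sum(value)) %>%
  # store/product/year combinations with no data come back as NA
  spread(key = year, value = sum.value, fill = NA)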

Using -store and -product tells gather() to leave these two variables alone, gather the remaining year columns into a "year" column, and call the new numeric column "value" (you can replace this with whatever name you like).

Then group_by() and summarize() make sure we don't end up with duplicates (the values are summed when several rows relate to the same store and product).

Finally, spread() gives the form you are looking for.
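Since tidyr 1.0.0, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the reshape, summarise, reshape steps is sketched below, assuming a recent tidyr and dplyr 1.0 or later (for the .groups argument). It also assumes that missing costs should be dropped before summing, which is what values_drop_na = TRUE does.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  store   = c("H&M", "Mango", "H&M", "Mango"),
  product = c("shirt", "trousers", "trousers", "trousers"),
  year1   = c(20.95, 49.95, 39.95, NA),
  year2   = c(29.95, NA, NA, NA),
  year3   = c(NA, NA, 39.95, 44.95)
)

new.df <- df %>%
  # Long format: one row per store/product/year, missing costs dropped
  pivot_longer(starts_with("year"),
               names_to = "year", values_to = "value",
               values_drop_na = TRUE) %>%
  group_by(store, product, year) %>%
  summarise(sum.value = sum(value), .groups = "drop") %>%
  # Wide format again; combinations without data are filled with NA
  pivot_wider(names_from = year, values_from = sum.value)

print(new.df)
```

One caveat of dropping NAs before reshaping: a year column that is missing for every row would not reappear in the wide result.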

You have to be careful with how you treat duplicates and what you assume about them. This answer assumes that if two rows have the same product and store, the result you want is the sum of year1, the sum of year2, and the sum of year3. Keep in mind that sum() returns NA for any group that contains an NA unless you add na.rm = TRUE, i.e. summarize(sum.value = sum(value, na.rm = TRUE)). With na.rm = TRUE, a group whose values are all missing sums to 0 instead of NA.
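The way sum() treats missing values can be checked in base R without any packages. The vectors below are made up for illustration, mirroring a group with one known cost and a group with none.

```r
# A group with one known cost and one missing cost
mixed <- c(49.95, NA_real_)
# A group where every cost is missing
all_missing <- c(NA_real_, NA_real_)

with_na    <- sum(mixed)                      # NA: a single NA poisons the sum
without_na <- sum(mixed, na.rm = TRUE)        # 49.95: the NA is ignored
empty_sum  <- sum(all_missing, na.rm = TRUE)  # 0: nothing left to add up

print(c(with_na, without_na, empty_sum))
```

The last case is why the summed table shows 0 rather than NA for a store/product pair that has no cost in a given year.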

However, the code I supplied works for the example you supplied, and yields the tibble you wanted.
