Remove duplicates but keep values in R
I have a dataframe with duplicate store/product combinations. I want to remove the duplicate rows, but I want to keep the costs for these products for each year.
Example dataframe:
store product year1 year2 year3
H&M shirt 20.00 29.95 NA
Mango trousers 49.95 NA NA
H&M trousers 39.95 NA 39.95
Mango trousers NA NA 44.95
How I want the dataset to look:
store product year1 year2 year3
H&M shirt 20.00 29.95 NA
H&M trousers 39.95 NA 39.95
Mango trousers 49.95 NA 44.95
I've used dplyr, but this only seemed to remove the duplicates instead of keeping all the cost values. Any help is appreciated!
Reproducible code:
df <- data.frame(store= c("H&M", "Mango", "H&M", "Mango"), product=c("shirt", "trousers", "trousers", "trousers"),
year1=c(20.95, 49.95, 39.95, NA), year2=c(29.95, NA, NA, NA), year3=c(NA,NA,39.95, 44.95))
You can use the dplyr package.
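library(dplyr)
# collapse duplicate store/product rows by summing each year column, ignoring NAs
dfn <- df %>%
  group_by(store, product) %>%
  summarise(year1 = sum(year1, na.rm = TRUE),
            year2 = sum(year2, na.rm = TRUE),
            year3 = sum(year3, na.rm = TRUE))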
When you print dfn, you get:
store product year1 year2 year3
<fctr> <fctr> <dbl> <dbl> <dbl>
1 H&M shirt 20.95 29.95 0.00
2 H&M trousers 39.95 0.00 39.95
3 Mango trousers 49.95 0.00 44.95
You wanted to group by two variables, so group_by() is the function best suited for it. I know that you want NAs where the 0s are; you can replace them in a subsequent line:
dfn[dfn == 0] <- NA  # index by cell, not by row; note this also blanks any genuine 0 cost
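Replacing zeros after the fact also overwrites any cost that really is 0. A minimal sketch of one way to avoid that is to return NA directly when a group has no values at all. The helper name sum_or_na is hypothetical (it is not part of dplyr), and across() assumes dplyr 1.0 or later:

```r
library(dplyr)

df <- data.frame(
  store   = c("H&M", "Mango", "H&M", "Mango"),
  product = c("shirt", "trousers", "trousers", "trousers"),
  year1   = c(20.95, 49.95, 39.95, NA),
  year2   = c(29.95, NA, NA, NA),
  year3   = c(NA, NA, 39.95, 44.95)
)

# Hypothetical helper: NA if every value in the group is NA,
# otherwise the sum of the non-missing values.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

dfn <- df %>%
  group_by(store, product) %>%
  summarise(across(starts_with("year"), sum_or_na), .groups = "drop")

print(dfn)
```

With this approach no second pass over the result is needed, and a true 0 in the data stays 0.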
Indeed, dplyr (together with tidyr, which provides gather() and spread()) is the way to go. First you gather() the data, then you group_by() and summarize(), and finally spread() it back, filling with NAs where values are missing, i.e.:
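library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr, not dplyr
df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))
new.df <- df %>%
  # long format; na.rm = TRUE drops NA cells so they cannot turn a group's sum into NA
  gather(year, value, -store, -product, na.rm = TRUE) %>%
  group_by(year, store, product) %>%
  summarize(sum.value = sum(value)) %>%
  ungroup() %>%
  # back to wide format; store/product/year combinations without a value become NA
  spread(key = year, value = sum.value, fill = NA)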
Using -store and -product tells gather() to leave these two columns alone and gather the remaining year columns into a key column called "year" and a numeric column called "value" (you can replace these with whatever names you like).
Then group_by() and summarize() make sure we don't end up with duplicates (using the sum of the values when several rows relate to the same store and product).
Finally, spread() gives the form you are looking for.
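For readers on a recent tidyr, the same gather/summarize/spread idea can be sketched with pivot_longer() and pivot_wider(), which superseded gather() and spread(). This is an illustrative variant rather than the answer's original code; it assumes tidyr 1.0 or later and dplyr 1.0 or later (for .groups), and the name new_df is chosen here for illustration:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  store   = c("H&M", "Mango", "H&M", "Mango"),
  product = c("shirt", "trousers", "trousers", "trousers"),
  year1   = c(20.95, 49.95, 39.95, NA),
  year2   = c(29.95, NA, NA, NA),
  year3   = c(NA, NA, 39.95, 44.95)
)

new_df <- df %>%
  # long format: one row per store/product/year, dropping NA cells
  pivot_longer(starts_with("year"),
               names_to = "year", values_to = "value",
               values_drop_na = TRUE) %>%
  group_by(store, product, year) %>%
  summarize(value = sum(value), .groups = "drop") %>%
  # wide format again; combinations with no value are left as NA
  pivot_wider(names_from = year, values_from = value)

print(new_df)
```

Dropping the NA cells while the data is in long format means a store/product/year combination with no cost simply has no row, so the final reshape leaves it as NA instead of 0.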
You have to be careful with how you treat duplicates and what you assume about them. This answer assumes that if two rows have the same product and store, the value you want is the sum of year1, the sum of year2 and the sum of year3 across those rows. If NAs are present in a group_by() group, sum() returns NA for that group unless the NAs are dropped, e.g. by adding na.rm = TRUE to the sum: summarize(sum.value = sum(value, na.rm = TRUE)). In that case a group that contains only NAs sums to 0 instead of NA.
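The way sum() treats NAs can be checked directly in base R. This small sketch uses values from the example data; the variable names are only for illustration:

```r
# One real value and one NA: the NA propagates.
with_na <- sum(c(49.95, NA))

# Same values with na.rm = TRUE: the NA is dropped before summing.
dropped <- sum(c(49.95, NA), na.rm = TRUE)

# Only NAs with na.rm = TRUE: nothing is left to sum, so the result is 0.
all_na <- sum(c(NA_real_, NA_real_), na.rm = TRUE)

print(with_na)  # NA
print(dropped)  # 49.95
print(all_na)   # 0
```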
However, the code I supplied works for the example you supplied and yields the tibble you wanted.