Data frame transformation in R
Here is my data frame.
df<-data.frame(
Brand=c("Brand_1","Brand_2","Brand_3","Brand_4","Brand_4","Brand_1","Brand_4","Brand_4","Brand_1","Brand_2","Brand_3","Brand_2","Brand_3","Brand_4"),
M=c("2014-6-1","2014-7-1","2014-8-1","2014-9-1","2014-10-1","2014-11-1","2014-12-1","2015-1-1","2014-2-1","2015-3-1","2014-4-1","2014-5-1","2014-6-1","2014-7-1"),
Price=c(55,55,55,55,58,58,58,58,58,58,59,60,61,62),
Quantity=c(140,150,NA,NA,NA,200,NA,NA,100,100,NA,NA,NA,100)
)
df$M<-as.Date(df$M)
Brand M Price Quantity
------------------------------------------
1 Brand_1 2014-06-01 55 140
2 Brand_1 2014-11-01 58 200
3 Brand_1 2014-12-01 58 100
4 Brand_2 2014-07-01 55 150
5 Brand_2 2015-03-01 58 100
6 Brand_2 2014-05-01 60 NA
7 Brand_3 2014-08-01 55 NA
8 Brand_3 2014-04-01 59 NA
9 Brand_3 2014-06-01 61 NA
10 Brand_4 2014-09-01 55 NA
11 Brand_4 2014-10-01 58 NA
12 Brand_4 2014-12-01 58 NA
13 Brand_4 2015-01-01 58 NA
14 Brand_4 2014-07-01 62 100
-------------------------------------------
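I would like to transform it, using dplyr or another package, into something like the table below. In other words, after the transformation I want to have the table below, which changes the following 4 things: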
1 Brand_1 2014-06-01 55 140 28
Brand_1 2014-07-01 55 NA 28
Brand_1 2014-08-01 55 NA 28
Brand_1 2014-09-01 55 NA 28
Brand_1 2014-10-01 55 NA 28
2 Brand_1 2014-11-01 58 200 200
3 Brand_1 2014-12-01 58 100 100
4 Brand_2 2014-07-01 55 150 150
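The table above is only an example for Brand_1 and Brand_2; it does not include Brand_3 and Brand_4.
I think this is what you want. There may be a more streamlined way to do this, but this shows the logic.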
library(dplyr)
library(tidyr)
First, clean up the data.frame() by converting M to a date and sorting by Brand and M. Then group by Brand and use tidyr::complete() to fill in the missing months.
df2 <- df %>%
  mutate(M = as.Date(as.character(M))) %>%   # make sure M is a Date
  arrange(Brand, M) %>%
  group_by(Brand) %>%
  # add one row per missing month between each brand's first and last M
  complete(M = seq.Date(min(M), max(M), by = '1 month'))
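To see what complete() does in isolation, here is a minimal sketch on a made-up data frame (the name toy and its values are hypothetical, not part of the question). The months between the first and last observed date are added as new rows, and the other columns are NA on those rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one brand observed in January and April only
toy <- data.frame(
  Brand = c("A", "A"),
  M     = as.Date(c("2014-01-01", "2014-04-01")),
  Price = c(10, 12)
)

filled <- toy %>%
  group_by(Brand) %>%
  # build the full monthly sequence per brand and add the missing rows
  complete(M = seq.Date(min(M), max(M), by = "1 month")) %>%
  ungroup()

filled
# February and March appear as new rows with Price = NA
```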
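Now we have some simple calculations. Create a Grouping variable that increases by one at every row where Quantity is not NA, so each row with a Quantity starts a new group that also holds the NA rows following it. The data is already sorted by M. Group on this variable and fill in Price by taking the min() of the group with NAs removed. Do something similar for Quantity1, but divide by n(), the group size.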
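The grouping trick can be tried on a plain vector first. This is a small base R sketch with a hypothetical quantity vector shaped like the Brand_1 rows of the desired table; ave() stands in here for group_by() plus mutate().

```r
# Hypothetical Quantity column: each value is followed by the NA months it covers
quantity <- c(140, NA, NA, NA, NA, 200, 100)

# cumsum() over "is not NA" starts a new group at every non-NA value
grouping <- cumsum(!is.na(quantity))
grouping
# 1 1 1 1 1 2 3

# Spread each group's value evenly over the rows of that group
quantity1 <- ave(quantity, grouping,
                 FUN = function(x) min(x, na.rm = TRUE) / length(x))
quantity1
# 28 28 28 28 28 200 100
```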
df2 %>%
  ungroup() %>%
  # counter that increases at every row with a non-NA Quantity
  mutate(Grouping = cumsum(!is.na(Quantity))) %>%
  group_by(Grouping) %>%
  mutate(Price = min(Price, na.rm = TRUE)) %>%
  mutate(Quantity1 = min(Quantity, na.rm = TRUE) / n())
# Groups: Grouping [6]
Brand M Price Quantity Grouping Quantity1
<fct> <date> <dbl> <dbl> <int> <dbl>
1 Brand_1 2014-02-01 58 100 1 25
2 Brand_1 2014-03-01 58 NA 1 25
3 Brand_1 2014-04-01 58 NA 1 25
4 Brand_1 2014-05-01 58 NA 1 25
5 Brand_1 2014-06-01 55 140 2 28
6 Brand_1 2014-07-01 55 NA 2 28
7 Brand_1 2014-08-01 55 NA 2 28
8 Brand_1 2014-09-01 55 NA 2 28
9 Brand_1 2014-10-01 55 NA 2 28
10 Brand_1 2014-11-01 58 200 3 66.7
# ... with 23 more rows
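Note that row 10 in the output above shows Quantity1 = 66.7, while the desired table has 200 for Brand_1 on 2014-11-01. This appears to happen because Grouping is computed after ungroup(), so the group that starts on Brand_1's last row also picks up the first Brand_2 rows, which have no Quantity. One possible variation, offered as a sketch rather than as the answer's method, computes Grouping within each Brand and then groups by both columns. Leaving Quantity1 as NA for stretches that contain no Quantity at all is an assumption; the question does not say what should happen there.

```r
library(dplyr)
library(tidyr)

# Data frame from the question
df <- data.frame(
  Brand = c("Brand_1", "Brand_2", "Brand_3", "Brand_4", "Brand_4", "Brand_1", "Brand_4",
            "Brand_4", "Brand_1", "Brand_2", "Brand_3", "Brand_2", "Brand_3", "Brand_4"),
  M = as.Date(c("2014-6-1", "2014-7-1", "2014-8-1", "2014-9-1", "2014-10-1", "2014-11-1",
                "2014-12-1", "2015-1-1", "2014-2-1", "2015-3-1", "2014-4-1", "2014-5-1",
                "2014-6-1", "2014-7-1")),
  Price = c(55, 55, 55, 55, 58, 58, 58, 58, 58, 58, 59, 60, 61, 62),
  Quantity = c(140, 150, NA, NA, NA, 200, NA, NA, 100, 100, NA, NA, NA, 100)
)

res <- df %>%
  arrange(Brand, M) %>%
  group_by(Brand) %>%
  complete(M = seq.Date(min(M), max(M), by = "1 month")) %>%
  # still grouped by Brand here, so the counter restarts for every brand
  mutate(Grouping = cumsum(!is.na(Quantity))) %>%
  group_by(Brand, Grouping) %>%
  mutate(
    Price = min(Price, na.rm = TRUE),
    # assumption: stretches without any Quantity stay NA
    Quantity1 = if (all(is.na(Quantity))) NA_real_
                else min(Quantity, na.rm = TRUE) / n()
  ) %>%
  ungroup() %>%
  select(-Grouping)

filter(res, Brand == "Brand_1")
```

With this grouping, the Brand_1 row for 2014-11-01 is alone in its group, so its Quantity1 is 200 / 1.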
If you want, you can call ungroup() at the end and then select(-Grouping) to remove this variable.