R: How to spread, group_by, summarise and mutate at the same time
I want to spread the data below (only the first 12 rows are shown) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then I want to calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName     Days        pCountry    Revenue     Orders  Year
United Kingdom  0-1 days    India       2604.799    13      2014
Norway          8-14 days   Australia   5631.123    9       2015
US              31-45 days  UAE         970.8324    2       2014
United Kingdom  4-7 days    Austria     94.3814     1       2015
Norway          8-14 days   Slovenia    939.8392    3       2014
South Korea     46-60 days  Germany     1959.4199   15      2014
UK              8-14 days   Poland      1394.9096   6       2015
UK              61-90 days  Lithuania   -170.8035   -1      2015
US              8-14 days   Belize      1687.68     5       2014
Australia       46-60 days  Chile       888.72      2       2014
US              15-30 days  Turkey      2320.7355   8       2014
Australia       0-1 days    Hong Kong   672.1099    2       2015
I can make this work with a smaller test dataframe, but with the full data I only get endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows'. After hours of reading the dplyr docs and trying things, I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to the one below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
                 Year = sample(2014:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

# Pct > 0 means orders fell from 2014 to 2015 (same formula as in the question)
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`)/`2014` * 100)
  Country `2014` `2015`        Pct
1       A    575    599  -4.173913
2       B    457    486  -6.345733
3       C    481    319  33.679834
4       D    423    481 -13.711584
5       E    528    551  -4.356061
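For readers on a newer tidyr (1.0.0 or later), where spread is superseded, the same sum-first idea can be sketched with pivot_wider. This is only an illustration under stated assumptions: the orders data frame and its values are made up here, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not from the question)
orders <- data.frame(
  CountryName = c("UK", "UK", "UK", "US", "US", "Norway", "Norway"),
  Year        = c(2014, 2014, 2015, 2014, 2015, 2014, 2015),
  Orders      = c(10, 5, 12, 8, 10, 3, 9)
)

wide <- orders %>%
  group_by(CountryName, Year) %>%
  # Sum while the data are still long: one row per country and year
  summarise(sum_orders = sum(Orders, na.rm = TRUE), .groups = "drop") %>%
  # Then widen: one column per year
  pivot_wider(names_from = Year, values_from = sum_orders) %>%
  # Same sign convention as above: positive means orders fell
  mutate(Pct = (`2014` - `2015`) / `2014` * 100)

print(wide)
```

Because the sums are computed before widening, each country and year pair appears only once, so the widening step has no duplicate keys to trip over.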
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
                 Year = sample(2010:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
   Country  Year sum_orders        Pct
    <fctr> <int>      <int>      <dbl>
1        A  2010        205         NA
2        A  2011        144  29.756098
3        A  2012        226 -56.944444
4        A  2013        119  47.345133
5        A  2014        177 -48.739496
6        A  2015        303 -71.186441
7        B  2010        146         NA
8        B  2011        159  -8.904110
9        B  2012        152   4.402516
10       B  2013        180 -18.421053
# ... with 20 more rows
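As a side note, c(NA, -diff(x)) is the same as lag(x) - x, so the year-over-year step can also be written with lag alone. A minimal sketch, assuming a hypothetical, already-summarised data frame called yearly with made-up values:

```r
library(dplyr)

# Hypothetical per-country, per-year sums (not from the answer's random data)
yearly <- data.frame(
  Country    = c("A", "A", "A", "B", "B", "B"),
  Year       = c(2010, 2011, 2012, 2010, 2011, 2012),
  sum_orders = c(200, 150, 300, 100, 110, 99)
)

pct <- yearly %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%
  # lag() works within each group, so the first year of each country is NA
  mutate(Pct = (lag(sum_orders) - sum_orders) / lag(sum_orders) * 100) %>%
  ungroup()

print(pct)
```

The sign convention matches the code above: a positive Pct means orders dropped relative to the previous year.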
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1
You're getting the error 'duplicate identifiers for rows' likely because of spread. spread wants to make N columns from your N unique values, but it needs to know which unique row to place those values in. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. The quick fix is to add a row identifier before spread: data %>% mutate(row=row_number()) %>% spread...
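A small sketch of that quick fix, assuming a hypothetical two-row data frame called dup. The first four columns mirror the duplicated rows above, while the Orders and Year values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Two rows that agree on every identifying column.
# Orders and Year are assumed values, not taken from the question.
dup <- data.frame(
  CountryName = c("United Kingdom", "United Kingdom"),
  Days        = c("0-1 days", "0-1 days"),
  pCountry    = c("India", "India"),
  Revenue     = c(2604.799, 2604.799),
  Orders      = c(13, 13),
  Year        = c(2014, 2014)
)

# Spreading directly fails: both rows would land in the same output cell
direct <- try(spread(dup, Year, Orders), silent = TRUE)
print(inherits(direct, "try-error"))

# A row number makes every row uniquely identifiable, so spread succeeds
fixed <- dup %>%
  mutate(row = row_number()) %>%
  spread(Year, Orders)

print(fixed)
```

The row column is only a helper: it keeps the two duplicates apart during the spread and can be dropped or ignored once the values are summed per country.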
Error 2
You're getting the error 'sum not meaningful for factors' likely because of summarise_all. summarise_all will operate on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`, na.rm=TRUE), `2015_Sum` = sum(`2015`, na.rm=TRUE)). The backticks are required because the column names start with a digit, and na.rm=TRUE is kept from your code because spread leaves NA in the year columns.
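Putting both fixes together, the pipeline from the question could look roughly like the sketch below. The orders data frame and its values are hypothetical, and percent_inc keeps the formula from the question, so a positive value means orders fell.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a non-numeric column and a duplicated key
orders <- data.frame(
  CountryName = c("UK", "UK", "UK", "US", "US"),
  Days        = c("0-1 days", "0-1 days", "4-7 days", "8-14 days", "8-14 days"),
  Orders      = c(10, 5, 12, 8, 10),
  Year        = c(2014, 2014, 2015, 2014, 2015)
)

result <- orders %>%
  mutate(row = row_number()) %>%   # fix for Error 1: unique row identifier
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  # fix for Error 2: sum only the numeric year columns, by name
  summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE),
            `2015_Sum` = sum(`2015`, na.rm = TRUE)) %>%
  mutate(percent_inc = 100 * ((`2014_Sum` - `2015_Sum`) / `2014_Sum`))

print(result)
```

Naming the columns explicitly in summarise keeps Days and the helper row column out of the sum, which is what summarise_all was tripping over.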