
Automate coding (sum) in R

First of all, I would like to apologise if I do not use the correct jargon.

I have the dataset below, which contains a wide range of categories.

Here is an excerpt from dput (using droplevels):

df <- structure(list(
x = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L), # [ME: there are more years than 2010...]
y = c(7.85986, 185.81068, 107.24097, 7094.74649, 
1.4982, 185.77319, 5090.79354, 167.58584, 4189.64609, 157.08277, 
3927.06932, 2.86732, 71.683, 4.70123, 117.53085, 2.93452, 73.36292, 
1.4982, 18.18734, 901.14744, 0.90268, 13.77532, 613.38298, 0.01845, 
0.0681, 7.19925, 3.75315, 0.14333, 136.54008, 0.04766, 0.59077, 
28.97255, 0.38608, 115.05258, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
x1 = structure(c(4L, 2L, 3L, 1L, 4L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 4L, 2L, 1L, 4L, 2L, 1L, 4L, 2L, 
1L, 2L, 4L, 1L, 4L, 2L, 1L, 4L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L), .Label = c("All greenhouse gases - (CO2 equivalent)", 
"CH4", "CO2", "N2O"), class = "factor"), 
x2 = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Austria",         
class = "factor"), 
x4 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 
4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 
10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 14L, 14L, 
15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L), .Label = c("3", 
"3.1", "3.A", "3.A.1", "3.A.2", "3.A.3", "3.A.4", "3.B", 
"3.B.1", "3.B.2", "3.B.3", "3.B.4", "3.B.5", "3.C", "3.C.1", 
"3.C.2", "3.C.3", "3.C.4"), class = "factor")), class = "data.frame",     
row.names = c(NA, 
-44L))

I want to know whether the sum of the subcategories in x4 (e.g. 3.B.1 + 3.B.2 + ... + 3.B.n) equals the figure stated for the parent category (e.g. 3.B), i.e. the sum stated in the csv, for a given year and country. I want to verify the sums.

To get the sum of the subcategories I have this:

sum(df$y[df$x4 %in% c("3.A.1", "3.A.2", "3.A.3", "3.A.4") &
         df$x == 2010 & df$x2 == "Austria"])

To get the sum of the parent category I have this:

sum(df$y[df$x4 %in% c("3.A") & df$x == 2010 & df$x2 == "Austria"])

Next I would need an operation which checks whether the results of both pieces of code are equal (TRUE/FALSE). However, I have more than 20 countries, 20 years, and dozens of categories to check. With my newbie approach I would be writing code for ages...

Is there any way to automate this? Basically, I am looking for code which is able to do the following:

1) Run for one category, go to the next one
2) Once done with the categories, change the year and start again with the categories
3) ... same for countries ...

Any sort of help would be appreciated, even a suggestion on how to use the right jargon in the title. Thanks in any case.

Here's a potential solution using dplyr (might require some tweaking based on the full dataset):

library(dplyr)
# Create two columns: `parent` holds the parent category code, and `type` says
# whether the row is a parent or a child. Note that the regex makes some
# assumptions about the format of your data (single-character code segments).
df %>%
  mutate(parent = gsub("(.?\\..?)\\..*", "\\1", x4),
         type = ifelse(parent == x4, "Parent", "Child")) %>%
  # Sum y by category, type, year and country
  group_by(parent, type, x, x2) %>%
  summarize(sum(y)) %>%
  # Put the Child and Parent sums side by side
  tidyr::spread(type, `sum(y)`) %>%
  # See if the sum of the children is equal to the parent y, one row at a time
  group_by(parent, x, x2) %>%
  mutate(equals = isTRUE(all.equal(Child, Parent))) %>%
  ungroup()

Result using your (new) data:

  parent     x x2      Child Parent equals
  <chr>  <int> <fct>   <dbl>  <dbl> <lgl> 
1 3       2010 Austria   NA   7396. FALSE 
2 3.1     2010 Austria   NA   5278. FALSE 
3 3.A     2010 Austria 4357.  4357. TRUE  
4 3.B     2010 Austria  921.   921. TRUE  
5 3.C     2010 Austria    0      0  TRUE 
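
A side note on the `equals` column: the sums are floating-point numbers, so comparing them with `==` can report a mismatch that is only rounding noise. A minimal illustration in base R, using made-up numbers rather than your data (the variable names are just for this example):

```r
# Made-up values, only to illustrate the comparison
child_sum  <- 0.1 + 0.2
parent_sum <- 0.3

# Exact comparison: FALSE, because the two doubles differ in their last bits
exact_match <- child_sum == parent_sum

# Tolerant comparison: TRUE, the difference is within all.equal()'s tolerance
close_match <- isTRUE(all.equal(child_sum, parent_sum))

# all.equal() returns a character description (not FALSE) when values differ,
# which is why it is wrapped in isTRUE(). A missing child sum gives FALSE.
na_match <- isTRUE(all.equal(NA_real_, parent_sum))
```

This is also why the rows with no children (Child is NA) show FALSE rather than NA.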

I can see from your new data that you have two levels of parents. My solution will only work for the second level (e.g. 3.1 and its children), but can be easily tweaked to also work for the top level.
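
One possible way to do that tweak, sketched under assumptions: the category codes alone do not say that 3.1 is a parent of 3.A and 3.B, so this version uses a small hand-written lookup table instead of a regex. `totals`, `hierarchy` and `check_parents` are hypothetical names, the numbers in `totals` are per-category sums typed in from the excerpt, and the relationship 3.1 = 3.A + 3.B is only my reading of your numbers (5278 ≈ 4357 + 921), not something stated in the csv.

```r
# Per-category totals for Austria, 2010, summed by hand from the excerpt
totals <- data.frame(
  x  = 2010L,
  x2 = "Austria",
  x4 = c("3.1", "3.A", "3.B", "3.C"),
  y  = c(5278.06493, 4357.23193, 920.83298, 0),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: which categories add up to which parent (an assumption)
hierarchy <- data.frame(
  parent = c("3.1", "3.1"),
  child  = c("3.A", "3.B"),
  stringsAsFactors = FALSE
)

check_parents <- function(totals, hierarchy) {
  # Attach each child's total to its parent, then sum per parent, year, country
  kids <- merge(hierarchy, totals, by.x = "child", by.y = "x4")
  child_sums <- aggregate(y ~ parent + x + x2, data = kids, FUN = sum)
  names(child_sums)[names(child_sums) == "y"] <- "Child"

  # Attach the total stated for the parent itself
  out <- merge(child_sums, totals,
               by.x = c("parent", "x", "x2"), by.y = c("x4", "x", "x2"))
  names(out)[names(out) == "y"] <- "Parent"

  # Compare row by row, with all.equal()'s default tolerance
  out$equals <- mapply(function(a, b) isTRUE(all.equal(a, b)),
                       out$Child, out$Parent)
  out
}

result <- check_parents(totals, hierarchy)
```

With the full data, `totals` would be the per-category sums for every year and country, and `hierarchy` would list every parent with its children. In this excerpt the two sums for 3.1 differ by about 0.00002, which an exact `==` would flag but the tolerant comparison accepts.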
