My data looks like this:
   Year Categories January   February March April      May      June      July    August September   October   November     December
 1 1990          A  4564.0   465465.0    12   468   4884.0  12788.00   4218.00 -58445.86 -90643.00 -122840.1 -155037.29 -187234.4286
 2 1990          B  6487.0   421214.0   878  2112 421283.0  56456.00  54654.00    515.00    212.00     515.0     212.00     515.0000
 3 1990          C 42862.0      512.0   484    48    515.0    212.00    515.00 137858.33     48.00  137858.3      48.00     465.0000
 4 1990          D    15.0  -169222.7    90   456 137858.3     48.00    465.00 135673.83    778.00  135673.8     778.00      12.0000
 5 1990          E 19164.0  -401699.2  -304   246 135673.8    778.00     12.00 133489.33     57.00  133489.3      57.00     478.0000
 6 1991          A 21436.8  -634175.7  -698    36 133489.3     57.00    478.00 131304.83      3.00  131304.8       3.00     331.3333
 7 1991          B 23709.6  -866652.2 -1092  -174 131304.8      3.00  -8210.60 129120.33  30425.33  129120.3  -11463.57     337.8333
11 1992          A 32800.8 -1796558.2 -2668 -1014 122566.8 -27597.89 -29087.86 292051.00  82253.33  331147.5  -12728.17     363.8333
12 1992          B 35073.6 -2029034.7 -3062 -1224 120382.3 -32976.00 -34307.17 321333.47  95210.33  367329.4  -14420.56     370.3333
13 1992          C 37346.4 -2261511.2 -3456 -1434 118197.8 -38354.11 -39526.49 350615.94 108167.33  403511.2  -16112.96     376.8333
I would like to reshape this data frame with the tidyverse as follows:
First, the number of categories is not the same in every year: 1990 has five categories, but 1991 has only two. Every category should appear in the result, even if some years do not contain it.
Second, the months should be laid out side by side as columns instead of being stacked year by year, in this order: Jan 90, Feb 90, ..., Dec 90, Jan 91, Feb 91, ..., Dec 91, Jan 92, ..., Dec 92 (these become the column names).
The Year column should be dropped, and only the unique categories should be shown in the leftmost column (under Categories). If a category has no data for a given month of a given year, that cell should contain 0.
I would like to do this with the tidyverse in R, but I could not work out the code. I would be grateful for any help.
This is the expected output. It is wrapped here, but as described above the month columns should sit side by side:
  Categories Jan.90    Feb.90 Mar.90 Apr.90   May.90 June.90 July.90    Aug.90 Sep.90    Oct.90    Nov.90    Dec.90  Jan.91    Feb.91 Mar.91
1          A   4564  465465.0     12    468   4884.0   12788    4218 -58445.86 -90643 -122840.1 -155037.3 -187234.4 21436.8 -634175.7   -698
2          B   6487  421214.0    878   2112 421283.0   56456   54654    515.00    212     515.0     212.0     515.0 23709.6 -866652.2  -1092
3          C  42862     512.0    484     48    515.0     212     515 137858.33     48  137858.3      48.0     465.0     0.0       0.0      0
4          D     15 -169222.7     90    456 137858.3      48     465 135673.83    778  135673.8     778.0      12.0     0.0       0.0      0
5          E  19164 -401699.2   -304    246 135673.8     778      12 133489.33     57  133489.3      57.0     478.0     0.0       0.0      0
  Apr.91   May.91 June.91 July.91   Aug.91   Sep.91   Oct.91    Nov.91   Dec.91  Jan.92   Feb.92 Mar.92 Apr.92   May.92   June.92   July.92
1     36 133489.3      57   478.0 131304.8     3.00 131304.8      3.00 331.3333 32800.8 -1796558  -2668  -1014 122566.8 -27597.89 -29087.86
2   -174 131304.8       3 -8210.6 129120.3 30425.33 129120.3 -11463.57 337.8333 35073.6 -2029035  -3062  -1224 120382.3 -32976.00 -34307.17
3      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000 37346.4 -2261511  -3456  -1434 118197.8 -38354.11 -39526.49
4      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000     0.0        0      0      0      0.0      0.00      0.00
5      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000     0.0        0      0      0      0.0      0.00      0.00
    Aug.92    Sep.92   Oct.92    Nov.92   Dec.92
1 292051.0  82253.33 331147.5 -12728.17 363.8333
2 321333.5  95210.33 367329.4 -14420.56 370.3333
3 350615.9 108167.33 403511.2 -16112.96 376.8333
4      0.0      0.00      0.0      0.00   0.0000
5      0.0      0.00      0.0      0.00   0.0000
You could first gather the data into long format, group_by Year, and complete the missing Categories. We then combine month and year into a single column using unite, and finally spread it to wide format, filling empty values with 0.
library(tidyverse)

df %>%
  gather(key, value, -Year, -Categories) %>%
  group_by(Year) %>%
  complete(Categories) %>%
  unite(MonthYear, key, Year) %>%
  spread(MonthYear, value, fill = 0)

# Categories April_1990 April_1991 April_1992 August_1990 ....
# <fct> <dbl> <dbl> <dbl> <dbl> ....
#1 A 468 36 -1014 -58446. ....
#2 B 2112 -174 -1224 515 ....
#3 C 48 0 -1434 137858. ....
#4 D 456 0 0 135674. ....
#5 E 246 0 0 133489. ....
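In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same reshape can be sketched with those functions. The data frame df_small below is an assumed two-month stand-in for the question's data (category C is missing in 1991), and the column name Month is my own choice, not something from the original answer.

```r
library(dplyr)
library(tidyr)

# Assumed sample: two months only, category C absent in 1991.
df_small <- tibble(
  Year       = c(1990L, 1990L, 1990L, 1991L, 1991L),
  Categories = c("A", "B", "C", "A", "B"),
  January    = c(4564, 6487, 42862, 21436.8, 23709.6),
  February   = c(465465, 421214, 512, -634175.7, -866652.2)
)

wide <- df_small %>%
  # Long format: one row per Year / Categories / Month.
  pivot_longer(cols = -c(Year, Categories),
               names_to = "Month", values_to = "value") %>%
  # Wide format: one column per Month_Year, missing cells filled with 0.
  pivot_wider(names_from  = c(Month, Year),
              values_from = value,
              values_fill = list(value = 0))

names(wide)
```

By default pivot_wider() orders the new columns by first appearance rather than alphabetically, so with data sorted by year the months should stay in calendar order without any extra step.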
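spread orders columns alphabetically when the key is a character vector, but follows the level order when the key is a factor. So if we want to keep the columns in their original order, one simple way is to convert MonthYear to a factor whose levels follow the order of appearance:

df %>%
  gather(key, value, -Year, -Categories) %>%
  group_by(Year) %>%
  complete(Categories) %>%
  unite(MonthYear, key, Year) %>%
  mutate(MonthYear = factor(MonthYear, levels = unique(MonthYear))) %>%
  spread(MonthYear, value, fill = 0)
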
# Categories January_1990 February_1990 March_1990 April_1990 ....
# <chr> <dbl> <dbl> <dbl> <dbl> ....
#1 A 4564 465465 12 468 ....
#2 B 6487 421214 878 2112 ....
#3 C 42862 512 484 48 ....
#4 D 15 -169223. 90 456 ....
#5 E 19164 -401699. -304 246 ....
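The question asks for short column names such as Jan.90 rather than January_1990. A minimal sketch of how such labels and their calendar-ordered factor levels could be built in base R follows. The helpers month_label and label_levels are hypothetical names introduced here, and the sketch uses the built-in month.abb, so it yields Jun and Jul where the expected output shows June and July.

```r
# Hypothetical helper: turn ("January", 1990) into "Jan.90".
month_label <- function(month, year) {
  paste0(month.abb[match(month, month.name)], ".",
         sprintf("%02d", as.integer(year) %% 100L))
}

# Hypothetical helper: every label for the given years, in calendar order
# (all twelve months of the first year, then the second year, and so on).
label_levels <- function(years) {
  yy <- sprintf("%02d", as.integer(years) %% 100L)
  as.vector(outer(month.abb, yy, paste, sep = "."))
}

month_label(c("January", "June"), c(1990, 1992))
label_levels(1990:1991)[1:13]
```

In the pipeline above, these could replace the unite step with something like mutate(MonthYear = factor(month_label(key, Year), levels = label_levels(sort(unique(Year))))) followed by dropping key and Year before spread.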
EDIT
As the OP mentioned in the comments, the real data produces a duplicate identifier error. To work around that, we can create a unique index within each MonthYear before spreading:

df %>%
  gather(key, value, -Year, -Categories) %>%
  group_by(Year) %>%
  complete(Categories) %>%
  unite(MonthYear, key, Year) %>%
  mutate(MonthYear = factor(MonthYear, levels = unique(MonthYear))) %>%
  group_by(MonthYear) %>%
  mutate(i = row_number()) %>%
  spread(MonthYear, value) %>%
  ungroup() %>%
  select(-i)

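A duplicate identifier error from spread usually means that some Year and Categories pair occurs in more than one row. Before adding an index, it may help to look at those rows directly. The sketch below uses an assumed toy data frame (df_dup) in which category A is recorded twice for 1990; summing the duplicates is only one possible resolution and depends on what the real data means.

```r
library(dplyr)

# Assumed toy data: category A appears twice for 1990.
df_dup <- tibble(
  Year       = c(1990L, 1990L, 1991L),
  Categories = c("A", "A", "A"),
  January    = c(1, 2, 3)
)

# Year / Categories pairs that occur more than once.
dups <- df_dup %>%
  count(Year, Categories) %>%
  filter(n > 1)

# One possible resolution (an assumption, not part of the original answer):
# add duplicated rows together so that each pair occurs exactly once.
df_unique <- df_dup %>%
  group_by(Year, Categories) %>%
  summarise(January = sum(January)) %>%
  ungroup()

dups
df_unique
```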
How about gathering, then pasting the year and month together, then spreading? I use an absurd workaround to keep the columns in the correct order: a sortable prefix is added to each column name and stripped again at the end. Try:

library(dplyr)
library(tidyr)

df %>%
  gather(k, v, -Year, -Categories) %>%
  arrange(Categories, Year) %>%
  group_by(Categories) %>%
  # Sortable prefix ("n1001", "n1002", ...) followed by a label such as "Jan.90"
  mutate(n = row_number(),
         col = paste0("n", 1000 + n, substr(k, 1, 3), ".", substr(Year, 3, 4))) %>%
  ungroup() %>%
  arrange(col) %>%
  select(-Year, -k, -n) %>%
  spread(col, v, fill = 0) %>%
  # Strip the five-character prefix again
  rename_at(vars(-Categories), ~substr(., 6, nchar(.)))

Result
# A tibble: 5 x 49
Categories Jan.90 Feb.90 Mar.90 Apr.90 May.90 Jun.90 Jul.90 Aug.90 Sep.90 Oct.90 Nov.90 Dec.90 Jan.91 Jan.92 Feb.91 Feb.92 Mar.91 Mar.92 Apr.91 Apr.92 May.91
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 4564 4.65e5 12 468 4.88e3 12788 4218 -58446. -90643 -1.23e5 -1.55e5 -1.87e5 21437. 0 -6.34e5 0. -698 0 36 0 1.33e5
2 B 6487 4.21e5 878 2112 4.21e5 56456 54654 515 212 5.15e2 2.12e2 5.15e2 23710. 0 -8.67e5 0. -1092 0 -174 0 1.31e5
3 C 42862 5.12e2 484 48 5.15e2 212 515 137858. 48 1.38e5 4.80e1 4.65e2 0 37346. 0. -2.26e6 0 -3456 0 -1434 0.
4 D 15 -1.69e5 90 456 1.38e5 48 465 135674. 778 1.36e5 7.78e2 1.20e1 0 0 0. 0. 0 0 0 0 0.
5 E 19164 -4.02e5 -304 246 1.36e5 778 12 133489. 57 1.33e5 5.70e1 4.78e2 0 0 0. 0. 0 0 0 0 0.
# … with 27 more variables: May.92 <dbl>, Jun.91 <dbl>, Jun.92 <dbl>, Jul.91 <dbl>, Jul.92 <dbl>, Aug.91 <dbl>, Aug.92 <dbl>, Sep.91 <dbl>, Sep.92 <dbl>, Oct.91 <dbl>,
# Oct.92 <dbl>, Nov.91 <dbl>, Nov.92 <dbl>, Dec.91 <dbl>, Dec.92 <dbl>, Jan.92 <dbl>, Feb.92 <dbl>, Mar.92 <dbl>, Apr.92 <dbl>, May.92 <dbl>, Jun.92 <dbl>, Jul.92 <dbl>,
# Aug.92 <dbl>, Sep.92 <dbl>, Oct.92 <dbl>, Nov.92 <dbl>, Dec.92 <dbl>
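In the result above, the 1991 and 1992 columns are interleaved and some 1992 names appear twice. The row-number prefix counts rows within each category, so a category that skips a year (C has no 1991 rows) gets a different prefix for the same month. A sketch of one way around this is to build the prefix from the year and the month number instead. The data frame df_small is an assumed two-month sample, and the six-character prefix is my own variation on the workaround, not the answerer's code.

```r
library(dplyr)
library(tidyr)

# Assumed sample: category C has no rows for 1991.
df_small <- tibble(
  Year       = c(1990L, 1990L, 1991L, 1992L, 1992L),
  Categories = c("A", "C", "A", "A", "C"),
  January    = c(1, 2, 3, 4, 5),
  February   = c(10, 20, 30, 40, 50)
)

res <- df_small %>%
  gather(k, v, -Year, -Categories) %>%
  # Prefix = year + two-digit month number, e.g. "199001Jan.90",
  # so it is identical for every category and sorts chronologically.
  mutate(col = paste0(Year, sprintf("%02d", match(k, month.name)),
                      substr(k, 1, 3), ".", substr(Year, 3, 4))) %>%
  select(-Year, -k) %>%
  spread(col, v, fill = 0) %>%
  # Strip the six-character prefix again.
  rename_at(vars(-Categories), ~ substr(., 7, nchar(.)))

names(res)
```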
Data
df <- structure(list(Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L,
1991L, 1992L, 1992L, 1992L), Categories = c("A", "B", "C", "D",
"E", "A", "B", "A", "B", "C"), January = c(4564, 6487, 42862,
15, 19164, 21436.8, 23709.6, 32800.8, 35073.6, 37346.4), February = c(465465,
421214, 512, -169222.7, -401699.2, -634175.7, -866652.2, -1796558.2,
-2029034.7, -2261511.2), March = c(12L, 878L, 484L, 90L, -304L,
-698L, -1092L, -2668L, -3062L, -3456L), April = c(468L, 2112L,
48L, 456L, 246L, 36L, -174L, -1014L, -1224L, -1434L), May = c(4884,
421283, 515, 137858.3, 135673.8, 133489.3, 131304.8, 122566.8,
120382.3, 118197.8), June = c(12788, 56456, 212, 48, 778, 57,
3, -27597.89, -32976, -38354.11), July = c(4218, 54654, 515,
465, 12, 478, -8210.6, -29087.86, -34307.17, -39526.49), August = c(-58445.86,
515, 137858.33, 135673.83, 133489.33, 131304.83, 129120.33, 292051,
321333.47, 350615.94), September = c(-90643, 212, 48, 778, 57,
3, 30425.33, 82253.33, 95210.33, 108167.33), October = c(-122840.1,
515, 137858.3, 135673.8, 133489.3, 131304.8, 129120.3, 331147.5,
367329.4, 403511.2), November = c(-155037.29, 212, 48, 778, 57,
3, -11463.57, -12728.17, -14420.56, -16112.96), December = c(-187234.4286,
515, 465, 12, 478, 331.3333, 337.8333, 363.8333, 370.3333, 376.8333
)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))