How to manipulate this data frame using tidyverse

My data looks like this:

Year Categories January   February March April      May      June      July    August September   October   November     December
1  1990          A  4564.0   465465.0    12   468   4884.0  12788.00   4218.00 -58445.86 -90643.00 -122840.1 -155037.29 -187234.4286
2  1990          B  6487.0   421214.0   878  2112 421283.0  56456.00  54654.00    515.00    212.00     515.0     212.00     515.0000
3  1990          C 42862.0      512.0   484    48    515.0    212.00    515.00 137858.33     48.00  137858.3      48.00     465.0000
4  1990          D    15.0  -169222.7    90   456 137858.3     48.00    465.00 135673.83    778.00  135673.8     778.00      12.0000
5  1990          E 19164.0  -401699.2  -304   246 135673.8    778.00     12.00 133489.33     57.00  133489.3      57.00     478.0000
6  1991          A 21436.8  -634175.7  -698    36 133489.3     57.00    478.00 131304.83      3.00  131304.8       3.00     331.3333
7  1991          B 23709.6  -866652.2 -1092  -174 131304.8      3.00  -8210.60 129120.33  30425.33  129120.3  -11463.57     337.8333
11 1992          A 32800.8 -1796558.2 -2668 -1014 122566.8 -27597.89 -29087.86 292051.00  82253.33  331147.5  -12728.17     363.8333
12 1992          B 35073.6 -2029034.7 -3062 -1224 120382.3 -32976.00 -34307.17 321333.47  95210.33  367329.4  -14420.56     370.3333
13 1992          C 37346.4 -2261511.2 -3456 -1434 118197.8 -38354.11 -39526.49 350615.94 108167.33  403511.2  -16112.96     376.8333

I would like to manipulate this data frame using tidyverse as follows:

First, the number of categories is not the same for every year. Every category should appear for every year, even if a given year has no data for it. As you can see, 1990 has 5 categories but 1991 has only 2.

Second, the monthly data should be laid out side by side instead of row by row, in this order: Jan 90, Feb 90, ..., Dec 90, Jan 91, Feb 91, ..., Dec 91, Jan 92, ..., Dec 92 (these will be the column names).

The Year column should be dropped, and only the unique categories should be shown in the leftmost column (under Categories). If a category has no data for a given month of a given year, that cell can be 0.

I would like to use tidyverse in R for this, but I could not work out the code. I would be grateful for any help.

This is the expected version of the data; as I said, the months should be placed side by side:

Categories Jan.90    Feb.90 Mar.90 Apr.90   May.90 June.90 July.90    Aug.90 Sep.90    Oct.90    Nov.90    Dec.90  Jan.91    Feb.91 Mar.91
1          A   4564  465465.0     12    468   4884.0   12788    4218 -58445.86 -90643 -122840.1 -155037.3 -187234.4 21436.8 -634175.7   -698
2          B   6487  421214.0    878   2112 421283.0   56456   54654    515.00    212     515.0     212.0     515.0 23709.6 -866652.2  -1092
3          C  42862     512.0    484     48    515.0     212     515 137858.33     48  137858.3      48.0     465.0     0.0       0.0      0
4          D     15 -169222.7     90    456 137858.3      48     465 135673.83    778  135673.8     778.0      12.0     0.0       0.0      0
5          E  19164 -401699.2   -304    246 135673.8     778      12 133489.33     57  133489.3      57.0     478.0     0.0       0.0      0
  Apr.91   May.91 June.91 July.91   Aug.91   Sep.91   Oct.91    Nov.91   Dec.91  Jan.92   Feb.92 Mar.92 Apr.92   May.92   June.92   July.92
1     36 133489.3      57   478.0 131304.8     3.00 131304.8      3.00 331.3333 32800.8 -1796558  -2668  -1014 122566.8 -27597.89 -29087.86
2   -174 131304.8       3 -8210.6 129120.3 30425.33 129120.3 -11463.57 337.8333 35073.6 -2029035  -3062  -1224 120382.3 -32976.00 -34307.17
3      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000 37346.4 -2261511  -3456  -1434 118197.8 -38354.11 -39526.49
4      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000     0.0        0      0      0      0.0      0.00      0.00
5      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000     0.0        0      0      0      0.0      0.00      0.00
    Aug.92    Sep.92   Oct.92    Nov.92   Dec.92
1 292051.0  82253.33 331147.5 -12728.17 363.8333
2 321333.5  95210.33 367329.4 -14420.56 370.3333
3 350615.9 108167.33 403511.2 -16112.96 376.8333
4      0.0      0.00      0.0      0.00   0.0000
5      0.0      0.00      0.0      0.00   0.0000

You could first gather the data into long format, group_by Year and complete the missing Categories. We then combine each month and year using unite, and finally spread the result to wide format, filling empty values with 0.

library(tidyverse)

df %>%
  gather(key, value, -Year, -Categories) %>%
  group_by(Year) %>%
  complete(Categories) %>%
  unite(MonthYear, key, Year) %>%
  spread(MonthYear, value, fill = 0)

#  Categories April_1990 April_1991 April_1992 August_1990 ....
#  <fct>           <dbl>      <dbl>      <dbl>       <dbl> ....
#1 A                 468         36      -1014     -58446. ....
#2 B                2112       -174      -1224        515  ....
#3 C                  48          0      -1434     137858. ....
#4 D                 456          0          0     135674. ....
#5 E                 246          0          0     133489. ....

If we want to maintain the order of the columns, one simple way is to convert MonthYear to a factor whose levels follow the order of appearance:

df %>%
   gather(key, value, -Year, -Categories) %>%
   group_by(Year) %>%
   complete(Categories) %>%
   unite(MonthYear, key, Year) %>%
   mutate(MonthYear = factor(MonthYear, levels = unique(MonthYear))) %>%
   spread(MonthYear, value, fill = 0)


#  Categories January_1990 February_1990 March_1990 April_1990 ....
#  <chr>             <dbl>         <dbl>      <dbl>      <dbl> ....
#1 A                  4564       465465          12        468 ....
#2 B                  6487       421214         878       2112 ....
#3 C                 42862          512         484         48 ....
#4 D                    15      -169223.         90        456 ....
#5 E                 19164      -401699.       -304        246 ....
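
As a side note, gather() and spread() are superseded in tidyr 1.0.0 and later by pivot_longer() and pivot_wider(). The sketch below is one possible equivalent, not the code from the answer above. It runs on a small made-up subset (df_small is a hypothetical name) and relies on two assumptions: the month columns already appear in calendar order, and the rows can be sorted by Year. pivot_wider() creates columns in order of first appearance, so no factor conversion is needed here.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the question's data: category C has no 1991 row.
df_small <- tibble(
  Year       = c(1990L, 1990L, 1991L),
  Categories = c("A", "C", "A"),
  January    = c(4564, 42862, 21436.8),
  February   = c(465465, 512, -634175.7)
)

wide <- df_small %>%
  arrange(Year) %>%                        # earlier years first
  pivot_longer(-c(Year, Categories),       # one row per Year/Category/month
               names_to = "month", values_to = "value") %>%
  pivot_wider(names_from = c(month, Year), # column names such as January_1990
              values_from = value,
              values_fill = list(value = 0))  # absent combinations become 0

names(wide)
```

Because the Year column is consumed by names_from, the result keeps only Categories plus one column per month and year.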

EDIT

As the OP mentioned in the comments, the real data produces a duplicate-identifier error. To work around that, we can create a unique index within each MonthYear before spreading:

df %>%
  gather(key, value, -Year, -Categories) %>%
  group_by(Year) %>%
  complete(Categories) %>%
  unite(MonthYear, key, Year) %>%
  mutate(MonthYear = factor(MonthYear, levels = unique(MonthYear))) %>%
  group_by(MonthYear) %>%
  mutate(i = row_number()) %>%
  spread(MonthYear, value) %>%
  ungroup() %>%
  select(-i)
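
A duplicate-identifier error usually means that some Year and Categories pair occurs on more than one row. The real data is not shown, so the following is only a sketch on made-up data (df_dup is a hypothetical name). It first lists the repeated pairs, then collapses them with values_fn in pivot_wider(). Summing the duplicates is an assumption; whether that is the right aggregation depends on what the repeated rows mean.

```r
library(dplyr)
library(tidyr)

# Hypothetical data in which the pair (1990, "A") appears twice.
df_dup <- tibble(
  Year       = c(1990L, 1990L, 1991L),
  Categories = c("A", "A", "B"),
  January    = c(1, 2, 3),
  February   = c(10, 20, 30)
)

# Step 1: list the Year/Categories pairs that occur more than once.
dups <- df_dup %>%
  count(Year, Categories) %>%
  filter(n > 1)

# Step 2: reshape, summing the values that share a cell (assumed aggregation).
summed <- df_dup %>%
  pivot_longer(-c(Year, Categories), names_to = "month", values_to = "value") %>%
  unite(MonthYear, month, Year) %>%
  pivot_wider(names_from = MonthYear, values_from = value,
              values_fn = list(value = sum),
              values_fill = list(value = 0))
```

Inspecting dups before aggregating shows whether the repeats are genuine data or an upstream data-entry problem.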

How about gathering, then pasting the year and month together, then spreading? I use an absurd workaround to keep the columns in the correct order. Try:

library(dplyr)
library(tidyr)

df %>% 
  gather(k, v, -Year, -Categories) %>% 
  arrange(Categories, Year) %>% 
  group_by(Categories) %>% 
  mutate(n = row_number(),
         col = paste0("n", 1000+n, substr(k, 1, 3), ".", substr(Year, 3, 4))) %>% 
  ungroup() %>% 
  arrange(col) %>% 
  select(-Year, -k, -n) %>% 
  spread(col, v, fill = 0) %>% 
  rename_at(vars(-Categories), ~substr(., 6, nchar(.)))

Result

# A tibble: 5 x 49
  Categories Jan.90  Feb.90 Mar.90 Apr.90 May.90 Jun.90 Jul.90  Aug.90 Sep.90  Oct.90  Nov.90  Dec.90 Jan.91 Jan.92  Feb.91  Feb.92 Mar.91 Mar.92 Apr.91 Apr.92 May.91
  <chr>       <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 A            4564  4.65e5     12    468 4.88e3  12788   4218 -58446. -90643 -1.23e5 -1.55e5 -1.87e5 21437.     0  -6.34e5  0.       -698      0     36      0 1.33e5
2 B            6487  4.21e5    878   2112 4.21e5  56456  54654    515     212  5.15e2  2.12e2  5.15e2 23710.     0  -8.67e5  0.      -1092      0   -174      0 1.31e5
3 C           42862  5.12e2    484     48 5.15e2    212    515 137858.     48  1.38e5  4.80e1  4.65e2     0  37346.  0.     -2.26e6      0  -3456      0  -1434 0.    
4 D              15 -1.69e5     90    456 1.38e5     48    465 135674.    778  1.36e5  7.78e2  1.20e1     0      0   0.      0.          0      0      0      0 0.    
5 E           19164 -4.02e5   -304    246 1.36e5    778     12 133489.     57  1.33e5  5.70e1  4.78e2     0      0   0.      0.          0      0      0      0 0.    
# … with 27 more variables: May.92 <dbl>, Jun.91 <dbl>, Jun.92 <dbl>, Jul.91 <dbl>, Jul.92 <dbl>, Aug.91 <dbl>, Aug.92 <dbl>, Sep.91 <dbl>, Sep.92 <dbl>, Oct.91 <dbl>,
#   Oct.92 <dbl>, Nov.91 <dbl>, Nov.92 <dbl>, Dec.91 <dbl>, Dec.92 <dbl>, Jan.92 <dbl>, Feb.92 <dbl>, Mar.92 <dbl>, Apr.92 <dbl>, May.92 <dbl>, Jun.92 <dbl>, Jul.92 <dbl>,
#   Aug.92 <dbl>, Sep.92 <dbl>, Oct.92 <dbl>, Nov.92 <dbl>, Dec.92 <dbl>

Data

df <- structure(list(
  Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1992L, 1992L, 1992L),
  Categories = c("A", "B", "C", "D", "E", "A", "B", "A", "B", "C"),
  January = c(4564, 6487, 42862, 15, 19164, 21436.8, 23709.6, 32800.8, 35073.6, 37346.4),
  February = c(465465, 421214, 512, -169222.7, -401699.2, -634175.7, -866652.2,
               -1796558.2, -2029034.7, -2261511.2),
  March = c(12L, 878L, 484L, 90L, -304L, -698L, -1092L, -2668L, -3062L, -3456L),
  April = c(468L, 2112L, 48L, 456L, 246L, 36L, -174L, -1014L, -1224L, -1434L),
  May = c(4884, 421283, 515, 137858.3, 135673.8, 133489.3, 131304.8, 122566.8,
          120382.3, 118197.8),
  June = c(12788, 56456, 212, 48, 778, 57, 3, -27597.89, -32976, -38354.11),
  July = c(4218, 54654, 515, 465, 12, 478, -8210.6, -29087.86, -34307.17, -39526.49),
  August = c(-58445.86, 515, 137858.33, 135673.83, 133489.33, 131304.83, 129120.33,
             292051, 321333.47, 350615.94),
  September = c(-90643, 212, 48, 778, 57, 3, 30425.33, 82253.33, 95210.33, 108167.33),
  October = c(-122840.1, 515, 137858.3, 135673.8, 133489.3, 131304.8, 129120.3,
              331147.5, 367329.4, 403511.2),
  November = c(-155037.29, 212, 48, 778, 57, 3, -11463.57, -12728.17, -14420.56,
               -16112.96),
  December = c(-187234.4286, 515, 465, 12, 478, 331.3333, 337.8333, 363.8333,
               370.3333, 376.8333)
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))


 