简体   繁体   中英

How to clean up data in R?

Currently, my data is in the following format:

Var   year   Co-1      Co-2     Co-3  ...
A     2018     a         j        .  
A     2017     b         k        .
A     2016     c         l        .
B     2018     d         m        .
B     2017     e         n        .
B     2016     f         o        .
C     2018     g         p        .
C     2017     h         q        .
C     2016     i         r        .
.       .      .         .        .
.       .      .         .
.       .      .         .

I want to transform it to the following format:

Company   year    A       B       C
Co-1      2018    a       d       g
Co-1      2017    b       e       h
Co-1      2016    c       f       i
Co-2      2018    j       m       p 
Co-2      2017    k       n       q
Co-2      2016    l       o       r
Co-3      2018    .       .
Co-3      2017    .       .
Co-3      2016    .       .
.
.
.

Essentially the changes are:

  1. Inserting the company name multiple times in the first column, one for each year (2018,17,16)
  2. Making the variable in each column header be A, B, and C, rather than having multiple AAA,BBB,CCCs in the first column

By doing this, I want to be able to regress year vs. each of A, B, and C separately, while keeping the Company distinction for each data point, so I can group the data points by company in the finished graph.

Thank you so much!

Get the data in long format and then in wide but with different columns.

library(tidyr)

df %>%
  pivot_longer(cols = starts_with("Co")) %>%
  pivot_wider(names_from = Var, values_from = value)

# A tibble: 6 x 5
#   year name  A     B     C    
#  <int> <chr> <fct> <fct> <fct>
#1  2018 Co-1  a     d     g    
#2  2018 Co-2  j     m     p    
#3  2017 Co-1  b     e     h    
#4  2017 Co-2  k     n     q    
#5  2016 Co-1  c     f     i    
#6  2016 Co-2  l     o     r    

data

df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), year = c(2018L, 
2017L, 2016L, 2018L, 2017L, 2016L, 2018L, 2017L, 2016L), `Co-1` = 
structure(1:9, .Label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"), 
class = "factor"), `Co-2` = structure(1:9, .Label = c("j", 
"k", "l", "m", "n", "o", "p", "q", "r"), class = "factor")), class = "data.frame", 
row.names = c(NA, -9L))

I think it would be much easier to simply use factors here. Length of labels should be the number of unique companies, ie "Var" values

df$Var <- factor(df$Var, labels=c("Pepsi", "Coke", "Sprite"))
names(df) <- c("company", "year", LETTERS[seq(names(df)[-(1:2)])])

Or in a single step:

df <- setNames(transform(df, Var=factor(Var, labels=c("Pepsi", "Coke", "Sprite"))),
               c("company", "year", LETTERS[seq(names(df)[-(1:2)])]))
df
#   company year A B C
# 1   Pepsi 2018 a j .
# 2   Pepsi 2017 b k .
# 3   Pepsi 2016 c l .
# 4    Coke 2018 d m .
# 5    Coke 2017 e n .
# 6    Coke 2016 f o .
# 7  Sprite 2018 g p .
# 8  Sprite 2017 h q .
# 9  Sprite 2016 i r .

Also produces cleaner classes:

sapply(df, class)
# company        year           A           B           C 
# "factor"   "integer" "character" "character" "character" 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM