How to clean up data in R?

Currently, my data is in the following format:

Var   year   Co-1      Co-2     Co-3  ...
A     2018     a         j        .  
A     2017     b         k        .
A     2016     c         l        .
B     2018     d         m        .
B     2017     e         n        .
B     2016     f         o        .
C     2018     g         p        .
C     2017     h         q        .
C     2016     i         r        .
.       .      .         .        .
.       .      .         .
.       .      .         .

I want to transform it to the following format:

Company   year    A       B       C
Co-1      2018    a       d       g
Co-1      2017    b       e       h
Co-1      2016    c       f       i
Co-2      2018    j       m       p 
Co-2      2017    k       n       q
Co-2      2016    l       o       r
Co-3      2018    .       .
Co-3      2017    .       .
Co-3      2016    .       .
.
.
.

Essentially, the changes are:

  1. Inserting the company name multiple times in the first column, once for each year (2018, 2017, 2016)
  2. Making each variable (A, B, and C) a column header, rather than repeating A, B, and C down the first column

By doing this, I want to be able to regress year vs. each of A, B, and C separately while keeping the Company distinction for each data point, so that I can group the data points by company in the finished graph.

Thank you so much!

Reshape the data to long format first, then back to wide format, this time using the values of Var as the new columns.

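library(tidyr)

df %>%
  # stack the company columns (Co-1, Co-2, ...) into name/value pairs
  pivot_longer(cols = starts_with("Co")) %>%
  # spread the values of Var (A, B, C) out into their own columns
  pivot_wider(names_from = Var, values_from = value)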

# A tibble: 6 x 5
#   year name  A     B     C    
#  <int> <chr> <fct> <fct> <fct>
#1  2018 Co-1  a     d     g    
#2  2018 Co-2  j     m     p    
#3  2017 Co-1  b     e     h    
#4  2017 Co-2  k     n     q    
#5  2016 Co-1  c     f     i    
#6  2016 Co-2  l     o     r    
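
The result above keeps the default column name `name` and is ordered by year. As a minimal sketch of how it could be brought closer to the layout asked for in the question, the variant below names the column `Company` via `names_to` and sorts by company, then by descending year. The small `df` built here is only a stand-in for the question's data (plain character columns rather than factors), and loading dplyr for `select()`, `arrange()` and `desc()` is an assumption that goes beyond the original answer.

```r
library(tidyr)
library(dplyr)

# Stand-in for the question's data: three variables, three years, two companies
df <- data.frame(
  Var    = rep(c("A", "B", "C"), each = 3),
  year   = rep(c(2018L, 2017L, 2016L), times = 3),
  `Co-1` = letters[1:9],
  `Co-2` = letters[10:18],
  check.names = FALSE,       # keep the hyphenated column names as they are
  stringsAsFactors = FALSE
)

res <- df %>%
  # long format: one row per Var / year / company
  pivot_longer(cols = starts_with("Co"), names_to = "Company") %>%
  # wide format: one column per value of Var
  pivot_wider(names_from = Var, values_from = value) %>%
  # put Company first and order rows as in the desired output
  select(Company, year, A, B, C) %>%
  arrange(Company, desc(year))

res
```

Sorting is purely cosmetic here; the reshaping itself is the same two-step pivot as in the answer.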

Data

df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), year = c(2018L, 
2017L, 2016L, 2018L, 2017L, 2016L, 2018L, 2017L, 2016L), `Co-1` = 
structure(1:9, .Label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"), 
class = "factor"), `Co-2` = structure(1:9, .Label = c("j", 
"k", "l", "m", "n", "o", "p", "q", "r"), class = "factor")), class = "data.frame", 
row.names = c(NA, -9L))
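
For the regressions mentioned in the question, a rough sketch of what could follow the reshape is shown below. It assumes the real values are numeric (the letters in the example data are placeholders), so the numbers here are made up with `rnorm()`, and it assumes that each of A, B and C is modelled as a linear function of year. The names `num_df`, `wide` and `fits` are hypothetical.

```r
library(tidyr)

# Hypothetical numeric data in the question's original layout (values are made up)
set.seed(1)
num_df <- data.frame(
  Var    = rep(c("A", "B", "C"), each = 3),
  year   = rep(2018:2016, times = 3),
  `Co-1` = rnorm(9),
  `Co-2` = rnorm(9),
  check.names = FALSE
)

# Same long-then-wide reshape as above, keeping the company in its own column
wide <- num_df %>%
  pivot_longer(cols = starts_with("Co"), names_to = "Company") %>%
  pivot_wider(names_from = Var, values_from = value)

# One simple linear model per variable, e.g. A ~ year
vars <- c("A", "B", "C")
fits <- lapply(vars, function(v) {
  lm(reformulate("year", response = v), data = wide)
})
names(fits) <- vars

# Slope and intercept of each fit
lapply(fits, coef)
```

Because `Company` stays in `wide` as a regular column, it remains available for grouping or colouring points when plotting.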

I think it would be much easier to simply use factors here. The length of labels should equal the number of unique companies, i.e. the number of distinct "Var" values.

df$Var <- factor(df$Var, labels=c("Pepsi", "Coke", "Sprite"))
names(df) <- c("company", "year", LETTERS[seq_along(names(df)[-(1:2)])])

Or in a single step:

df <- setNames(transform(df, Var=factor(Var, labels=c("Pepsi", "Coke", "Sprite"))),
               c("company", "year", LETTERS[seq_along(names(df)[-(1:2)])]))
df
#   company year A B C
# 1   Pepsi 2018 a j .
# 2   Pepsi 2017 b k .
# 3   Pepsi 2016 c l .
# 4    Coke 2018 d m .
# 5    Coke 2017 e n .
# 6    Coke 2016 f o .
# 7  Sprite 2018 g p .
# 8  Sprite 2017 h q .
# 9  Sprite 2016 i r .

This also produces cleaner column classes:

sapply(df, class)
# company        year           A           B           C 
# "factor"   "integer" "character" "character" "character" 
