How to clean up data in R?
Currently, my data is in the following format:
Var year Co-1 Co-2 Co-3 ...
A 2018 a j .
A 2017 b k .
A 2016 c l .
B 2018 d m .
B 2017 e n .
B 2016 f o .
C 2018 g p .
C 2017 h q .
C 2016 i r .
. . . . .
. . . .
. . . .
I would like to convert it to the following format:
Company year A B C
Co-1 2018 a d g
Co-1 2017 b e h
Co-1 2016 c f i
Co-2 2018 j m p
Co-2 2017 k n q
Co-2 2016 l o r
Co-3 2018 . .
Co-3 2017 . .
Co-3 2016 . .
.
.
.
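Essentially, the changes are:
By doing this, I hope to be able to regress year against each of A, B, and C separately while keeping track of which company each data point belongs to, so that I can group the data points by company in the finished chart.
Thank you very much!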
Reshape the data into long format first, then back into wide format with a different set of columns.
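library(tidyr)

# Stack the company columns into name/value pairs (long format),
# then spread the Var levels (A, B, C) out into their own columns.
df %>%
  pivot_longer(cols = starts_with("Co")) %>%
  pivot_wider(names_from = Var, values_from = value)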
# A tibble: 6 x 5
# year name A B C
# <int> <chr> <fct> <fct> <fct>
#1 2018 Co-1 a d g
#2 2018 Co-2 j m p
#3 2017 Co-1 b e h
#4 2017 Co-2 k n q
#5 2016 Co-1 c f i
#6 2016 Co-2 l o r
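The output above keeps the default column name `name` and lists the rows year by year. A minimal sketch of a variant that lands closer to the layout requested in the question, assuming dplyr is also installed; the small `df` built here is only an illustrative stand-in with plain character columns, not the asker's real data:

```r
library(tidyr)
library(dplyr)

# Toy data mirroring the question's layout (two company columns only)
df <- data.frame(
  Var    = rep(c("A", "B", "C"), each = 3),
  year   = rep(2018:2016, times = 3),
  `Co-1` = letters[1:9],
  `Co-2` = letters[10:18],
  check.names = FALSE,
  stringsAsFactors = FALSE
)

res <- df %>%
  # names_to gives the stacked column headers a meaningful name
  pivot_longer(cols = starts_with("Co"), names_to = "Company") %>%
  pivot_wider(names_from = Var, values_from = value) %>%
  # put Company first and sort as in the requested table
  select(Company, year, A, B, C) %>%
  arrange(Company, desc(year))

res
```

Naming the column via `names_to` avoids a separate rename step; the `arrange()` call is purely cosmetic and can be dropped if row order does not matter.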
Data
df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), year = c(2018L,
2017L, 2016L, 2018L, 2017L, 2016L, 2018L, 2017L, 2016L), `Co-1` =
structure(1:9, .Label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
class = "factor"), `Co-2` = structure(1:9, .Label = c("j",
"k", "l", "m", "n", "o", "p", "q", "r"), class = "factor")), class = "data.frame",
row.names = c(NA, -9L))
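The question's end goal is a separate regression for each of A, B and C against year. A rough sketch of how that might look once the data is in the Company/year/A/B/C shape, assuming the real values are numeric (the letters in the example are placeholders) and that each variable is the response with year as the predictor. The numbers and the `wide` data frame below are made up purely for illustration:

```r
# Hypothetical numeric data already in the target (Company, year, A, B, C) shape
wide <- expand.grid(year = 2016:2018, Company = c("Co-1", "Co-2"),
                    stringsAsFactors = FALSE)
offset <- ifelse(wide$Company == "Co-2", 1, 0)  # made-up company effect
wide$A <-  2.0 * (wide$year - 2016) + offset
wide$B <- -1.0 * (wide$year - 2016) + offset
wide$C <-  0.5 * (wide$year - 2016)

# Fit one linear model per variable: A ~ year, B ~ year, C ~ year
fits <- lapply(c(A = "A", B = "B", C = "C"), function(v) {
  lm(reformulate("year", response = v), data = wide)
})

# Extract the slope on year from each fit
slopes <- sapply(fits, function(m) coef(m)[["year"]])
slopes
```

Because `Company` stays in the data frame as its own column, it remains available for grouping or colouring the points when plotting.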
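I think it would be much easier to simply use factors here. The number of labels should equal the number of unique companies, i.e. the number of unique "Var" values.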
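# Relabel the Var levels with company names (one label per unique Var value)
df$Var <- factor(df$Var, labels=c("Pepsi", "Coke", "Sprite"))
# Rename the remaining columns to A, B, C, ...; seq_along() also works with a single column
names(df) <- c("company", "year", LETTERS[seq_along(names(df)[-(1:2)])])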
Or in one step:
df <- setNames(transform(df, Var=factor(Var, labels=c("Pepsi", "Coke", "Sprite"))),
               c("company", "year", LETTERS[seq_along(names(df)[-(1:2)])]))
df
# company year A B C
# 1 Pepsi 2018 a j .
# 2 Pepsi 2017 b k .
# 3 Pepsi 2016 c l .
# 4 Coke 2018 d m .
# 5 Coke 2017 e n .
# 6 Coke 2016 f o .
# 7 Sprite 2018 g p .
# 8 Sprite 2017 h q .
# 9 Sprite 2016 i r .
This also yields cleaner classes:
sapply(df, class)
# company year A B C
# "factor" "integer" "character" "character" "character"