How to tidy this messy dataset in R
I'm pretty new to using tidyr, dplyr, etc., and I have some data I can't figure out how to tidy in R.
Variables are mixed up in rows and columns, and the spreadsheet looks like it's split up, so there are different kinds of information in the top rows and the bottom rows.
A simplified version of it is below.
You can imagine this is an exam with 4 questions: the top rows describe each question, and the bottom rows record whether each student (IDNum) got the questions correct (1) or wrong (0). Here is the raw data:
Question Q1 Q2 Q3 Q4
Topic English English Math Math
Subtopic Grammar Vocabulary Algebra Geometry
Difficulty 2 4 3 4
IDNum
512 1 1 1 0
102 0 1 0 1
321 1 1 1 1
246 1 1 0 1
248 1 0 1 0
136 1 1 1 1
290 0 1 1 1
753 1 0 0 0
752 1 0 1 1
I'd like to tidy this data set. It would look something like the following:
IDNum Question Topic Subtopic Difficulty Correct
512 Q1 English Grammar 2 1
512 Q2 English Vocabulary 4 1
512 Q3 Math Algebra 3 1
512 Q4 Math Geometry 4 0
102 Q1 English Grammar 2 0
102 Q2 English Vocabulary 4 1
102 Q3 Math Algebra 3 0
102 Q4 Math Geometry 4 1
321 Q1 English Grammar 2 1
321 Q2 English Vocabulary 4 1
321 Q3 Math Algebra 3 1
321 Q4 Math Geometry 4 1
and so on.
Thank you!
It's not entirely clear which format you have the data in, but hopefully the following will help:
data
df <- read.table(text="
Question Q1 Q2 Q3 Q4
Topic English English Math Math
Subtopic Grammar Vocabulary Algebra Geometry
Difficulty 2 4 3 4
IDNum '' '' '' ''
512 1 1 1 0
102 0 1 0 1
321 1 1 1 1
246 1 1 0 1
248 1 0 1 0
136 1 1 1 1
290 0 1 1 1
753 1 0 0 0
752 1 0 1 1", header = FALSE, stringsAsFactors = FALSE)
solution
library(tidyverse)
df %>%
# collapse the first rows into column names to prepare for gather/separate combo
setNames(apply(.[1:4,],2,paste,collapse="|")) %>%
rename_at(1,~"IDNum") %>%
# remove useless rows
slice(-(1:5)) %>%
# change IDNum to factor, only useful if the order of IDNum is important (probably it's not)
mutate_at("IDNum",~factor(.x,levels=unique(.x))) %>%
# wide to long
gather(key,correct,-1) %>%
# build your columns (convert = TRUE so Difficulty is parsed as integer)
separate(key,df[1:4,1],convert = TRUE) %>%
# convert correct to numeric
mutate_at("correct",as.numeric) %>%
# sort
arrange(IDNum)
# # A tibble: 36 x 6
# IDNum Question Topic Subtopic Difficulty correct
# <fctr> <chr> <chr> <chr> <int> <dbl>
# 1 512 Q1 English Grammar 2 1
# 2 512 Q2 English Vocabulary 4 1
# 3 512 Q3 Math Algebra 3 1
# 4 512 Q4 Math Geometry 4 0
# 5 102 Q1 English Grammar 2 0
# 6 102 Q2 English Vocabulary 4 1
# 7 102 Q3 Math Algebra 3 0
# 8 102 Q4 Math Geometry 4 1
# 9 321 Q1 English Grammar 2 1
# 10 321 Q2 English Vocabulary 4 1
# # ... with 26 more rows
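The same collapse-then-split idea can also be sketched with pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later. This is only a sketch under the assumption that the data was read as in the "data" step above; the names meta_names, combined, core and tidy are illustrative and not part of the original answer. Because pivot_longer() keeps the rows in their original order, the factor conversion and the final arrange() are not needed here.

```r
library(dplyr)
library(tidyr)

# Same data as in the "data" step above (with '' placeholders on the IDNum line)
df <- read.table(text = "Question Q1 Q2 Q3 Q4
Topic English English Math Math
Subtopic Grammar Vocabulary Algebra Geometry
Difficulty 2 4 3 4
IDNum '' '' '' ''
512 1 1 1 0
102 0 1 0 1
321 1 1 1 1
246 1 1 0 1
248 1 0 1 0
136 1 1 1 1
290 0 1 1 1
753 1 0 0 0
752 1 0 1 1", header = FALSE, stringsAsFactors = FALSE)

# Names of the four header variables (hypothetical helper name)
meta_names <- df[1:4, 1]

# One combined name per question column, e.g. "Q1|English|Grammar|2"
combined <- apply(df[1:4, -1], 2, paste, collapse = "|")

# Keep only the student rows and label the columns
core <- df[-(1:5), ]
names(core) <- c("IDNum", combined)

tidy <- core %>%
  # wide to long, splitting the combined names back into four columns
  pivot_longer(-IDNum,
               names_to = meta_names,
               names_sep = "\\|",
               values_to = "Correct") %>%
  # everything was read as character, so convert the numeric columns
  mutate(IDNum = as.integer(IDNum),
         Difficulty = as.integer(Difficulty),
         Correct = as.integer(Correct))
```

Printing tidy should give one row per student and question, with the columns IDNum, Question, Topic, Subtopic, Difficulty and Correct.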
Another way, with a few more steps but maybe more intuitive, is to separate the header from the core of the table right from the start.
We create a lookup from the transposed header, and we'll use it on the gathered data later:
header_lkp <-
as_tibble(t(df[1:4,])) %>%
setNames(.[1,]) %>%
slice(-1)
df_core <-
df %>%
setNames(.[1,]) %>%
slice(-(1:5)) %>%
rename_at(1,~"IDNum") %>%
mutate_at("IDNum",~factor(.x,levels=unique(.x)))
df_core %>%
gather(Question,correct,-IDNum) %>%
mutate_at("correct",as.numeric) %>%
left_join(header_lkp,by="Question") %>%
arrange(IDNum)
(same output)
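If the data sits in a plain text file laid out exactly as in the question (an assumption, since the real format is unknown), the header and the core can also be read in two separate passes, so the lone IDNum line never needs '' placeholders. The sketch below uses an inline string raw and only the first three students for brevity; raw, header, body and tidy are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Stand-in for the file contents (assumed layout, first three students only)
raw <- "Question Q1 Q2 Q3 Q4
Topic English English Math Math
Subtopic Grammar Vocabulary Algebra Geometry
Difficulty 2 4 3 4
IDNum
512 1 1 1 0
102 0 1 0 1
321 1 1 1 1"

# Pass 1: read only the four header rows
header <- read.table(text = raw, nrows = 4, header = FALSE,
                     stringsAsFactors = FALSE)

# Transpose the header into a lookup with one row per question
header_lkp <- as.data.frame(t(header[, -1]), stringsAsFactors = FALSE)
names(header_lkp) <- header$V1
header_lkp$Difficulty <- as.integer(header_lkp$Difficulty)

# Pass 2: skip the header rows and the IDNum line, then read the answers
body <- read.table(text = raw, skip = 5, header = FALSE,
                   col.names = c("IDNum", header_lkp$Question))

tidy <- body %>%
  # wide to long: one row per student and question
  pivot_longer(-IDNum, names_to = "Question", values_to = "Correct") %>%
  # attach Topic, Subtopic and Difficulty from the lookup
  left_join(header_lkp, by = "Question") %>%
  select(IDNum, Question, Topic, Subtopic, Difficulty, Correct)
```

For a real file, the text = raw argument would be replaced by the file path in both read.table() calls. Reading the student rows on their own also lets read.table() parse them as integers directly, so no conversion step is needed afterwards.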