How to tidy this messy dataset in R

I'm pretty new to using tidyr, dplyr, etc., and I have some data I can't figure out how to tidy in R.

Variables are mixed up in rows and columns, and the spreadsheet looks like it's split in two: the top rows and the bottom rows hold different kinds of information.

A simplified version of it is below.

You can imagine this is an exam with 4 questions:

  • The top few rows give some information about each question
  • The bottom rows show whether each student (identified by IDNum) got each question correct (1) or wrong (0).

Here is the raw data:

Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum               
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1
246         1          1          0          1
248         1          0          1          0
136         1          1          1          1
290         0          1          1          1
753         1          0          0          0
752         1          0          1          1

I'd like to tidy this data set. It would look something like the following:

IDNum   Question    Topic   Subtopic    Difficulty  Correct
512     Q1          English Grammar     2           1
512     Q2          English Vocabulary  4           1
512     Q3          Math    Algebra     3           1
512     Q4          Math    Geometry    4           0
102     Q1          English Grammar     2           0
102     Q2          English Vocabulary  4           1
102     Q3          Math    Algebra     3           0
102     Q4          Math    Geometry    4           1
321     Q1          English Grammar     2           1
321     Q2          English Vocabulary  4           1
321     Q3          Math    Algebra     3           1
321     Q4          Math    Geometry    4           1

and so on.

Thank you!

It's not entirely clear which format you have the data in, but hopefully the following will help:

data

df <- read.table(text="
Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum       ''        ''          ''         ''
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1
246         1          1          0          1
248         1          0          1          0
136         1          1          1          1
290         0          1          1          1
753         1          0          0          0
752         1          0          1          1", header = FALSE, stringsAsFactors = FALSE)

solution

library(tidyverse)
df %>%
  # collapse the first rows into column names to prepare for gather/separate combo
  setNames(apply(.[1:4,],2,paste,collapse="|")) %>% 
  rename_at(1,~"IDNum")   %>%
  # remove useless rows
  slice(-(1:5))           %>%
  # change IDNum to factor, only useful if the order of IDNum is important (probably it's not)
  mutate_at("IDNum",~factor(.x,levels=unique(.x))) %>%
  # wide to long
  gather(key,correct,-1)  %>%
  # build your columns (convert = TRUE makes Difficulty an integer)
  separate(key, df[1:4,1], sep = "\\|", convert = TRUE) %>%
  # convert correct to numeric
  mutate_at("correct",as.numeric) %>%
  # sort
  arrange(IDNum)

# # A tibble: 36 x 6
#     IDNum Question   Topic   Subtopic Difficulty correct
#    <fctr>    <chr>   <chr>      <chr>      <int>   <dbl>
#  1    512       Q1 English    Grammar          2       1
#  2    512       Q2 English Vocabulary          4       1
#  3    512       Q3    Math    Algebra          3       1
#  4    512       Q4    Math   Geometry          4       0
#  5    102       Q1 English    Grammar          2       0
#  6    102       Q2 English Vocabulary          4       1
#  7    102       Q3    Math    Algebra          3       0
#  8    102       Q4    Math   Geometry          4       1
#  9    321       Q1 English    Grammar          2       1
# 10    321       Q2 English Vocabulary          4       1
# # ... with 26 more rows
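
For readers on a recent tidyr (1.0.0 or later is assumed here), the same collapse-then-split idea can be sketched with pivot_longer(), which supersedes gather(). This is only an illustrative variant, not the answer's own code: the helper names new_names, core and tidy are made up for the example, and IDNum is left as character rather than converted to a factor.

```r
library(dplyr)
library(tidyr)

# Same toy data as above; every column is read as character
df <- read.table(text = "
Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum       ''        ''          ''         ''
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1
246         1          1          0          1
248         1          0          1          0
136         1          1          1          1
290         0          1          1          1
753         1          0          0          0
752         1          0          1          1", header = FALSE, stringsAsFactors = FALSE)

# Collapse the four header rows into one name per question column,
# e.g. "Q1|English|Grammar|2" (hypothetical helper name: new_names)
new_names <- unname(vapply(df[1:4, -1], paste, character(1), collapse = "|"))

# Keep only the student rows and attach the collapsed names
core <- df[-(1:5), ]
names(core) <- c("IDNum", new_names)

tidy <- core %>%
  pivot_longer(
    cols      = -IDNum,
    names_to  = df[1:4, 1],  # "Question", "Topic", "Subtopic", "Difficulty"
    names_sep = "\\|",       # split the collapsed names back apart
    values_to = "Correct"
  ) %>%
  mutate(Difficulty = as.integer(Difficulty),
         Correct    = as.integer(Correct))

print(tidy)
```

pivot_longer() keeps the rows in their original order, so the students appear in the same order as in the raw data without the factor step.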

Another way, with a few more steps but maybe more intuitive, is to separate the header from the core of the table at the start.

We create a lookup from the header (which we transpose) and use it on the gathered data later:

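# lookup table: one row per question, built from the transposed header rows
header_lkp <-
  as_tibble(t(df[1:4,])) %>%
  setNames(unlist(.[1,])) %>%
  slice(-1)

# core table: one row per student, one column per question
df_core <-
  df %>%
  setNames(unlist(.[1,])) %>%
  slice(-(1:5))   %>%
  rename_at(1,~"IDNum") %>%
  mutate_at("IDNum",~factor(.x,levels=unique(.x)))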

df_core %>%
  gather(Question,correct,-IDNum) %>%
  mutate_at("correct",as.numeric) %>%
  left_join(header_lkp,by="Question") %>%
  arrange(IDNum)

(same output)
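
The lookup approach can also be sketched without the superseded gather(), rename_at() and mutate_at() verbs. The version below is a rough equivalent under a few assumptions: tidyr 1.0.0 or later is installed, the lookup is built with base R instead of as_tibble(), and the names core and tidy are illustrative only. Unlike the code above, it converts Difficulty to integer explicitly and reorders the columns to match the layout asked for in the question.

```r
library(dplyr)
library(tidyr)

# Same toy data as above; every column is read as character
df <- read.table(text = "
Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum       ''        ''          ''         ''
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1
246         1          1          0          1
248         1          0          1          0
136         1          1          1          1
290         0          1          1          1
753         1          0          0          0
752         1          0          1          1", header = FALSE, stringsAsFactors = FALSE)

# Header block, transposed so that each question becomes one row
header_lkp <- as.data.frame(t(df[1:4, -1]), stringsAsFactors = FALSE)
names(header_lkp) <- df[1:4, 1]  # "Question", "Topic", "Subtopic", "Difficulty"
header_lkp$Difficulty <- as.integer(header_lkp$Difficulty)

# Student block: one row per student, one column per question
core <- df[-(1:5), ]
names(core) <- c("IDNum", unlist(df[1, -1], use.names = FALSE))

tidy <- core %>%
  pivot_longer(-IDNum, names_to = "Question", values_to = "Correct") %>%
  mutate(Correct = as.integer(Correct)) %>%
  left_join(header_lkp, by = "Question") %>%
  select(IDNum, Question, Topic, Subtopic, Difficulty, Correct)

print(tidy)
```

Keeping the question metadata in its own table means it is stored once per question and only repeated at join time, which can be convenient if the metadata needs editing later.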
