
Data cleaning question: reading from Excel files in R (1)

I have a data cleaning puzzle. The student dataset needs to be converted to a long dataset.

Here is an example that I read from an Excel file.

    df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
                            "Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
                            "Item 1: Time in seconds","Item 1: Errors",
                            "Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
                            "Item 2: Time in seconds","Item 2: Errors"),
                  School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
                             0,"1 = Incorrect responses",0,1,NA,NA,
                             "[Number of seconds, ex. 87]",0,
                             "1 = Incorrect responses",0,NA,NA,1,1,0,
                             "[Number of seconds, ex. 87]",0),
                 X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
                               NA,"Child response",NA,NA,NA,NA,
                               NA,NA,
                               "Child response",NA,NA,NA,NA,NA,NA,
                               NA,NA),
                 School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,
                               0,"1 = Incorrect responses",0,1,NA,NA,
                               120,0,
                               "1 = Incorrect responses",NA,1,0,1,NA,1,
                               110,0),
                 Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,
                                         NA,"Child response",NA,NA,NA,NA,
                                         NA,NA,
                                         "Child response",NA,NA,NA,NA,NA,NA,
                                         NA,NA))

    > df
                        Text_1                    School.2 X_Elementary_School..3                School.4 Y_Elementary_School..2
    1                  Scoring                    Teacher:                  Bill:                Teacher:                  John:
    2            1 = Incorrect                    DC Name:             X District                DC Name:             X District
    3                    Text1          Date (mm/dd/yyyy):                10/7/21      Date (mm/dd/yyyy):                11/7/21
    4                    Text2                Child Grade:                      K            Child Grade:                      K
    5                    Text3           Student Study ID:            123-2222-2:       Student Study ID:            112-1111-3:
    6                    Text4                        <NA>                   <NA>                    <NA>                   <NA>
    7     Demo 1: Color Naming                        <NA>                   <NA>                       0                   <NA>
    8                 Amarillo                        <NA>                   <NA>                    <NA>                   <NA>
    9                     Azul                        <NA>                   <NA>                       1                   <NA>
    10                   Verde                        <NA>                   <NA>                    <NA>                   <NA>
    11                    Azul                        <NA>                   <NA>                    <NA>                   <NA>
    12          Demo 1: Errors                           0                   <NA>                       0                   <NA>
    13    Item 1: Color naming     1 = Incorrect responses         Child response 1 = Incorrect responses         Child response
    14                Amarillo                           0                   <NA>                       0                   <NA>
    15                    Azul                           1                   <NA>                       1                   <NA>
    16                   Verde                        <NA>                   <NA>                    <NA>                   <NA>
    17                    Azul                        <NA>                   <NA>                    <NA>                   <NA>
    18 Item 1: Time in seconds [Number of seconds, ex. 87]                   <NA>                     120                   <NA>
    19          Item 1: Errors                           0                   <NA>                       0                   <NA>
    20    Item 2: Shape Naming     1 = Incorrect responses         Child response 1 = Incorrect responses         Child response
    21         Cuadrado/Cuadro                           0                   <NA>                    <NA>                   <NA>
    22         Cuadrado/Cuadro                        <NA>                   <NA>                       1                   <NA>
    23                 Círculo                        <NA>                   <NA>                       0                   <NA>
    24                Estrella                           1                   <NA>                       1                   <NA>
    25                 Círculo                           1                   <NA>                    <NA>                   <NA>
    26               Triángulo                           0                   <NA>                       1                   <NA>
    27 Item 2: Time in seconds [Number of seconds, ex. 87]                   <NA>                     110                   <NA>
    28          Item 2: Errors                           0                   <NA>                       0                   <NA>

This sample dataset is limited to two schools, two teachers, and two students.

I need to grab four of the first five rows (all except the third row, which holds the date), because they contain the demographic and school information. The information starts in the third column and continues in every second column after that (3, 5, ...), as long as the column contains information. Sometimes a column has no information, in which case I need to drop it.

I would like to automate this process, but for now I have written code that picks the rows and columns manually.

This is what I did for that step:
    ################################################################################
    # @@ 1-extract demographics information
    # @ 1- grab rows 1, 2, 4, 5 and the school columns
    test <- df[c(1, 2, 4, 5), seq(3, ncol(df), 2)] # rows 1, 2, 4, 5 and columns 3, 5, ...
    test <- as.data.frame(t(test))
    # remove empty rows that came from empty columns before the transpose;
    # I need only the filled columns
    test <- test[rowSums(is.na(test)) != ncol(test), ]
    test$school <- rownames(test)
    names(test) <- c('teacher', 'DC', 'grade', 'ssid', 'school') # assign column names
    rownames(test) <- seq(1, nrow(test), 1)
    test$school <- gsub("\\.+[0-9]*$", "", test$school) # remove the trailing dots and numbers from school names
    test <- test[, c(5, 1:4)] # change column order
    test <- apply(test, 2, trimws); test

    > test
      X_Elementary_School..3 Y_Elementary_School..2
    1                  Bill:                  John:
    2             X District             X District
    4                      K                      K
    5            123-2222-2:            112-1111-3:

Any ideas?

Thanks!

Assuming the data you want is always in the same row positions, you can simply subset the table by position. Here, the columns are every other column, starting from column 3.

    library(tidyverse)

    # every other column, starting from column 3: 3, 5, ...
    select_cols <- seq(from = 3, to = ncol(df), by = 2)
    # rows 1, 2, 4, 5 (row 3, the date row, is skipped)
    slice_rows <- c(1:2, 4:5)

    df %>%
      select(any_of(select_cols)) %>%
      slice(slice_rows)
    #>   X_Elementary_School..3 Y_Elementary_School..2
    #> 1                  Bill:                  John:
    #> 2             X District             X District
    #> 3                      K                      K
    #> 4            123-2222-2:            112-1111-3:

Created on 2022-09-21 by the reprex package (v2.0.1)
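If the goal is one row per student rather than one column per student, the positional subset can be followed by a transpose and some light cleanup. The following is only a minimal base R sketch under a few assumptions: `hdr` is a cut-down stand-in for the header rows of `df` from the question, `School.6` and `Empty_School..7` are hypothetical columns added to show how an all-`NA` column gets dropped, and stripping the trailing colons from values such as `Bill:` is an optional extra that the question does not ask for.

```r
# Cut-down stand-in for the question's df: only the five header rows.
# School.6 and Empty_School..7 are hypothetical, added to show the drop step.
labels <- c("Teacher:", "DC Name:", "Date (mm/dd/yyyy):",
            "Child Grade:", "Student Study ID:")
hdr <- data.frame(
  Text_1                 = c("Scoring", "1 = Incorrect", "Text1", "Text2", "Text3"),
  School.2               = labels,
  X_Elementary_School..3 = c("Bill:", "X District", "10/7/21", "K", "123-2222-2:"),
  School.4               = labels,
  Y_Elementary_School..2 = c("John:", "X District", "11/7/21", "K", "112-1111-3:"),
  School.6               = labels,
  Empty_School..7        = rep(NA_character_, 5)
)

rows <- c(1, 2, 4, 5)                           # skip row 3 (the date row)
cols <- seq(from = 3, to = ncol(hdr), by = 2)   # columns 3, 5, 7

block <- hdr[rows, cols, drop = FALSE]

# Drop columns that contain no information at all.
keep  <- colSums(!is.na(block)) > 0
block <- block[, keep, drop = FALSE]

# One row per student: transpose, then name the fields.
students <- as.data.frame(t(block), stringsAsFactors = FALSE)
names(students) <- c("teacher", "DC", "grade", "ssid")

# School name = column name without the trailing dots and digits.
students$school <- sub("\\.+[0-9]*$", "", rownames(students))
rownames(students) <- NULL

# Optional: strip trailing colons and surrounding whitespace from the values.
students[] <- lapply(students, function(x) trimws(sub(":$", "", x)))

students <- students[, c("school", "teacher", "DC", "grade", "ssid")]
students
```

Dropping the empty columns before the transpose keeps the field names aligned with the four selected rows. With the real data, `hdr` would be replaced by `df`, and the row positions would still need to match the layout of the Excel sheet.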
