Select multiple columns and reshape wide to long

I have a wide dataset relating to cases and their contacts. (This is a made-up example; the real dataset is much larger.)

structure(list(record_id = structure(1:4, .Label = c("01-001", 
"01-002", "01-003", "01-004"), class = "factor"), place = structure(c(1L, 
2L, 1L, 1L), .Label = c("a", "b"), class = "factor"), sex = structure(c(2L, 
2L, 1L, 2L), .Label = c("F", "M"), class = "factor"), age = c(4L, 
13L, 28L, 44L), d02_1 = c(2L, 2L, NA, 2L), d02_2 = structure(c(3L, 
2L, 1L, 3L), .Label = c("", "F", "M"), class = "factor"), d02_3 = c(27L, 
16L, NA, 66L), d03_1 = c(3L, 3L, NA, 3L), d03_2 = structure(c(3L, 
3L, 1L, 2L), .Label = c("", "F", "M"), class = "factor"), d03_3 = c(14L, 
55L, NA, 12L), d04_1 = c(4L, NA, NA, NA), d04_2 = structure(c(2L, 
1L, 1L, 1L), .Label = c("", "M"), class = "factor"), d04_3 = c(7L, 
NA, NA, NA)), .Names = c("record_id", "place", "sex", "age", 
"d02_1", "d02_2", "d02_3", "d03_1", "d03_2", "d03_3", "d04_1", 
"d04_2", "d04_3"), row.names = c(NA, -4L), class = "data.frame")

Where:

  • record_id is the unique identifier of the case
  • place is the place where the case lives
  • age is the case's age
  • sex is the case's sex

  • d02_1, d03_1, d04_1 ... d0j_1 are the contacts' ids
  • d02_2, d03_2, d04_2 ... d0j_2 are the contacts' sex
  • d02_3, d03_3, d04_3 ... d0j_3 are the contacts' age

In the real dataset, there are potentially many contacts per case, and many more variables relating to the contacts' characteristics. Not all cases will have contacts.

I want to reshape the dataset to a tidy format, with one row per case/contact, i.e.:

         id case place sex age
1    01-001    1     a   M   4
2  01-001-2    0     a   M  27
3  01-001-3    0     a   M  14
4  01-001-4    0     a   M   7
5    01-002    1     b   M  13
6  01-002-2    0     b   F  16
7  01-002-3    0     b   M  55
8    01-003    1     a   F  28
9    01-004    1     a   M  44
10 01-004-2    0     a   M  66
11 01-004-3    0     a   F  12

I am thinking that I will need to create vectors of column names relating to each contact (potentially using character matching on column names), select these columns sequentially, and append them to each other (as well as concatenating the case/contact ids), but I am really struggling to do this without copying lots and lots of lines of code. There must be a more efficient method?

Is this what you are looking for?

It is a dplyr solution that is ugly for a number of reasons, but I think it gets the job done.

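library(dplyr)
library(tidyr)

DF <- DF %>%
  # Give the contact columns descriptive suffixes
  # (rename_() is deprecated in later dplyr versions)
  rename_(.dots=setNames(names(.), gsub('_1','_ContactID',names(.)))) %>%
  rename_(.dots=setNames(names(.), gsub('_2','_sex',names(.)))) %>%
  rename_(.dots=setNames(names(.), gsub('_3','_age',names(.)))) %>%
  # Treat the case itself as level d00 with ContactID 1
  rename(d00_sex=sex,d00_age=age) %>%
  mutate(d00_ContactID=1) %>%
  # Wide to long; every value is coerced to character here
  gather(Var,Val,-record_id,-place) %>%
  mutate(Val =ifelse(Val=='',NA,Val)) %>%
  # Split e.g. "d02_age" into ContactLevel "d02" and Var "age"
  separate(Var,c('ContactLevel','Var'),sep='_') %>%
  # One column each for ContactID, age and sex
  spread(Var,Val) %>%
  arrange(record_id,ContactLevel) %>%
  # Drop case/contact slots that hold no data
  filter(!is.na(age),!is.na(ContactID),!is.na(sex)) %>%
  mutate(age = as.numeric(age))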

I start off by renaming your variables for clarity (the rename_ lines).

Next, I put your case info variables into a consistent pattern where the case info is ContactID=1 (the rename and mutate lines).

gather turns the data from wide to long, but leaves us with one very ugly column and converts all your data to character. (This is the ugly part where the warning is triggered.)
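
The coercion to character can be seen on a small toy data frame. This is only a sketch: the data frame toy and its values are made up for illustration, and it uses character rather than factor columns, so it shows the type change without the warning.

```r
library(tidyr)

# Hypothetical toy data: one integer column and one character column
toy <- data.frame(id = 1:2,
                  d02_age = c(27L, 16L),
                  d02_sex = c("M", "F"),
                  stringsAsFactors = FALSE)

# Stack every column except id into key/value pairs
long <- gather(toy, Var, Val, -id)

# Val has to hold both ages and sexes, so it ends up as character
str(long)
```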

separate splits the old column names into the contact level (d00, d02, ...) and the name of the data column (ContactID, age or sex).
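
As a small illustration of that split, here is separate applied to a few renamed column names on their own. The data frame keys is made up for this sketch.

```r
library(tidyr)

# Hypothetical key column holding renamed column names
keys <- data.frame(Var = c("d00_age", "d02_sex", "d03_ContactID"),
                   stringsAsFactors = FALSE)

# Split at the underscore: the first piece is the contact level,
# the second piece is the variable name
pieces <- separate(keys, Var, c("ContactLevel", "Var"), sep = "_")
print(pieces)
```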

spread then opens up the age, sex and ID into columns again. At this point the data are in the shape you want, but can still be cleaned up a bit.
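
A minimal sketch of that step on hand-written long data (the data frame long below is invented for illustration). Note that the values are still character afterwards.

```r
library(tidyr)

# Hypothetical long data for one case (d00) and one contact (d02)
long <- data.frame(
  record_id    = "01-001",
  ContactLevel = rep(c("d00", "d02"), each = 3),
  Var          = rep(c("ContactID", "age", "sex"), times = 2),
  Val          = c("1", "4", "M", "2", "27", "M"),
  stringsAsFactors = FALSE
)

# One row per record_id/ContactLevel, one column per distinct value of Var
wide <- spread(long, Var, Val)
print(wide)
```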

arrange is not necessary, but it puts all of the rows for each record ID together.

filter is also not necessary; it just removes the rows with no contact information.

Finally, I use mutate to turn age from character to numeric. If you wish, you can also turn sex into a factor here, and possibly the contact ID as well.
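
For comparison, here is a sketch of one way to reach the id/case/place/sex/age layout asked for in the question with pivot_longer. It is not part of the pipeline above and rests on a few assumptions: tidyr 1.0.0 or later is available, the contact columns follow the dNN_1 / dNN_2 / dNN_3 naming, and the example data is rebuilt with character columns instead of factors. The intermediate names cases, contacts, contact and contact_id are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question, rebuilt with character columns (assumption)
DF <- data.frame(
  record_id = c("01-001", "01-002", "01-003", "01-004"),
  place = c("a", "b", "a", "a"),
  sex   = c("M", "M", "F", "M"),
  age   = c(4L, 13L, 28L, 44L),
  d02_1 = c(2L, 2L, NA, 2L), d02_2 = c("M", "F", "", "M"), d02_3 = c(27L, 16L, NA, 66L),
  d03_1 = c(3L, 3L, NA, 3L), d03_2 = c("M", "M", "", "F"), d03_3 = c(14L, 55L, NA, 12L),
  d04_1 = c(4L, NA, NA, NA), d04_2 = c("M", "", "", ""),   d04_3 = c(7L, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Cases: one row each, flagged case = 1; contact_id 0 is only a sort key
cases <- DF %>%
  mutate(contact_id = 0L, id = record_id, case = 1L) %>%
  select(record_id, contact_id, id, case, place, sex, age)

# Contacts: ".value" turns the suffixes 1/2/3 into three value columns,
# so ids, sexes and ages are never mixed in a single column
contacts <- DF %>%
  select(record_id, place, matches("^d[0-9]+_[123]$")) %>%
  pivot_longer(
    cols = -c(record_id, place),
    names_to = c("contact", ".value"),
    names_pattern = "^(d[0-9]+)_([123])$"
  ) %>%
  rename(contact_id = "1", sex = "2", age = "3") %>%
  filter(!is.na(contact_id)) %>%                 # drop empty contact slots
  mutate(id = paste(record_id, contact_id, sep = "-"), case = 0L) %>%
  select(record_id, contact_id, id, case, place, sex, age)

# Stack cases and contacts, with each case directly above its contacts
tidy <- bind_rows(cases, contacts) %>%
  arrange(record_id, contact_id) %>%
  select(id, case, place, sex, age)

print(tidy)
```

With more contact variables in the real data, the two regular expressions and the rename call would need to cover the extra suffixes.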
