Tidy Data: Rename columns, get non-NA column names, then gather

I've got a rather ugly bit of data to tidy up and need help! What my data look like now:

library(tidyverse)  # provides tibble(), str_replace() and the dplyr/tidyr verbs used below

countries <- c("Austria", "Belgium", "Croatia")

df <- tibble("age" = c(28,42,19, 67),
         "1_recreate_1"=c(NA,15,NA,NA), 
         "1_recreate_2"=c(NA,10,NA,NA), 
         "1_recreate_3"=c(NA,8,NA,NA),
         "1_recreate_4"=c(NA,4,NA,NA),
         "1_fairness" = c(NA, 7, NA, NA),
         "1_confidence" = c(NA, 5, NA, NA),
         "2_recreate_1"=c(29,NA,NA,30),
         "2_recreate_2"=c(20,NA,NA,24),
         "2_recreate_3"=c(15,NA,NA,15),
         "2_recreate_4"=c(11,NA,NA,9),
         "2_fairness" = c(4, NA, NA, 1),
         "2_confidence" = c(5, NA, NA, 4),
         "3_recreate_1"=c(NA,NA,50,NA), 
         "3_recreate_2"=c(NA,NA,40,NA), 
         "3_recreate_3"=c(NA,NA,30,NA),
         "3_recreate_4"=c(NA,NA,20,NA),
         "3_fairness" = c(NA,  NA, 2, NA),
         "3_confidence" = c(NA, NA, 2, NA),
         "overall" = c(3,3,2,5))    

What I need them to look like at the end (hard-coded):

df <- tibble(age = rep(c(28,42,19,67), each=4),
         country = rep(c("Belgium", "Austria", "Croatia", "Belgium"), each=4),
         recreate = rep(1:4, times=4),
         fairness = rep(c(4,7,2,1), each=4),
         confidence = rep(c(5,5,2,4), each=4),     
         allocation = c(29, 20, 15, 11,
                        15, 10, 8, 4,
                        50, 40, 30, 20, 
                        30, 24, 15, 9),
         overall = rep(c(3,3,2,5), each=4))

Steps to get there (I think!):

1. Replace the starting numbers for those columns using my list of countries.
The number that starts the string is the index in countries. In other words, 16_recreate_1 would correspond to the 16th country in the vector countries. I think the following code works (though I am not sure it's exactly right):

for(i in length(countries):1){
    colnames(df) <- str_replace(colnames(df), paste0(i,"_"), paste0(countries[i],"_"))
}  
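
The loop above counts down, so that a prefix such as 16_ is replaced before 6_ can match inside it. A minimal sketch of an alternative that anchors the pattern at the start of the name instead; rename_by_index is a hypothetical helper name, not something from the original post:

```r
# Hypothetical helper: swap a leading "<index>_" for "<country>_".
rename_by_index <- function(nms, countries) {
  # Leading digits as an integer; names without a numeric prefix become NA.
  idx <- suppressWarnings(as.integer(sub("^(\\d+)_.*$", "\\1", nms)))
  has_prefix <- !is.na(idx) & grepl("^\\d+_", nms)
  # Drop the digits, keep the rest of the name, and prepend the country.
  nms[has_prefix] <- paste0(countries[idx[has_prefix]],
                            sub("^\\d+", "", nms[has_prefix]))
  nms
}

countries <- c("Austria", "Belgium", "Croatia")
rename_by_index(c("age", "1_recreate_1", "3_fairness", "overall"), countries)
```

Because the whole prefix is matched at once, the order in which countries are processed no longer matters. It could be applied with colnames(df) <- rename_by_index(colnames(df), countries).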

2. Create a new variable called "country" by getting the name of the column(s) that is NOT NA for each row.

I tried a BUNCH of experimentation with which.max and names, but couldn't get it fully functional.
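
One way this step could look in base R, as a rough sketch on a cut-down stand-in for the data (the small data frame d is made up for illustration). It assumes every row has at least one non-NA country column:

```r
countries <- c("Austria", "Belgium", "Croatia")

# Stand-in data: one fairness column per country index.
d <- data.frame(`1_fairness` = c(NA, 7, NA, NA),
                `2_fairness` = c(4, NA, NA, 1),
                `3_fairness` = c(NA, NA, 2, NA),
                check.names = FALSE)

# max.col() gives, per row, the position of the largest value. On the
# TRUE/FALSE matrix of "is not NA" that is the first non-NA column.
# Caveat: a row that is NA everywhere would silently get column 1.
non_na_col <- max.col(!is.na(d), ties.method = "first")

# The leading digits of that column's name index into `countries`.
idx <- as.integer(sub("^(\\d+)_.*$", "\\1", names(d)[non_na_col]))
d$country <- countries[idx]
d$country
```

For the four rows here this yields Belgium, Austria, Croatia, Belgium, which matches the country column in the hard-coded target.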

3. Create new variables (recreate_1 ... recreate_4) that grab the [country_name]_recreate_1 ... [country_name]_recreate_4 value for each row, from whichever country is non-NA for that person.

Maybe rowSums is the way to do this?
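
A small sketch of what the rowSums idea might look like for a single item, again on made-up stand-in data. It relies on the assumption that exactly one country block is filled in per row:

```r
# Stand-in data: the recreate_1 item for each of three country indices.
d <- data.frame(`1_recreate_1` = c(NA, 15, NA, NA),
                `2_recreate_1` = c(29, NA, NA, 30),
                `3_recreate_1` = c(NA, NA, 50, NA),
                check.names = FALSE)

# Positions of the columns holding this item across all countries.
cols <- grep("_recreate_1$", names(d))

# With exactly one non-NA value per row, the NA-ignoring sum is that value.
# Caveat: a row that is NA everywhere would come back as 0, not NA.
d$recreate_1 <- rowSums(d[, cols, drop = FALSE], na.rm = TRUE)
d$recreate_1
```

The same call would have to be repeated for recreate_2 through recreate_4, which is part of why reshaping to long format first tends to be less work.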

4. Make the data long instead of wide. I think this is going to require gather, but I'm not sure how to gather from only the variables country and recreate_1 ... recreate_4.

I'm so sorry this is so complex. Tidyverse solutions are preferred, but any help is greatly appreciated!

library(dplyr)
library(tidyr)
df %>%
  mutate(rid = row_number()) %>%
  # gather only fairness/confidence; the leading digit indexes `countries`
  gather(key, val, -c(age, overall, rid, matches('recreate'))) %>%
  mutate(country = sub('(^\\d)_.*', '\\1', key), country = countries[as.numeric(country)]) %>%
  filter(!is.na(val)) %>%
  mutate(key = sub('(^\\d\\_)(.*)', '\\2', key)) %>%
  spread(key, val) %>%
  # then gather the recreate columns (note: `country`, lower case, as created above)
  gather(key = recreate, value = allocation, -c(rid, age, overall, country, confidence, fairness)) %>%
  filter(!is.na(allocation)) %>%
  mutate(recreate = sub('.*_(\\d$)', '\\1', recreate))

Here (^\\d)_.* captures the first digit (the country index), while .*_(\\d$) captures the last digit (the recreate item).
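
To see what these patterns do in isolation, here is a short sketch that applies them to a few column names taken from the question's data:

```r
key <- c("1_fairness", "2_recreate_3", "3_confidence")

# Leading digit: the index into `countries`.
first_digit <- sub("(^\\d)_.*", "\\1", key)

# Name with the "<digit>_" prefix dropped.
stripped <- sub("(^\\d\\_)(.*)", "\\2", key)

# Trailing digit: which recreate item the column holds.
last_digit <- sub(".*_(\\d$)", "\\1", "2_recreate_3")

first_digit
stripped
last_digit
```

Note that \\d matches a single digit, so with ten or more countries (such as the 16_recreate_1 case mentioned in the question) the prefix patterns would presumably need \\d+ instead.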

A somewhat different tidyverse possibility could be:

df %>%
 gather(variable, allocation, na.rm = TRUE) %>%
 separate(variable, c("ID", "variable", "recreate"), convert = TRUE) %>%
 left_join(data.frame(countries) %>%
            mutate(country = countries,
                   ID = seq_along(countries)) %>%
            select(-countries), by = c("ID" = "ID")) %>%
 select(-variable, -ID) 

   recreate allocation country
      <int>      <dbl> <fct>  
 1        1         15 Austria
 2        2         10 Austria
 3        3          8 Austria
 4        4          4 Austria
 5        1         29 Belgium
 6        1         30 Belgium
 7        2         20 Belgium
 8        2         24 Belgium
 9        3         15 Belgium
10        3         15 Belgium
11        4         11 Belgium
12        4          9 Belgium
13        1         50 Croatia
14        2         40 Croatia
15        3         30 Croatia
16        4         20 Croatia

This first transforms the data from wide to long format, removing the rows with NA. Second, it separates the variable names into three columns. Third, it turns the vector of countries into a data frame and assigns each country a unique ID. Finally, it joins the two and removes the redundant variables.
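
The lookup-and-join part of this can be sketched on its own. The version below builds the lookup with tibble() rather than data.frame(), which is a swapped-in choice so that country stays a character column on any R version; the long table is a made-up stand-in for the output of gather() and separate():

```r
library(dplyr)
library(tibble)

countries <- c("Austria", "Belgium", "Croatia")

# Lookup table: one row per country, keyed by its position in the vector.
lookup <- tibble(ID = seq_along(countries), country = countries)

# Stand-in for the long data after gather() and separate().
long <- tibble(ID = c(1L, 2L, 2L, 3L),
               recreate = c(1L, 1L, 2L, 1L),
               allocation = c(15, 29, 20, 50))

# Attach the country name by ID, then drop the now-redundant key.
joined <- long %>%
  left_join(lookup, by = "ID") %>%
  select(-ID)
joined
```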

A solution to the edited question:

df %>%
 select(matches("(recreate)")) %>%
 rowid_to_column() %>%
 gather(var, allocation, -rowid, na.rm = TRUE) %>%
 separate(var, c("ID", "var", "recreate"), convert = TRUE) %>%
 select(-var) %>%
 left_join(data.frame(countries) %>%
            mutate(country = countries,
                   ID = seq_along(countries)) %>%
            select(-countries), by = c("ID" = "ID")) %>% 
 left_join(df %>%
            select(-matches("(recreate)")) %>%
            rowid_to_column() %>%
            gather(var, val, -rowid, na.rm = TRUE) %>%
            mutate(var = gsub("[^[:alpha:]]", "", var)) %>%
            spread(var, val), by = c("rowid" = "rowid")) %>%
 select(-rowid, -ID)

   recreate allocation country   age confidence fairness overall
      <int>      <dbl> <fct>   <dbl>      <dbl>    <dbl>   <dbl>
 1        1         15 Austria    42          5        7       3
 2        2         10 Austria    42          5        7       3
 3        3          8 Austria    42          5        7       3
 4        4          4 Austria    42          5        7       3
 5        1         29 Belgium    28          5        4       3
 6        1         30 Belgium    67          4        1       5
 7        2         20 Belgium    28          5        4       3
 8        2         24 Belgium    67          4        1       5
 9        3         15 Belgium    28          5        4       3
10        3         15 Belgium    67          4        1       5
11        4         11 Belgium    28          5        4       3
12        4          9 Belgium    67          4        1       5
13        1         50 Croatia    19          2        2       2
14        2         40 Croatia    19          2        2       2
15        3         30 Croatia    19          2        2       2
16        4         20 Croatia    19          2        2       2

This first selects the columns that contain recreate and adds a column with the row ID. Second, it follows the steps from the original solution. Third, it selects the columns that do not contain recreate, performs a wide-to-long transformation, removes the numeric prefix from the column names, and spreads the data back to wide format. Finally, it joins the two on row ID and removes the redundant variables.
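
For comparison, the same reshaping can be sketched with pivot_longer() and pivot_wider(), which superseded gather() and spread() in tidyr 1.0.0. This is not what the answers above use; it is an alternative under the same assumptions, shown on a two-row stand-in for the data:

```r
library(dplyr)
library(tidyr)
library(tibble)

countries <- c("Austria", "Belgium", "Croatia")

# Stand-in data: two people, two recreate items, one fairness score.
d <- tibble(age = c(28, 42),
            `1_recreate_1` = c(NA, 15), `1_recreate_2` = c(NA, 10),
            `1_fairness` = c(NA, 7),
            `2_recreate_1` = c(29, NA), `2_recreate_2` = c(20, NA),
            `2_fairness` = c(4, NA))

res <- d %>%
  rowid_to_column() %>%
  # Split "<index>_<name>" into two columns and drop the NA cells.
  pivot_longer(matches("^\\d+_"),
               names_to = c("ID", "var"),
               names_pattern = "^(\\d+)_(.*)$",
               values_drop_na = TRUE) %>%
  mutate(country = countries[as.integer(ID)]) %>%
  select(-ID) %>%
  # Back to one row per person, now without the country prefix.
  pivot_wider(names_from = var, values_from = value) %>%
  # Finally stack the recreate items into recreate/allocation pairs.
  pivot_longer(starts_with("recreate_"),
               names_to = "recreate", names_prefix = "recreate_",
               values_to = "allocation") %>%
  select(-rowid)
res
```

Here recreate comes out as a character column; wrapping it in as.integer() would match the integer column shown in the output above.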
