I've got a rather ugly bit of data to tidy up and need help! What my data look like now:
countries <- c("Austria", "Belgium", "Croatia")
df <- tibble("age" = c(28,42,19, 67),
"1_recreate_1"=c(NA,15,NA,NA),
"1_recreate_2"=c(NA,10,NA,NA),
"1_recreate_3"=c(NA,8,NA,NA),
"1_recreate_4"=c(NA,4,NA,NA),
"1_fairness" = c(NA, 7, NA, NA),
"1_confidence" = c(NA, 5, NA, NA),
"2_recreate_1"=c(29,NA,NA,30),
"2_recreate_2"=c(20,NA,NA,24),
"2_recreate_3"=c(15,NA,NA,15),
"2_recreate_4"=c(11,NA,NA,9),
"2_fairness" = c(4, NA, NA, 1),
"2_confidence" = c(5, NA, NA, 4),
"3_recreate_1"=c(NA,NA,50,NA),
"3_recreate_2"=c(NA,NA,40,NA),
"3_recreate_3"=c(NA,NA,30,NA),
"3_recreate_4"=c(NA,NA,20,NA),
"3_fairness" = c(NA, NA, 2, NA),
"3_confidence" = c(NA, NA, 2, NA),
"overall" = c(3,3,2,5))
What I need them to look like at the end (hard-coding it):
df <- tibble(age = rep(c(28,42,19,67), each=4),
country = rep(c("Belgium", "Austria", "Croatia", "Belgium"), each=4),
recreate = rep(1:4, times=4),
fairness = rep(c(4,7,2,1), each=4),
confidence = rep(c(5,5,2,4), each=4),
allocation = c(29, 20, 15, 11,
15, 10, 8, 4,
50, 40, 30, 20,
30, 24, 15, 9),
overall = rep(c(3,3,2,5), each=4))
Steps to get there (I think!):
1. Replace the starting numbers for those columns using my list of countries.
The number that starts the string is the index in countries. In other words, 16_recreate_1 would correspond with the 16th country in the vector countries. I think the following code works (though I am not sure it's exactly right):
for(i in length(countries):1){
colnames(df) <- str_replace(colnames(df), paste0(i,"_"), paste0(countries[i],"_"))
}
2. Create a new variable called "country" by getting the name of the column(s) that is NOT NA for each row.
I tried a BUNCH of experimentation with which.max and names, but couldn't get it fully functional.
3. Create new variables (recreate_1 ... recreate_4) that grab the [country_name]_recreate_1 ... [country_name]_recreate_4 value for each row, whichever country is non-NA for that person.
Maybe rowSums is the way to do this?
4. Make the data long instead of wide. I think this is going to require gather, but I'm not sure how to gather from only the variables country and recreate_1 ... recreate_4.
I'm so sorry this is so complex. Tidyverse solutions are preferred but any help is greatly appreciated!
library(dplyr)
library(tidyr)
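df %>% mutate(rid = row_number()) %>%
  # gather only the fairness/confidence columns
  gather(key, val, -c(age, overall, rid, matches('recreate'))) %>%
  # the leading number of the column name is the index into countries
  mutate(country = sub('^(\\d+)_.*', '\\1', key),
         country = countries[as.numeric(country)]) %>%
  filter(!is.na(val)) %>%
  mutate(key = sub('^\\d+_(.*)', '\\1', key)) %>%
  spread(key, val) %>%
  # now gather the recreate columns and keep the non-NA allocations
  gather(key = recreate, value = allocation,
         -c(rid, age, overall, country, confidence, fairness)) %>%
  filter(!is.na(allocation)) %>%
  mutate(recreate = sub('.*_(\\d+)$', '\\1', recreate))
Here ^(\\d+)_.* captures the leading number (the country index), while .*_(\\d+)$ captures the trailing number (the recreate index). Using \\d+ rather than a single \\d keeps this working for two-digit indices such as 16_recreate_1.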
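As a quick check of how such patterns behave, here is a minimal base-R sketch. The sample column names (including the two-digit one) are made up for illustration; a single-digit pattern is compared with a `\\d+` variant that also covers indices above 9, such as the 16th country mentioned in the question.

```r
# Hypothetical column names, including a two-digit country index
nms <- c("1_recreate_1", "2_recreate_3", "16_recreate_4")

# Single-digit pattern: fails to match when the index has two digits,
# so sub() returns the name unchanged
single <- sub("(^\\d)_.*", "\\1", nms)

# One-or-more-digits patterns for the leading and trailing numbers
lead  <- sub("^(\\d+)_.*", "\\1", nms)
trail <- sub(".*_(\\d+)$", "\\1", nms)

print(single)
print(lead)
print(trail)
```

Only the `\\d+` version yields a value that `as.numeric()` can turn into an index for every name.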
A somewhat different tidyverse possibility (for a df that holds only the recreate columns, as in the original version of the question) could be:
df %>%
gather(variable, allocation, na.rm = TRUE) %>%
separate(variable, c("ID", "variable", "recreate"), convert = TRUE) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
select(-variable, -ID)
   recreate allocation country
      <int>      <dbl> <fct>
 1        1         15 Austria
 2        2         10 Austria
 3        3          8 Austria
 4        4          4 Austria
 5        1         29 Belgium
 6        1         30 Belgium
 7        2         20 Belgium
 8        2         24 Belgium
 9        3         15 Belgium
10        3         15 Belgium
11        4         11 Belgium
12        4          9 Belgium
13        1         50 Croatia
14        2         40 Croatia
15        3         30 Croatia
16        4         20 Croatia
This first reshapes the data from wide to long format, dropping the rows with NA. Second, it separates the variable names into three columns. Third, it turns the vector of countries into a data frame and assigns each country a unique ID. Finally, it joins the two and removes the redundant variables.
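To make the separating step concrete, here is a small sketch of what `separate()` does to names of this shape. The two-row data frame `x` is invented for illustration; with the default separator, the names are split at the underscores, and `convert = TRUE` turns the numeric pieces into integers.

```r
library(tidyr)

# Hypothetical input: one column holding the original column names
x <- data.frame(variable = c("1_recreate_1", "2_recreate_3"),
                stringsAsFactors = FALSE)

# Split "1_recreate_1" into ID = 1, variable = "recreate", recreate = 1
out <- separate(x, variable, c("ID", "variable", "recreate"), convert = TRUE)
print(out)
```

Because `ID` comes out as an integer, it can be joined directly against `seq_along(countries)`.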
A solution to the edited question:
df %>%
select(matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, allocation, -rowid, na.rm = TRUE) %>%
separate(var, c("ID", "var", "recreate"), convert = TRUE) %>%
select(-var) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
left_join(df %>%
select(-matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, val, -rowid, na.rm = TRUE) %>%
mutate(var = gsub("[^[:alpha:]]", "", var)) %>%
spread(var, val), by = c("rowid" = "rowid")) %>%
select(-rowid, -ID)
   recreate allocation country   age confidence fairness overall
      <int>      <dbl> <fct>   <dbl>      <dbl>    <dbl>   <dbl>
 1        1         15 Austria    42          5        7       3
 2        2         10 Austria    42          5        7       3
 3        3          8 Austria    42          5        7       3
 4        4          4 Austria    42          5        7       3
 5        1         29 Belgium    28          5        4       3
 6        1         30 Belgium    67          4        1       5
 7        2         20 Belgium    28          5        4       3
 8        2         24 Belgium    67          4        1       5
 9        3         15 Belgium    28          5        4       3
10        3         15 Belgium    67          4        1       5
11        4         11 Belgium    28          5        4       3
12        4          9 Belgium    67          4        1       5
13        1         50 Croatia    19          2        2       2
14        2         40 Croatia    19          2        2       2
15        3         30 Croatia    19          2        2       2
16        4         20 Croatia    19          2        2       2
This first selects the columns that contain recreate and adds a column with the row ID. Second, it follows the steps from the original solution. Third, it selects the columns that do not contain recreate, performs a wide-to-long transformation, removes the number from the column names, and spreads the data back to wide format. Finally, it joins the two on row ID and removes the redundant variables.
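For what it's worth, the same reshaping can probably also be written with `pivot_longer()` and `pivot_wider()`, the tidyr functions that supersede `gather()` and `spread()`. The following is only a sketch on a trimmed-down version of the data (two countries, two recreate columns, no confidence column). It assumes tidyr 1.1.0 or later for `names_transform`, and the helper column `rid` is introduced here just to keep the rows apart.

```r
library(dplyr)
library(tidyr)
library(tibble)

countries <- c("Austria", "Belgium")

# Trimmed-down illustration data, not the full data from the question
df <- tibble(
  age = c(28, 42),
  `1_recreate_1` = c(NA, 15), `1_recreate_2` = c(NA, 10),
  `1_fairness`   = c(NA, 7),
  `2_recreate_1` = c(29, NA), `2_recreate_2` = c(20, NA),
  `2_fairness`   = c(4, NA),
  overall = c(3, 3)
)

long <- df %>%
  mutate(rid = row_number()) %>%
  # split "2_recreate_1" into id = "2" and var = "recreate_1", dropping NAs
  pivot_longer(matches("^\\d+_"),
               names_to = c("id", "var"),
               names_pattern = "^(\\d+)_(.*)$",
               values_drop_na = TRUE) %>%
  # the leading number is the index into countries
  mutate(country = countries[as.integer(id)]) %>%
  select(-id) %>%
  # back to one row per person, with country-free column names
  pivot_wider(names_from = var, values_from = value) %>%
  # finally make only the recreate columns long
  pivot_longer(starts_with("recreate_"),
               names_to = "recreate",
               names_prefix = "recreate_",
               names_transform = list(recreate = as.integer),
               values_to = "allocation") %>%
  arrange(rid, recreate)

print(long)
```

The intermediate `pivot_wider()` step is what lets fairness (and, in the full data, confidence) stay as one value per person while only the recreate columns are stacked.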