[英]Tidy Data: Rename columns, get non-NA column names, then gather
我整理的數據非常難看,需要幫助! 我的數據現在看起來像什么:
countries <- c("Austria", "Belgium", "Croatia")
df <- tibble("age" = c(28,42,19, 67),
"1_recreate_1"=c(NA,15,NA,NA),
"1_recreate_2"=c(NA,10,NA,NA),
"1_recreate_3"=c(NA,8,NA,NA),
"1_recreate_4"=c(NA,4,NA,NA),
"1_fairness" = c(NA, 7, NA, NA),
"1_confidence" = c(NA, 5, NA, NA),
"2_recreate_1"=c(29,NA,NA,30),
"2_recreate_2"=c(20,NA,NA,24),
"2_recreate_3"=c(15,NA,NA,15),
"2_recreate_4"=c(11,NA,NA,9),
"2_fairness" = c(4, NA, NA, 1),
"2_confidence" = c(5, NA, NA, 4),
"3_recreate_1"=c(NA,NA,50,NA),
"3_recreate_2"=c(NA,NA,40,NA),
"3_recreate_3"=c(NA,NA,30,NA),
"3_recreate_4"=c(NA,NA,20,NA),
"3_fairness" = c(NA, NA, 2, NA),
"3_confidence" = c(NA, NA, 2, NA),
"overall" = c(3,3,2,5))
我需要它們在最后看起來像什么(對它進行硬編碼):
df <- tibble(age = rep(c(28,42,19,67), each=4),
country = rep(c("Belgium", "Austria", "Croatia", "Belgium"), each=4),
recreate = rep(1:4, times=4),
fairness = rep(c(4,7,2,1), each=4),
confidence = rep(c(5,5,2,4), each=4),
allocation = c(29, 20, 15, 11,
15, 10, 8, 4,
50, 40, 30, 20,
30, 24, 15, 9),
overall = rep(c(3,3,2,5), each=4))
到達那里的步驟(我認為!):
1.使用我的國家/地區列表替換這些列的起始編號。
字符串開頭的數字是countries
的索引。 換句話說, 16_recreate_1
將對應於vector country中的第16 countries
。 我認為以下代碼可以工作(盡管不確定是否完全正確):
for(i in length(countries):1){
colnames(df) <- str_replace(colnames(df), paste0(i,"_"), paste0(countries[i],"_"))
}
2.通過獲取每一行不是NA的列名來創建一個名為“ country”的新變量。
我嘗試了使用which.max
和names
大量實驗,但無法完全發揮作用。
3.創建新變量( recreate_1
... recreate_4
),以獲取每行的[country_name]_recreate_1
... [country_name]_recreate_4
值,無論該人所在國家/地區是否為非NA。
也許rowSums
是做到這一點的方法?
4.使數據變長而不是變寬我認為這將需要gather
,但是我不確定如何僅從變量country
和recreate_1
... recreate_4
進行收集。
很抱歉,這是如此復雜。 Tidyverse解決方案是首選,但任何幫助是極大的贊賞!
library(dplyr)
library(tidyr)
df %>% mutate(rid=row_number()) %>%
gather(key,val,-c(age,overall,rid, matches('recreate'))) %>% mutate(country=sub('(^\\d)_.*','\\1',key),country=countries[as.numeric(country)]) %>%
filter(!is.na(val)) %>% mutate(key=sub('(^\\d\\_)(.*)','\\2',key)) %>%
spread(key,val) %>% gather(key = recreate,value = allocation,-c(rid,age,overall,Country,confidence,fairness)) %>%
filter(!is.na(allocation)) %>% mutate(recreate=sub('.*_(\\d$)','\\1',recreate))
此處(^\\\\d)_.*
表示獲取第一個數字,而.*_(\\\\d$)
表示獲取最后一個數字。
某種不同的tidyverse
可能性可能是:
df %>%
gather(variable, allocation, na.rm = TRUE) %>%
separate(variable, c("ID", "variable", "recreate"), convert = TRUE) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
select(-variable, -ID)
recreate allocation country
<int> <dbl> <fct>
1 1 15 Austria
2 2 10 Austria
3 3 8 Austria
4 4 4 Austria
5 1 29 Belgium
6 1 30 Belgium
7 2 20 Belgium
8 2 24 Belgium
9 3 15 Belgium
10 3 15 Belgium
11 4 11 Belgium
12 4 9 Belgium
13 1 50 Croatia
14 2 40 Croatia
15 3 30 Croatia
16 4 20 Croatia
在這里,它首先將數據從寬格式轉換為長格式,並用NA刪除行。 其次,它將變量名稱分為三列。 第三,它將國家/地區的向量轉換為df,並為每個國家/地區分配一個唯一的ID。 最后,它將兩者合並,並刪除冗余變量。
已編輯問題的解決方案:
df %>%
select(matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, allocation, -rowid, na.rm = TRUE) %>%
separate(var, c("ID", "var", "recreate"), convert = TRUE) %>%
select(-var) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
left_join(df %>%
select(-matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, val, -rowid, na.rm = TRUE) %>%
mutate(var = gsub("[^[:alpha:]]", "", var)) %>%
spread(var, val), by = c("rowid" = "rowid")) %>%
select(-rowid, -ID)
recreate allocation country age confidence fairness overall
<int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 15 Austria 42 5 7 3
2 2 10 Austria 42 5 7 3
3 3 8 Austria 42 5 7 3
4 4 4 Austria 42 5 7 3
5 1 29 Belgium 28 5 4 3
6 1 30 Belgium 67 4 1 5
7 2 20 Belgium 28 5 4 3
8 2 24 Belgium 67 4 1 5
9 3 15 Belgium 28 5 4 3
10 3 15 Belgium 67 4 1 5
11 4 11 Belgium 28 5 4 3
12 4 9 Belgium 67 4 1 5
13 1 50 Croatia 19 2 2 2
14 2 40 Croatia 19 2 2 2
15 3 30 Croatia 19 2 2 2
16 4 20 Croatia 19 2 2 2
首先,在這里選擇包含recreate
的列,並添加具有行ID的列。 其次,它遵循原始解決方案中的步驟。 第三,它選擇不包含recreate
的列,執行從寬到長的數據轉換,從列名中刪除數字,然后將數據轉換回原始的寬格式。 最后,它將兩個行ID結合在一起,並刪除冗余變量。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.