How to create fewer columns from a multitude of columns with NAs?
I have converted a long-format data.frame to wide in order to merge it with another data frame. The conversion produced a lot of NAs, and I would like to eliminate them and create some new columns from the data that does exist.
The long data can have multiple levels for the same ID, and I want all levels in wide format rather than long. Because the long data has more than 40 levels, transforming it to wide with "dcast" gives me many columns that are mostly NA. I have tried many ways to merge these columns to eliminate as many NAs as possible, but none of them worked.
My data looks like this:
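ID | Date | Gender | Age | Name1 | Name2 | Name3 | Name4 | ... | NameN |
------------------------------------------------------------------------
1  | 1/1  | F      | 1   | NA    | Name2 | Name3 | NA    | ... | NameN |
2  | 2/2  | M      | 2   | NA    | NA    | Name3 | NA    | ... | NA    |
3  | 3/3  | F      | 3   | NA    | Name2 | Name3 | NA    | ... | NA    |
4  | 4/4  | F      | 4   | Name1 | NA    | Name3 | NA    | ... | NA    |
5  | 5/5  | F      | 5   | NA    | NA    | NA    | Name4 | ... | NA    |
6  | 6/6  | M      | 6   | NA    | NA    | NA    | NA    | ... | NA    |
7  | 7/7  | F      | 7   | NA    | NA    | NA    | NA    | ... | NA    |
8  | 8/8  | F      | 8   | NA    | NA    | NA    | NA    | ... | NA    |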
I would like to get something that looks like this:
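ID | Date | Gender | Age | Risk1 | Risk2 | ... | RiskN |
--------------------------------------------------------
1  | 1/1  | F      | 1   | Name2 | Name3 | ... | NameN |
2  | 2/2  | M      | 2   | Name3 | NA    | ... | NA    |
3  | 3/3  | F      | 3   | Name2 | Name3 | ... | NA    |
4  | 4/4  | F      | 4   | Name1 | Name3 | ... | NA    |
5  | 5/5  | F      | 5   | Name4 | NA    | ... | NA    |
6  | 6/6  | M      | 6   | NA    | NA    | ... | NA    |
7  | 7/7  | F      | 7   | NA    | NA    | ... | NA    |
8  | 8/8  | F      | 8   | NA    | NA    | ... | NA    |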
Edit1: Thanks for the answers. Unfortunately, neither of them gives the expected output. I edited the data above to include a few more entries that exist in my data and are being excluded completely. Also, I have 45 variables (Name1, Name2 ... Name45). Based on the second answer I received, I should have only 9 Risk variables left. Sorry for the confusion!
The first answer's output drops all rows like rows 6:8. The remaining data also does not look like the expected output above, but more like:
ID | Date | Gender | Age | RiskName1 | RiskName2 | RiskName3 | RiskName4 | ... | RiskNameN |
--------------------------------------------------------------------------------------------
4  | 4/4  | F      | 4   | Name1     | NA        | Name3     | NA        | ... | NA        |
1  | 1/1  | F      | 1   | NA        | Name2     | Name3     | NA        | ... | NameN     |
3  | 3/3  | F      | 3   | NA        | Name2     | Name3     | NA        | ... | NA        |
2  | 2/2  | M      | 2   | NA        | NA        | Name3     | NA        | ... | NA        |
5  | 5/5  | F      | 5   | NA        | NA        | NA        | Name4     | ... | NA        |
The second answer still drops rows like 6:8. It does better at reducing the large number of columns, but it replaces all the row content with numbers, e.g.:
ID | Date | Gender | Age | Risk1 | Risk2 | Risk3 |
--------------------------------------------------
1  | 1/1  | F      | 1   | 1     | 1     | 1     |
2  | 2/2  | M      | 2   | 1     | 0     | 0     |
3  | 3/3  | F      | 3   | 1     | 1     | 0     |
4  | 4/4  | F      | 4   | 1     | 1     | 0     |
5  | 5/5  | F      | 5   | 1     | 0     | 0     |
Edit2: The data is sensitive, but I created a very similar structure for you to work with. Thanks!
Sample data:
df2 <- structure(list(
  Ref = c("213", "42", "512", "123", "421"),
  Start = structure(c(1541912880, 1541912880, 1541918160, 1541918160, 1542024180),
                    class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  Age = c(1, 7, 8, 6, 3),
  Gender = c("Female", "Male", "Female", "Female", "Female"),
  Ethnicity = c("E2", "E1", "E4", "E1", "E1"),
  Cats = c("cats", "cats", NA_character_, NA_character_, NA_character_),
  Dogs = c(NA_character_, NA_character_, NA_character_, "dogs", NA_character_),
  Iguanas = c(NA_character_, "Iguanas", NA_character_, "Iguanas", NA_character_),
  Coalas = c(NA_character_, NA_character_, NA_character_, NA_character_, NA_character_),
  Ducks = c("ducks", NA_character_, "ducks", NA_character_, NA_character_)
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
How I would like it to look:
Ref | Date       | Gender | Age | Risk1 | Risk2   | Risk3 |
-----------------------------------------------------------
213 | 2018-11-11 | F      | 1   | cats  | ducks   | NA    |
42  | 2018-11-11 | M      | 7   | cats  | Iguanas | NA    |
512 | 2018-11-11 | F      | 8   | ducks | NA      | NA    |
123 | 2018-11-11 | F      | 6   | dogs  | Iguanas | NA    |
421 | 2018-11-12 | F      | 3   | NA    | NA      | NA    |
One option is to gather the 'Name' columns into 'long' format while removing the NAs with na.rm = TRUE, then, grouped by 'ID', create 'Risk' as a sequence column and spread back to 'wide' format:
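library(tidyverse)
# Reshape the Name columns to long format, dropping NA values
gather(df1, Risk, val, starts_with("Name"), na.rm = TRUE) %>%
  group_by(ID) %>%
  # Prefixes the original column name, giving RiskName1, RiskName2, ...
  mutate(Risk = str_c("Risk", Risk)) %>%
  # Back to wide format, one column per Risk label
  spread(Risk, val)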
With the updated dataset:
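df2 %>%
  gather(Risk, val, Cats:Ducks) %>%
  # Preserve the original row order of Ref
  mutate(Ref = factor(Ref, levels = unique(Ref))) %>%
  # Within each Ref, put non-NA values first
  arrange(Ref, is.na(val)) %>%
  group_by(Ref) %>%
  # Keep the non-NA values, or a single NA row when everything is missing
  slice(if (all(is.na(val))) 1 else which(!is.na(val))) %>%
  # Number the remaining values Risk1, Risk2, ...
  mutate(Risk = str_c('Risk', row_number())) %>%
  spread(Risk, val)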
# A tibble: 5 x 7
# Groups: Ref [5]
# Ref Start Age Gender Ethnicity Risk1 Risk2
# <fct> <dttm> <dbl> <chr> <chr> <chr> <chr>
#1 213 2018-11-11 05:08:00 1 Female E2 cats ducks
#2 42 2018-11-11 05:08:00 7 Male E1 cats Iguanas
#3 512 2018-11-11 06:36:00 8 Female E4 ducks <NA>
#4 123 2018-11-11 06:36:00 6 Female E1 dogs Iguanas
#5 421 2018-11-12 12:03:00 3 Female E1 <NA> <NA>
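In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same idea with those functions, assuming tidyr 1.0.0 or later. The data frame rebuilds the question's sample (Start and Ethnicity are left out for brevity), and the names `risks` and `result` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Sample data modeled on the question (Start and Ethnicity omitted for brevity)
df2 <- tibble(
  Ref     = c("213", "42", "512", "123", "421"),
  Age     = c(1, 7, 8, 6, 3),
  Gender  = c("Female", "Male", "Female", "Female", "Female"),
  Cats    = c("cats", "cats", NA, NA, NA),
  Dogs    = c(NA, NA, NA, "dogs", NA),
  Iguanas = c(NA, "Iguanas", NA, "Iguanas", NA),
  Coalas  = rep(NA_character_, 5),
  Ducks   = c("ducks", NA, "ducks", NA, NA)
)

# Long format without the NA values, numbered within each Ref, then wide again
risks <- df2 %>%
  pivot_longer(Cats:Ducks, names_to = "animal", values_to = "val",
               values_drop_na = TRUE) %>%
  group_by(Ref) %>%
  mutate(Risk = paste0("Risk", row_number())) %>%
  ungroup() %>%
  select(Ref, Risk, val) %>%
  pivot_wider(names_from = Risk, values_from = val)

# Join back to the id columns so that rows with no values at all are kept
result <- df2 %>%
  select(-(Cats:Ducks)) %>%
  left_join(risks, by = "Ref")

print(result)
```

The left join is a design choice made here: dropping NAs during the reshape removes Ref 421 entirely, and joining back onto the id columns restores it with NA in every Risk column.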
A similar long-then-wide approach, with data.table:
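library(data.table)
setDT(df)
# Melt using every non-'Name' column as an id variable, dropping NA values
long <- melt(df, which(!names(df) %like% 'Name'), na.rm = TRUE)
# Number the values within each ID and cast back to wide format
dcast(long[, -'variable'], ... ~ paste0('Risk', rowid(ID)))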
# Date Gender Age Risk1 Risk2
# 1: 1/1 F 1 Name2 Name3
# 2: 2/2 M 2 Name3 <NA>
# 3: 3/3 F 3 Name2 Name3
# 4: 4/4 F 4 Name1 Name3
# 5: 5/5 F 5 Name4 <NA>
Data used:
df <- fread('
ID Date Gender Age Name1 Name2 Name3 Name4
1 1/1 F 1 NA Name2 Name3 NA
2 2/2 M 2 NA NA Name3 NA
3 3/3 F 3 NA Name2 Name3 NA
4 4/4 F 4 Name1 NA Name3 NA
5 5/5 F 5 NA NA NA Name4
')
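For comparison, the reshape round trip can be skipped by shifting each row's non-NA values to the left in base R. This is only a sketch under assumptions: the toy data frame mirrors the question's first table (Name1 to Name4 plus one all-NA row), and `shift_left`, `compact`, and `result` are hypothetical names introduced for illustration.

```r
# Toy data mirroring the question's first table, including an all-NA row
df <- data.frame(
  ID     = 1:6,
  Gender = c("F", "M", "F", "F", "F", "M"),
  Name1  = c(NA, NA, NA, "Name1", NA, NA),
  Name2  = c("Name2", NA, "Name2", NA, NA, NA),
  Name3  = c("Name3", "Name3", "Name3", "Name3", NA, NA),
  Name4  = c(NA, NA, NA, NA, "Name4", NA),
  stringsAsFactors = FALSE
)

name_cols <- grep("^Name", names(df), value = TRUE)

# Move the non-NA values of one row to the front and pad the rest with NA
shift_left <- function(x) {
  x <- unname(x)
  c(x[!is.na(x)], rep(NA_character_, sum(is.na(x))))
}

# apply() returns one column per input row, hence the transpose
compact <- t(apply(as.matrix(df[name_cols]), 1, shift_left))

# Keep only the columns that still hold at least one value
keep <- colSums(!is.na(compact)) > 0
compact <- compact[, keep, drop = FALSE]
colnames(compact) <- paste0("Risk", seq_len(ncol(compact)))

result <- cbind(df[setdiff(names(df), name_cols)],
                as.data.frame(compact, stringsAsFactors = FALSE))
print(result)
```

Because nothing is filtered out by row, IDs whose Name columns are all NA stay in the result with NA in every Risk column.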