Collapsing Row Data with Columns (character, numeric, factors, etc.)
I am trying to collapse this data, but I am having trouble. The dataset is huge: more than 100 columns and over 1,000 rows.
This is an example of what the dataset looks like:
https://i.stack.imgur.com/8iZq7.png
I need to be able to collapse the rows together. I cannot add the values inside Lab together because the result could be greater than 1.
I have tried several pieces of code, and none of them work because they don't take into account that I have character, numeric, and timestamp columns in my dataframe.
These are the attempts I have tried, with the errors:
COLLAPSE6 <- setDT(TRIALBJH4)[, lapply(.SD, function(x)
{x <- unique(x[!is.na(x)])
if(length(x) == 1) as.character(x)
else if(length(x) == 0) NA_character_
else collapse=","}),
by=ID]
This just put a comma into the columns (wherever there were multiple values) when I need it to say either 0, 1, or NA.
COLLAPSE3 %>%
group_by(ID) %>%
summarise_all(funs(list(na.omit)))
This just replaced the other columns not listed in the group_by with funs(list(na.omit)); it even replaced the values with it.
bjh_sti_merge1 <- bjh_sti_merg6 %>% group_by (ID) %>%
summarise_each(funs(max(., na.rm = TRUE)))
This doesn't work: it freezes R for me, and I always have to force quit it.
bjh_sti_merg10 <- bjh_sti_merg6 %>% group_by (ID) %>%
summarise(AGE = max(AGE, na.rm=TRUE),
LAB1 = max(LAB1, na.rm=TRUE),
LAB3 = max(LAB3, na.rm=TRUE))
This one doesn't work: it just takes the first row of the duplicated ones (I can't use this because sometimes the first row is NA, and the third row could have 1 in the column). Also, this seems to freeze R when I have more than 20 columns in it.
xx <-function(x) x[!is.na(x)]
bjh_sti_merg7 %>%
group_by(EPIC_MRN) %>%
summarise_all(funs(xx))
This doesn't work. It says: Error: Problem with 'summarise()' input 'LAB1'. x Input 'LAB1' must be size 0 or 1, not 2.
I want the end result to have 1 row per ID. The code needs to work for all columns (character, numeric, timestamps, factors, etc.) and must not freeze RStudio. I was always recommended summarise_each, but that kept freezing my laptop (I let it run for over 2 hours and nothing happened), and yes, I have loaded tidyverse, data.table, and dplyr.
This also needs to accept NA as well!
I would like the dataset to look like: https://i.stack.imgur.com/yBehQ.png
See if this works; it might take some time to run:
plyr::ddply(df, plyr::.(ID), function(x) {
  res <- x[1, ]                        # start from the first row of the group
  if (ncol(x) == 1) return(res)
  for (i in seq_len(ncol(x))) {
    if (!is.numeric(x[[i]])) next      # leave non-numeric columns as in the first row
    res[[i]] <- max(x[[i]], na.rm = TRUE)
  }
  res
})
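One caveat with max(..., na.rm = TRUE), which the loop above relies on: when every value in a group is NA, base R returns -Inf with a warning rather than NA. Since the question asks for NA to be preserved, a small wrapper can guard against that. The sketch below is only an illustration; max_or_na is a hypothetical helper name, not part of the answer or of any package.

```r
# With nothing left after dropping NAs, max() falls back to -Inf (plus a warning).
all_missing <- c(NA_real_, NA_real_)
naive <- suppressWarnings(max(all_missing, na.rm = TRUE))

# Hypothetical helper: return NA when a group has no observed value,
# otherwise the largest observed value.
max_or_na <- function(x) {
  if (all(is.na(x))) return(x[1])   # x[1] is an NA of the same type as x
  max(x, na.rm = TRUE)
}

safe  <- max_or_na(all_missing)     # stays NA instead of becoming -Inf
mixed <- max_or_na(c(NA, 0, 1))     # NA is ignored when a real value exists
```

Replacing the max() call inside the loop with such a wrapper would keep all-NA lab columns as NA in the collapsed row.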
This task should be straightforward. It is not clear to me, though, how you wish to summarize the AGE, TIME, LAB1 and LAB2 columns. For simplicity's sake I have used max(col, na.rm = TRUE).
library(dplyr)
library(tibble)
data <- tibble(
ID = c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7),
SEX = c("M", "M", "M", "F", "F", "M", "M", "F", "F", "M", "M", "F"),
AGE = c(30, 30, 30, 22, 22, 55, 90, 87, 87, 23, 23, 45),
TIME = as.POSIXct(rep("02/19/2019 12:00", 12), format = "%m/%d/%Y %H:%M", tz = ""),
LAB1 = c(0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0),
LAB2 = c(1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1)
)
data <- data %>%
group_by(ID, SEX) %>%
summarize(AGE = max(AGE, na.rm = TRUE),
TIME = max(TIME, na.rm = TRUE),
LAB1 = max(LAB1, na.rm = TRUE),
LAB2 = max(LAB2, na.rm = TRUE))
With this result:
> data
# A tibble: 7 x 6
# Groups: ID [7]
ID SEX AGE TIME LAB1 LAB2
<dbl> <chr> <dbl> <dttm> <dbl> <dbl>
1 1 M 30 2019-02-19 12:00:00 1 1
2 2 F 22 2019-02-19 12:00:00 1 1
3 3 M 55 2019-02-19 12:00:00 1 1
4 4 M 90 2019-02-19 12:00:00 1 1
5 5 F 87 2019-02-19 12:00:00 0 0
6 6 M 23 2019-02-19 12:00:00 1 1
7 7 F 45 2019-02-19 12:00:00 0 1
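Because the real dataset has more than 100 columns, listing each one inside summarize() is impractical. A possible generalization, assuming dplyr 1.0.0 or later (for across()), applies the same max rule to every non-grouping column and keeps NA when a group has no observed value. The toy data below is invented for illustration. Factor columns are not covered here, since max() is not defined for unordered factors; they would need converting with as.character() first.

```r
library(dplyr)

# Invented toy data with character, timestamp, and numeric columns plus NAs.
toy <- tibble(
  ID   = c(1, 1, 2, 2),
  SEX  = c("M", "M", NA, "F"),
  TIME = as.POSIXct(c("2019-02-19 12:00", NA, "2019-02-19 12:00", "2019-02-20 08:30"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  LAB1 = c(NA, 1, NA, NA),
  LAB2 = c(0, NA, 0, 0)
)

collapsed <- toy %>%
  group_by(ID) %>%
  summarise(
    # Per column and group: NA if everything is missing, else the largest value.
    across(everything(), ~ if (all(is.na(.x))) .x[1] else max(.x, na.rm = TRUE)),
    .groups = "drop"
  )
```

This yields one row per ID, and LAB1 for ID 2 stays NA rather than turning into -Inf.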