Collapsing Row Data with Columns (character, numeric, factors, etc.)

I am trying to collapse this data, but I am having trouble. The dataset is large: more than 100 columns and over 1,000 rows.

This is an example of what the dataset looks like:

https://i.stack.imgur.com/8iZq7.png

I need to be able to collapse the rows together. I cannot add the values inside Lab together because the result would be greater than 1.

I have tried several approaches, and none of them work because they do not take into account that my data frame contains character, numeric, and timestamp columns.

These are the attempts I have made, with the problems each one produced:

COLLAPSE6 <- setDT(TRIALBJH4)[, lapply(.SD, function(x)
                      {x <- unique(x[!is.na(x)])
                       if(length(x) == 1) as.character(x)
                       else if(length(x) == 0) NA_character_
                       else collapse=","}),
             by=ID]

This just put a comma into the columns (as if there were multiple values) when I need each one to say 0, 1, or NA.

COLLAPSE3 %>%
  group_by(ID) %>%
  summarise_all(funs(list(na.omit)))

This just replaced the columns not listed in group_by with funs(list(na.omit)); it even replaced the values with it.

bjh_sti_merge1 <- bjh_sti_merg6 %>% group_by (ID) %>%
  summarise_each(funs(max(., na.rm = TRUE)))

This doesn't work: it freezes R, and I always have to force quit it.

bjh_sti_merg10 <- bjh_sti_merg6 %>% group_by (ID) %>%
  summarise(AGE = max(AGE, na.rm=TRUE),
            LAB1 = max(LAB1, na.rm=TRUE),
            LAB3 = max(LAB3, na.rm=TRUE))

This one doesn't work: it just takes the first row of the duplicated ones. I can't use that because sometimes the first row is NA and the third row could have 1 in the column. It also seems to freeze R when I include more than 20 columns.

xx <-function(x) x[!is.na(x)]

bjh_sti_merg7 %>% 
  group_by(EPIC_MRN) %>%
  summarise_all(funs(xx))

This doesn't work. It says: Error: Problem with 'summarise()' input 'LAB1'. x Input 'LAB1' must be size 0 or 1, not 2.

I want the end result to have 1 row per ID. The code needs to work for all columns (character, numeric, timestamps, factors, etc.) and must not freeze RStudio. I was always recommended summarise_each, but that kept freezing my laptop (I let it run for over 2 hours with no result). And yes, I have loaded tidyverse, data.table, and dplyr.

This also needs to accept NA!

I would like the dataset to look like: https://i.stack.imgur.com/yBehQ.png

See if this works; it might take some time to run:

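plyr::ddply(df, plyr::.(ID), function(x) {
  # Start from the first row of the group; non-numeric columns keep its values
  res <- x[1, , drop = FALSE]
  for (i in seq_len(ncol(x))) {
    col <- x[[i]]
    # is.numeric() also covers integer columns, unlike class(col) != "numeric"
    if (!is.numeric(col)) next
    # Group maximum, or NA when every value in the group is missing
    res[[i]] <- if (all(is.na(col))) NA else max(col, na.rm = TRUE)
  }
  res
})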
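The ddply loop above keeps the first row's value for every non-numeric column, even when that value is NA, and it may be slow on a wide table. As a rough alternative, here is a data.table sketch in the spirit of the first attempt in the question. The commas in that attempt most likely came from `else collapse=","`, which evaluates to the string "," rather than collapsing anything. The helper `collapse_values` and the toy columns are assumptions made for illustration, as is the rule itself: maximum for numeric columns, first non-missing value for everything else.

```r
library(data.table)

# Hypothetical helper (not from the original posts): reduce one column of one
# group to a single value without changing the column's type.
collapse_values <- function(x) {
  keep <- x[!is.na(x)]
  # All values missing: x[1] is an NA of the column's own type
  if (length(keep) == 0) return(x[1])
  # Numeric columns take the maximum; all other types take the first non-NA value
  if (is.numeric(keep)) max(keep) else keep[1]
}

# Toy data standing in for the real table (column names are assumed)
dt <- data.table(
  ID   = c(1, 1, 1, 2, 2),
  SEX  = c("M", NA, "M", "F", "F"),
  LAB1 = c(NA, 0, 1, NA, NA)
)

# One row per ID; every non-grouping column goes through the helper
collapsed <- dt[, lapply(.SD, collapse_values), by = ID]
print(collapsed)
```

Because the helper always returns exactly one value of the column's original type, each group yields one row and the column types stay consistent across groups.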

This task should be straightforward. It is not clear to me, though, how you wish to summarize the AGE, TIME, LAB1 and LAB2 columns. For simplicity's sake I have used max(col, na.rm = TRUE).

library(dplyr)
library(tibble)

data <- tibble(
  ID = c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7),
  SEX = c("M", "M", "M", "F", "F", "M", "M", "F", "F", "M", "M", "F"),
  AGE = c(30, 30, 30, 22, 22, 55, 90, 87, 87, 23, 23, 45),
  TIME = as.POSIXct(rep("02/19/2019 12:00", 12), format = "%m/%d/%Y %H:%M", tz = ""),
  LAB1 = c(0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  LAB2 = c(1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1)
)

data <- data %>%
  group_by(ID, SEX) %>%
  summarize(AGE = max(AGE, na.rm = TRUE),
            TIME = max(TIME, na.rm = TRUE),
            LAB1 = max(LAB1, na.rm = TRUE),
            LAB2 = max(LAB2, na.rm = TRUE))

With this result:

> data
# A tibble: 7 x 6
# Groups:   ID [7]
     ID SEX     AGE TIME                 LAB1  LAB2
  <dbl> <chr> <dbl> <dttm>              <dbl> <dbl>
1     1 M        30 2019-02-19 12:00:00     1     1
2     2 F        22 2019-02-19 12:00:00     1     1
3     3 M        55 2019-02-19 12:00:00     1     1
4     4 M        90 2019-02-19 12:00:00     1     1
5     5 F        87 2019-02-19 12:00:00     0     0
6     6 M        23 2019-02-19 12:00:00     1     1
7     7 F        45 2019-02-19 12:00:00     0     1
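Listing each column by hand does not scale to the 100+ columns mentioned in the question. Here is a minimal sketch of the same idea using across(), assuming dplyr 1.0.0 or later. The helpers `max_or_na` and `first_non_na` and the toy data frame are hypothetical. Treating numeric columns with max and every other type with the first non-missing value is an assumption about the desired result, not something the question states.

```r
library(dplyr)

# Hypothetical helper: group maximum, or a typed NA when the whole group is missing
# (avoids the -Inf that max(x, na.rm = TRUE) returns for an all-NA group)
max_or_na <- function(x) if (all(is.na(x))) x[1] else max(x, na.rm = TRUE)

# Hypothetical helper: first non-missing value, or NA when there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Toy data standing in for the real table (column names are assumed)
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  SEX  = c(NA, "M", "M", "F", "F"),
  TIME = as.POSIXct(c(NA, "2019-02-19 12:00", NA, "2019-02-19 12:00", NA),
                    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  LAB1 = c(NA, 0, 1, NA, NA),
  stringsAsFactors = FALSE
)

collapsed <- df %>%
  group_by(ID) %>%
  summarise(
    across(where(is.numeric), max_or_na),      # numeric columns: maximum
    across(!where(is.numeric), first_non_na),  # all other types: first non-NA
    .groups = "drop"
  ) %>%
  select(all_of(names(df)))                    # restore the original column order

print(collapsed)
```

Each helper returns exactly one value per group, which is what summarise() requires; returning every non-NA value instead is what triggers the "must be size 0 or 1" error quoted in the question.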
