
Remove duplicated rows by ID in R data.table, but add a new column with the concatenated dates from another column

I have a large data.table of patient data. I want to delete rows where "id" is duplicated without losing the information in the "date" column.

id  date
01  2004-07-01
02  NA
03  2013-11-15
03  2005-03-15
04  NA
05  2011-07-01
05  2012-07-01

I could do this in one of two ways:

  1. create a column that replaces the date column and concatenates all the dates for that ID, i.e.:

     id  date_new
     01  2004-07-01
     02  NA
     03  2013-11-15; 2005-03-15
     04  NA
     05  2011-07-01; 2012-07-01

or

  2. create one new column for each additional date, i.e.:

     id  date_new    date_new2
     01  2004-07-01  NA
     02  NA          NA
     03  2013-11-15  2005-03-15
     04  NA          NA
     05  2011-07-01  2012-07-01

I have tried a few things, but they keep crashing my R session (I get the message "R Session Aborted. R encountered a fatal error. The session was terminated."):

setkey(DT, "id")
unique_DT <- subset(unique(DT))

and:

DT[!duplicated(DT[, "id", with = FALSE])]

However, besides crashing R, neither of these solutions does what I want with the dates.

Any ideas? I am new to data.table (and R generally), but I have the vague sense that I could solve this with := somehow.

Try this:

dt[,c(date_new=paste(date,collapse="; "),.SD),by=id]
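
If the goal is exactly one row per id, a minimal sketch that returns only the collapsed column could look like the following. The table DT rebuilds the question's example, and collapse_dates is a hypothetical helper introduced here for illustration; it assumes the dates are stored as character and that an id with no dates should stay NA rather than become the string "NA".

```r
library(data.table)

# Toy data mirroring the question (assumed to be character dates)
DT <- data.table(
  id   = c("01", "02", "03", "03", "04", "05", "05"),
  date = c("2004-07-01", NA, "2013-11-15", "2005-03-15", NA,
           "2011-07-01", "2012-07-01")
)

# Hypothetical helper: join the non-missing dates, keep NA if there are none
collapse_dates <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_character_ else paste(x, collapse = "; ")
}

# Grouping by id and returning a single value gives one row per id
collapsed <- DT[, .(date_new = collapse_dates(date)), by = id]
print(collapsed)
```

Because j returns a single value per group, the duplicated ids disappear without a separate unique() or duplicated() step.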

You can use the aggregate function, and it should do what you want. I had some trouble with the dates being converted to factors, but wrapping the date column in I() seems to keep it as character.

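# Example data; missing dates are stored here as the literal string "NA"
id <- c(1, 2, 3, 3, 4, 5, 5)
date <- c("2004-07-01", "NA", "2013-11-15", "2005-03-15", "NA",
          "2011-07-01", "2012-07-01")
data <- as.data.frame(list(id = id, date = date))

# Make sure the dates are character, not factor
data$date <- as.character(data$date)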

aggregate(list(date = I(data$date)),by=list(id = data$id),c)

  id                   date
1  1             2004-07-01
2  2                     NA
3  3 2013-11-15, 2005-03-15
4  4                     NA
5  5 2011-07-01, 2012-07-01

Edit: this version still uses aggregate, but with paste instead of c. Setting the collapse argument to ";" should solve the separator problem.

newdata = aggregate(list(date = I(data$date)),
                    by=list(id = data$id),
                    function(x){paste(unique(x),collapse=";")})
newdata


  id                  date
1  1            2004-07-01
2  2                    NA
3  3 2013-11-15;2005-03-15
4  4                    NA
5  5 2011-07-01;2012-07-01
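
The second layout from the question (one column per date) is not covered above. A rough sketch with data.table's rowid() and dcast() might look like this; the helper columns slot and col are names made up for the example, and the data is the question's toy table with character dates assumed.

```r
library(data.table)

# Toy data mirroring the question (assumed to be character dates)
DT <- data.table(
  id   = c("01", "02", "03", "03", "04", "05", "05"),
  date = c("2004-07-01", NA, "2013-11-15", "2005-03-15", NA,
           "2011-07-01", "2012-07-01")
)

# Number the dates within each id: 1 for the first row, 2 for the second, ...
DT[, slot := rowid(id)]

# Turn that position into the target column name (date_new, date_new2, ...)
DT[, col := ifelse(slot == 1L, "date_new", paste0("date_new", slot))]

# Spread to wide format; ids with fewer dates get NA in the extra columns
wide <- dcast(DT, id ~ col, value.var = "date")
print(wide)
```

Note that the new columns come out in alphabetical order of their names, so with ten or more dates per id they may need reordering (for example with setcolorder).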
