How to remove duplicated rows with less data with R?

Let's say I have the following data table (as data):

row,or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source
1,VA1,VA2,2014-05-24,,0,0,2124,2014-05-22 15:50:16,,,,2014-05-22 12:20:03,tp
2,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,tp
3,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,A1,,,2014-05-22 12:20:03,tp
4,VA1,VA2,2014-06-05,,0,0,2124,2014-05-22 15:48:24,,,,2014-05-22 12:20:03,tp
5,VA1,VA2,2014-06-09,,0,0,2124,2014-05-22 15:37:35,,,,2014-05-22 12:20:03,tp
6,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp
7,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp

I would like to delete duplicated rows. If I do data <- unique(data, by = NULL), then only the last row (row 7) is deleted, but I would like to delete row 2 as well. I can define keys with setkey():

setkey(data, row,or,d,ddate,rdate,changes,class,price,fdate,number,minutes,added,source)

Then it will delete either row 2 or row 3. But I would like to delete the rows that have less data and keep the rows with more data. I.e., in the case above, row 2 should be deleted, but row 3 should remain, since it has an additional value in the column company. How can I do it?

How about this:

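library(data.table)

# dt is the data.table from the question (named data there)
# whatever the important columns are for your uniqueness criterion
important.cols <- c('or','d','ddate','rdate','changes','class','price','fdate')

# within each group of duplicates, pick the row with the max number of
# non-empty elements (NA counts as empty because of na.rm = TRUE)
dt[, .SD[which.max(rowSums(.SD != "", na.rm = TRUE))], by = important.cols]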
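To see what this does on the sample data from the question, here is a small reproducible sketch. It is only an illustration, and it makes one assumption that is not in the original post: the table is read with fread() and colClasses = "character", so that every empty field becomes "" and the comparison against "" behaves the same for all columns. The name res is just a placeholder for the result.

```r
library(data.table)

# Sample table from the question, read from inline text.
# Assumption: all columns are read as character so empty fields are "".
dt <- fread(text = "row,or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source
1,VA1,VA2,2014-05-24,,0,0,2124,2014-05-22 15:50:16,,,,2014-05-22 12:20:03,tp
2,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,tp
3,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,A1,,,2014-05-22 12:20:03,tp
4,VA1,VA2,2014-06-05,,0,0,2124,2014-05-22 15:48:24,,,,2014-05-22 12:20:03,tp
5,VA1,VA2,2014-06-09,,0,0,2124,2014-05-22 15:37:35,,,,2014-05-22 12:20:03,tp
6,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp
7,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp
", colClasses = "character")

# Columns that define when two rows count as duplicates of each other.
important.cols <- c("or", "d", "ddate", "rdate", "changes", "class", "price", "fdate")

# Per group, keep the row whose remaining columns have the most non-empty values.
res <- dt[, .SD[which.max(rowSums(.SD != "", na.rm = TRUE))], by = important.cols]

print(res$row)
# "1" "3" "4" "5" "6"
```

Row 2 is dropped in favour of row 3, which has a value in company. Rows 6 and 7 are equally filled, and since which.max() returns the first maximum, row 6 is the one that is kept.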
