
Select the rows with the most available information in a data frame with R

I would like to select the rows that contain the most information in a data frame. The data frame is generated automatically, so the number of columns grows over time.

The data look like this:

Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   0   0   1   3
111111  0   0   1   3   1   3   1   3
222222  3   4   0   0   3   4   3   4
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2
33333   1   2   1   2   1   2   0   0

and the result should be:

Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   1   3   1   3
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2

The idea is to select, for each player, the row with the most complete information. I consider a 0 to be incomplete information.

You mentioned that the data frame is generated automatically, so the number of columns grows over time. Are you trying to do the grouping in real time?

The data.table approach below groups by the Player column and, within each group, selects the maximum value of each column. It works for the representative example you gave. It is similar to the answer @arun gave to the question "Group by one column, select row with minimum in one column for every pair of columns in R".

library(data.table)
dt <- as.data.table(df)
# Value columns to process (every column except Player)
my_cols <- c("V1","F1","V2","F2","V3","F3","V4","F4")

# mget() returns the columns named in my_cols as a list, and
# lapply(..., which.max) finds the position of the maximum in each one.
# Map(`[`, ...) then extracts that element from each column.

# General data.table form: DT[i, j, by].
# A missing i means "on all rows";
# the expression in j is evaluated once per group of Player.
dt[, Map(`[`, mget(my_cols), lapply(mget(my_cols), which.max)), by = Player]
#    Player V1 F1 V2 F2 V3 F3 V4 F4
# 1: 111111  0  0  1  3  1  3  1  3
# 2: 222222  3  4  3  4  3  4  3  4
# 3:  33333  1  2  1  2  1  2  1  2
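One thing that may be worth checking with a column-wise approach like this: the maximum is taken for each column independently within a group, so the result can combine values from different rows. The base R sketch below illustrates this on a hypothetical two-row group (the `toy` data frame is made up for illustration and is not from the question), using `aggregate()` as a stand-in for the grouped data.table call.

```r
# Hypothetical group in which each row fills a different column
toy <- data.frame(Player = c(1, 1), V1 = c(1, 0), F1 = c(0, 2))

# Column-wise maximum per player (base R stand-in for the grouped which.max)
col_max <- aggregate(. ~ Player, data = toy, FUN = max)
print(col_max)

# Does the combined row occur anywhere in the original data?
is_original_row <- any(toy$V1 == col_max$V1 & toy$F1 == col_max$F1)
print(is_original_row)
```

Here the combined row is V1 = 1, F1 = 2, which is not one of the two input rows. For the example in the question this does not matter, because each player has one row that is at least as large in every column.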

As already pointed out by @Imo and @evan058, it's not clear what "most complete information" means. I assume you consider a 0 to be missing information, so that "most complete" refers to the row with the fewest 0 entries for each player:

This snippet should then do the job:

library(plyr)
newData <- ldply(unique(data$Player), function(player) {
  tmp <- data[data$Player == player,]
  tmp[which.max(rowSums(tmp[,-1] != 0)),]
})
print(newData)
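For comparison, the same idea can be sketched in base R without plyr. This is only a minimal sketch under the same assumption (a 0 means missing information); the names `value_cols`, `n_filled`, `keep` and `result` are chosen here for illustration, and the data frame is typed in from the example in the question. Taking every column except Player as a value column means nothing has to change when new columns appear.

```r
# Sample data from the question
data <- data.frame(
  Player = c(111111, 111111, 222222, 222222, 33333, 33333),
  V1 = c(0, 0, 3, 3, 1, 1),
  F1 = c(0, 0, 4, 4, 2, 2),
  V2 = c(1, 1, 0, 3, 1, 1),
  F2 = c(3, 3, 0, 4, 2, 2),
  V3 = c(0, 1, 3, 3, 1, 1),
  F3 = c(0, 3, 4, 4, 2, 2),
  V4 = c(1, 1, 3, 3, 1, 0),
  F4 = c(3, 3, 4, 4, 2, 0)
)

# Every column except Player counts as a value column
value_cols <- setdiff(names(data), "Player")

# Number of non-zero entries in each row
n_filled <- rowSums(data[value_cols] != 0)

# Within each player, keep the first row with the highest count
keep <- unlist(lapply(split(seq_len(nrow(data)), data$Player),
                      function(idx) idx[which.max(n_filled[idx])]))

# Restore the original row order and reset the row names
result <- data[sort(keep), ]
rownames(result) <- NULL
print(result)
```

If two rows of a player tie on the number of non-zero entries, `which.max` keeps the first one, which is the same behaviour as in the plyr snippet.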
