How to subset a data.frame by a row in the same data.frame in R?

I would like to subset my data.frame so that it only returns the rows that have at least 50% of their values <= the values in a specific row in the data.frame.

df

Name   A   B   C   D
r1     2   2   2   2
r2     4   3   1   3
r3     1   1   1   2
r4     3   3   3   1

The specific row I am trying to subset by is row r1. I only want to return row r3, since it is the only other row where at least 50% of the values are <= the corresponding values in row r1 (in fact all four of its values are).

df

Name   A   B   C   D
r3     1   1   1   2

Any help will be appreciated. Please let me know if further information is needed.

Add up the number of conditions met on a row-by-row basis using "+" and compare to 3 (that is, 3 of the 4 columns, or 75%; use 2 for the 50% threshold in the question):

subset(df, ( (A <= A[1]) + (B <= B[1]) + (C <= C[1]) + (D <= D[1]) ) >= 3 )

> subset(df, ( (A <= A[1]) + (B <= B[1]) + (C <= C[1]) + (D <= D[1]) ) >= 3 )
  Name A B C D
1   r1 2 2 2 2
3   r3 1 1 1 2

If you also want to remove 'r1', append [-1, ] to the call, which drops the first row of the result.
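As a reproducible sketch of the steps above: the data.frame below is rebuilt by hand from the table in the question (storing the columns as plain numerics is an assumption), and the names hits and res are only illustrative.

```r
# Rebuild the example data from the question (column types are an assumption).
df <- data.frame(
  Name = c("r1", "r2", "r3", "r4"),
  A = c(2, 4, 1, 3),
  B = c(2, 3, 1, 3),
  C = c(2, 1, 1, 3),
  D = c(2, 3, 2, 1)
)

# Each comparison yields TRUE/FALSE; adding them counts the conditions met per row.
hits <- with(df, (A <= A[1]) + (B <= B[1]) + (C <= C[1]) + (D <= D[1]))

# Keep rows meeting at least 3 of the 4 conditions, then drop the first row (r1).
res <- df[hits >= 3, ][-1, ]
print(res)
#   Name A B C D
# 3   r3 1 1 1 2
```

Dropping with [-1, ] relies on r1 being the first row of the result, which holds here because the reference row always satisfies every condition against itself.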

This can be generalized to produce a numeric vector that can be tested against a percentage criterion; it gives the number of items in each row that are less than or equal to their counterparts in the first row. I needed to unlist the first row because passing a single-row data frame as the third argument failed:

rowSums(sweep(df[-1], 2, unlist(df[1,-1]), "<="))
[1] 4 1 4 1
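The count vector can be wrapped in a small function that takes the percentage as an argument. This is only a sketch: filter_by_row and its arguments (ref, prop, id_col) are hypothetical names, and it assumes every column other than id_col is numeric.

```r
# Hypothetical helper: keep rows where at least `prop` of the numeric columns
# are <= the corresponding value in the reference row `ref`.
filter_by_row <- function(d, ref = 1, prop = 0.5, id_col = 1) {
  vals <- d[-id_col]                      # drop the identifier column
  ref_vals <- unlist(vals[ref, ])         # reference row as a plain vector
  n_ok <- rowSums(sweep(vals, 2, ref_vals, "<="))
  keep <- n_ok >= prop * ncol(vals)       # percentage criterion
  keep[ref] <- FALSE                      # exclude the reference row itself
  d[keep, ]
}

# Example data rebuilt from the question (an assumption about column types).
df <- data.frame(
  Name = c("r1", "r2", "r3", "r4"),
  A = c(2, 4, 1, 3),
  B = c(2, 3, 1, 3),
  C = c(2, 1, 1, 3),
  D = c(2, 3, 2, 1)
)

out <- filter_by_row(df, ref = 1, prop = 0.5)
print(out)
#   Name A B C D
# 3   r3 1 1 1 2
```

Excluding the reference row inside the function avoids relying on its position in the result.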

Below is a demonstration:

df2 <- cbind(nms = paste0("r", 1:10),
             as.data.frame(matrix(sample(1:10, 200, replace = TRUE), 10)))
df2
#--------------
   nms V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1   r1  8  6 10  7  3  7  2  8  4   9   9   4   5   4   8   7   2   1   6   4
2   r2  3  9  6  3  9 10  6 10 10   3   3   2   4   4   4  10   3   5   2   1
3   r3  1  7  6  8  3  5  2  3  1   5   5   4   8   3   1   6   2  10   3   7
4   r4  2  6 10 10  8  7  9  1  4   5   6   7   2   6   8   3   5  10  10   3
5   r5  5  5  7  2  5 10  2  9  2   9   4   6   1   5   8   5   8   6   3   5
6   r6  4  1  7  7  6  9  6  3  4   3   2   9   4   8  10   3   4   4  10   4
7   r7  7  1 10  4  1  2  8  5  8   8   5   5   5   6   4  10   6   9  10   6
8   r8 10  8  1  4  1  4 10  3  1   3  10   3   4   9   4   7   4   9   2   2
9   r9  3 10  9  1 10  8  8  4  7   2   7   2   9  10   3   3   7   4  10   1
10 r10  4  7  3  3  1  9  4  1  9   5   3   9   9   3   9   2   9  10   2   4
#-----------------
rowSums(sweep(df2[-1], 2, unlist(df2[1,-1]), "<="))
# [1] 20 11 15 12 12 11 11 13 10 11
rowSums(sweep(df2[-1], 2, unlist(df2[1,-1]), "<=")) >= 20*0.75
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

 df2[ rowSums(sweep(df2[-1], 2, unlist(df2[1,-1]), "<=")) >= 20*0.75 , ]
#---------
  nms V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1  r1  8  6 10  7  3  7  2  8  4   9   9   4   5   4   8   7   2   1   6   4
3  r3  1  7  6  8  3  5  2  3  1   5   5   4   8   3   1   6   2  10   3   7

It's occurred to me that an apply solution would probably have seemed more obvious to some R programmers:

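 colSums( apply(df2[-1], 1, "<=", unlist(df2[1,-1])) ) >= (ncol(df2) - 1)*0.75

Note the need to use colSums: apply returns its results in column-oriented fashion, so each row of the input becomes a column of the output, which is sometimes a puzzle to beginneRs. The threshold uses ncol(df2) - 1 so that the nms column is not counted.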
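A tiny illustration of that column orientation, using a made-up 2 x 3 matrix (m and ref are not from the original data):

```r
m <- matrix(1:6, nrow = 2)       # 2 rows, 3 columns: rows are (1, 3, 5) and (2, 4, 6)
ref <- c(2, 3, 4)                # values each row is compared against

res <- apply(m, 1, "<=", ref)    # compare every row of m with ref
print(dim(res))                  # 3 2: one COLUMN per original row
print(colSums(res))              # 2 1: count of TRUEs for each original row
```

Wrapping the result in t() would restore the original orientation, after which rowSums gives the same counts.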

Here is a generic solution that also works for a larger number of variables, such as 34:

Assumption: in the dataset, we compare every column except the first, which stores Name.

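> col_names <- colnames(df)[-1]

> # Position of the reference row, and the positions of all other rows
> index <- which(df$Name == 'r1')
> values <- setdiff(seq_len(nrow(df)), index)

> # A row qualifies when at least half of its values are <= the reference row
> min_val <- length(col_names) / 2
> row_num <- integer(0)
> for (i in values) {
+   if (sum(df[i, col_names] <= df[index, col_names]) >= min_val)
+     row_num <- c(row_num, i)
+ }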

> df[row_num,]
  Name A B C D
3   r3 1 1 1 2

If the dataset is large, though, this row-by-row loop may take some time. You can improve the performance with the help of the data.table package.
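As one possible way to avoid the explicit loop without extra packages, the comparison can be done on a matrix in a single step. Note that this is a base R sketch, not the data.table approach mentioned above; it assumes all compared columns are numeric, and the names m, below and keep are illustrative.

```r
# Example data rebuilt from the question (an assumption about column types).
df <- data.frame(
  Name = c("r1", "r2", "r3", "r4"),
  A = c(2, 4, 1, 3),
  B = c(2, 3, 1, 3),
  C = c(2, 1, 1, 3),
  D = c(2, 3, 2, 1)
)

col_names <- colnames(df)[-1]
index <- which(df$Name == "r1")

m <- as.matrix(df[col_names])
# t(m) holds one row of df per column, so the reference vector is recycled
# down each column; transposing back gives one row of results per row of df.
below <- t(t(m) <= m[index, ])

keep <- rowMeans(below) >= 0.5   # at least 50% of the values are <= the reference
keep[index] <- FALSE             # leave out the reference row itself
result <- df[keep, ]
print(result)
#   Name A B C D
# 3   r3 1 1 1 2
```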
