Identify duplicated rows based on multiple columns and specific value in another column in very large matrix with for loop

Question

I have a large matrix called data of 10,864 rows and 134 columns.

The first 4 columns are parameters that together make every row unique. Columns 5 to 134 hold numbers between 1 and 20 for every row.

I am running a for loop over the matrix to insert NA into certain cells. A cell should be set to NA when its row is duplicated on the columns OrgID, rank and score(i), but only if the value in the same row of column score(i+12) is not 1.

So I loop over columns 5 to 134 and, wherever there is duplication on these three columns and the value in column score(i+12) is not equal to 1, I insert NA into that cell of the matrix.

for(i in 5:ncol(data)){
    data[which(duplicated(data[,c(1,4,i)]) & (data[,i+12])!=1),i] <- "NA"
    }

This code, however, gives the wrong output: it inserts NA wherever there is a duplicated value based on the 1st, 4th and i-th columns, i.e. the result is equivalent to running the following code:

for(i in 5:ncol(data)){
    data[which(duplicated(data[,c(1,4,i)])),i] <- "NA"
    }

How do I make it perform the required operation only when the value in column score(i+12) is not 1 in the duplicated rows?

To make the failed output easier to see, I have highlighted a few rows and the relevant columns to show how this works when applied to column 118, i.e. i = 118.

For example, based on the logic explained above, there is duplication for OrgID=5659. The duplication check on OrgID, rank and score118 identifies 2 rows, one with score130=1 and the other with score130=16. According to the logic, the row with score130=16 should now show NA, but it remains unchanged at 16.

Answer 1

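Maybe you can try the loop below. duplicated() on its own flags only the second and later copies of a row, so it is combined with duplicated(..., fromLast = TRUE) to flag every row in a duplicated group. The loop stops 12 columns before the end so that column i + 12 always exists, and it sets the cell in column i + 12 to NA, which matches the score130 example in the question. Note that data[c(1,4,i)] and data[[i + 12]] assume data is a data frame; for a matrix, use data[, c(1,4,i)] and data[, i + 12] instead.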

for(i in 5:(ncol(data) - 12)) {
   inds <- duplicated(data[c(1,4,i)]) | duplicated(data[c(1,4,i)], fromLast = TRUE)
   data[inds & data[[i + 12]] != 1, i + 12] <- NA
}
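
To see the effect of the loop above on something small, here is a minimal sketch on made-up data. The helper name mask_duplicates, the toy values and the column names a, b, score_i and score_j are all hypothetical and not from the question. The offset is a parameter (1 here instead of 12) only so that the toy table can stay narrow. The indexing uses data[, cols, drop = FALSE], which should behave the same for a data frame and a matrix.

```r
# Hypothetical helper: same idea as the loop above, with the key columns,
# the first score column and the offset passed in as arguments.
mask_duplicates <- function(data, key_cols = c(1, 4), first = 5, offset = 12) {
  last <- ncol(data) - offset
  if (last < first) return(data)  # nothing to do if no column has a partner
  for (i in first:last) {
    keys <- data[, c(key_cols, i), drop = FALSE]
    # duplicated() alone marks only the 2nd and later copies of a row;
    # adding the fromLast = TRUE pass marks every member of the group.
    in_group <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
    partner <- data[, i + offset]
    # Blank the partner cell only when it is present and not equal to 1.
    hit <- in_group & !is.na(partner) & partner != 1
    data[hit, i + offset] <- NA
  }
  data
}

# Made-up rows: the first two share OrgID, rank and score_i.
toy <- data.frame(
  OrgID   = c(5659, 5659, 1001, 1002),
  a       = c(1, 2, 3, 4),
  b       = c(1, 2, 3, 4),
  rank    = c(2, 2, 1, 1),
  score_i = c(7, 7, 3, 3),
  score_j = c(16, 1, 5, 1)
)

result <- mask_duplicates(toy, offset = 1)
print(result$score_j)  # NA 1 5 1
```

In this toy case the row holding 16 comes first, so a plain duplicated() would flag only the second row (whose partner value is 1) and nothing would change. With the fromLast pass added, the first row is flagged as well and its 16 becomes NA.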
