Identify duplicated rows based on multiple columns and specific value in another column in very large matrix with for loop

I have a large matrix called data of 10,864 rows and 134 columns.

The first 4 columns are parameters which make every row unique. Columns 5 through 134 hold numbers between 1 and 20 in every row.


I am running a for loop over the matrix to insert NA into certain cells. A cell should be set to NA when its row is duplicated on the columns OrgID, rank and score(i), but only if the value in the same row of column score(i+12) is not 1.

Hence, I loop over columns 5 to 134, and wherever a row is duplicated on these three columns and its value in the score(i+12) column is not equal to 1, I insert NA into that cell of the matrix.

for(i in 5:ncol(data)){
data[which(duplicated(data[,c(1,4,i)]) & (data[,i+12])!=1),i] <- "NA"
} 

This code, however, gives the wrong output: it inserts NA wherever there is a duplicated value based on the 1st, 4th and i-th columns, i.e. it gives the same result as running the following code:

for(i in 5:ncol(data)){
    data[which(duplicated(data[,c(1,4,i)])),i] <- "NA"
    }   

How do I make it perform the required operation only when the value in column score(i+12) != 1 in the duplicated rows?

To make the failed output easier to see, I have highlighted a few rows and the relevant columns to show how this works when applied to column 118, i.e. i = 118 here.

[Image: enter image description here]

For example, based on the logic explained above, there is duplication for OrgID=5659. The duplication based on OrgID, rank and score118 identifies these 2 rows, one with score130=1 and the other with score130=16. Hence, score130 in the row with score130=16 should now be NA according to the logic. But it remains unchanged at 16.

Maybe you can try

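# Assumes `data` is a data frame: data[c(1,4,i)] and data[[i + 12]] select columns.
for(i in 5:(ncol(data) - 12)) {
   # Flag every row whose (column 1, column 4, column i) combination occurs more than once
   inds <- duplicated(data[c(1,4,i)]) | duplicated(data[c(1,4,i)], fromLast = TRUE)
   # Among those rows, set column i + 12 to NA unless it already equals 1
   data[inds & data[[i + 12]] != 1, i + 12] <- NA
}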
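
To see what the `fromLast = TRUE` part changes, here is a minimal sketch on a toy data frame. The column names `score_a` and `score_b` and all values are made up for illustration (they stand in for score(i) and score(i+12)); this is not the asker's data. It suggests one plausible reason the original loop left the 16 untouched: `duplicated()` on its own flags only the second and later copies of a key, so a first occurrence is never selected.

```r
# Toy data frame; names and values are hypothetical, not the asker's data.
# "score_a" plays the role of score(i), "score_b" the role of score(i+12).
toy <- data.frame(
  OrgID   = c(5659, 5659, 5659, 7001),
  rank    = c(1, 1, 2, 1),
  score_a = c(3, 3, 3, 5),
  score_b = c(16, 1, 7, 9)
)

# The three columns that define a duplicate.
key <- toy[c("OrgID", "rank", "score_a")]

# duplicated() alone flags only the second and later copies of a key.
later_only <- duplicated(key)

# OR-ing with fromLast = TRUE flags every row whose key occurs more than once.
all_dups <- later_only | duplicated(key, fromLast = TRUE)

# Blank out score_b for rows with a duplicated key, unless score_b equals 1.
toy$score_b[all_dups & toy$score_b != 1] <- NA

print(later_only)
print(all_dups)
print(toy)
```

Here rows 1 and 2 share the same key. `later_only` marks only row 2, whose `score_b` is 1, so nothing would be replaced; `all_dups` marks both rows, and row 1 (where `score_b` was 16) becomes NA.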
