
R data.table testing the rowwise equality of a vector of column indices

I have a data table that is similar to the following but has hundreds of columns and millions of rows:

Name  A_1  A_2  A_3  B_1  C_1  B_2  D_1  D_2  B_3  C_2  C_3  
one   1    1    1    3     5   2    1     1   2    5    5    
two   40   40   40   2     6   2    4     4   2    6    6    
three 20   20   20   5     7   5    6     6   5    8    9    
four  30   30   31   1     6   1    2     2   1    6    6    

I want to do a row-wise comparison of equality for all variables sharing the same prefix, but am not entirely sure of the best way to do so. So far I have pulled a vector of column indices for each prefix from the column names, using the code below:

combo1<-grep(pattern="^A", x=colnames(data))
combo2<-grep(pattern="^B", x=colnames(data))
combo3<-grep(pattern="^C", x=colnames(data))
combo4<-grep(pattern="^D", x=colnames(data))

I believe the next step would be to iterate through the columns of the data table in each vector and compare the row values to see if they are equal, but I'm not sure of the best way to do this.

I would like the end result to add one column per prefix group that indicates whether all columns in that group are equal on each row, as follows:

Name  A_1  A_2  A_3  B_1  C_1  B_2  D_1  D_2  B_3  C_2  C_3  A_equal  B_equal  C_equal  D_equal
one   1    1    1    3     5   2    1     1   2    5    5    TRUE     FALSE    TRUE     TRUE
two   40   40   40   2     6   2    4     4   2    6    6    TRUE     TRUE     TRUE     TRUE
three 20   20   20   5     7   5    6     6   5    8    9    TRUE     TRUE     FALSE    TRUE
four  30   30   31   1     6   1    2     2   1    6    6    FALSE    TRUE     TRUE     TRUE

What is the best way to go about this?

Here is an option with split.default, where we split the data.frame into a list of smaller data.frames based on the prefix in the column names (i.e. the column name with the _ and the trailing digits removed). Then loop over the list with lapply, compare every column in a group with the first column of that group, and check whether the rowSums of the matches equal the number of columns. If a row holds a single unique value across the group, the result is TRUE, otherwise FALSE.

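# split the non-Name columns by prefix, then flag rows where every column matches the first
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
               function(x) rowSums(x == x[[1]]) == ncol(x))
# add one logical column per prefix, e.g. A_equal, B_equal
df1[paste0(names(lst1), "_equal")] <- lst1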

-output

df1
#   Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
#1   one   1   1   1   3   5   2   1   1   2   5   5    TRUE   FALSE    TRUE    TRUE
#2   two  40  40  40   2   6   2   4   4   2   6   6    TRUE    TRUE    TRUE    TRUE
#3 three  20  20  20   5   7   5   6   6   5   8   9    TRUE    TRUE   FALSE    TRUE
#4  four  30  30  31   1   6   1   2   2   1   6   6   FALSE    TRUE    TRUE    TRUE
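
To make the comparison step concrete, here is a minimal sketch of what happens inside lapply for a single group, using only the B columns of the example data. The names b, same_as_first and B_equal are illustrative and not part of the answer above.

```r
# Toy copy of the B columns from the example data (illustrative only)
b <- data.frame(B_1 = c(3L, 2L, 5L, 1L),
                B_2 = c(2L, 2L, 5L, 1L),
                B_3 = c(2L, 2L, 5L, 1L))

# Compare every column with the first one: a logical matrix, one row per record
same_as_first <- b == b[[1]]

# A row is "all equal" when every comparison in that row is TRUE,
# i.e. the number of TRUEs equals the number of columns
B_equal <- rowSums(same_as_first) == ncol(b)
B_equal
```

Note that a missing value in a group makes rowSums return NA for that row, so the flag is NA rather than FALSE. Whether that is the desired behaviour depends on the data.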

Or an option with tidyverse, where we reshape the data into 'long' format with pivot_longer, group_by 'Name', check for each prefix column whether it holds a single unique element (n_distinct), and join the output with the original dataset by 'Name'.

library(dplyr)
library(tidyr)
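# reshape so that each prefix (A, B, C, D) becomes one column, one row per Name and suffix
pivot_longer(df1, cols = -Name, names_to = c(".value", "grp"),
             names_sep = "_") %>%
  group_by(Name) %>%
  # TRUE when a prefix has exactly one distinct non-missing value within a Name
  summarise(across(A:D, ~ n_distinct(.x, na.rm = TRUE) == 1,
                   .names = "{.col}_equal"),
            .groups = "drop") %>%
  left_join(df1, ., by = "Name")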

-output

#  Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
#1   one   1   1   1   3   5   2   1   1   2   5   5    TRUE   FALSE    TRUE    TRUE
#2   two  40  40  40   2   6   2   4   4   2   6   6    TRUE    TRUE    TRUE    TRUE
#3 three  20  20  20   5   7   5   6   6   5   8   9    TRUE    TRUE   FALSE    TRUE
#4  four  30  30  31   1   6   1   2   2   1   6   6   FALSE    TRUE    TRUE    TRUE
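
The role of na.rm = TRUE is easier to see from the intermediate 'long' table. The sketch below uses a small, illustrative subset of the example data (the names toy and long are assumptions for this example only) in which the B group has three columns and the D group has two.

```r
library(tidyr)

# Small subset of the example data (B and D columns only, for illustration)
toy <- data.frame(Name = c("one", "two"),
                  B_1 = c(3L, 2L), B_2 = c(2L, 2L), B_3 = c(2L, 2L),
                  D_1 = c(1L, 4L), D_2 = c(1L, 4L))

# ".value" turns the part before "_" into column names;
# the part after "_" is stored in the 'grp' column
long <- pivot_longer(toy, cols = -Name,
                     names_to = c(".value", "grp"), names_sep = "_")
long
```

Because there is no D_3 column, D is NA in the rows where grp is "3". Without na.rm = TRUE, that padding NA would count as an extra distinct value and the D flag would come out FALSE.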

Or with data.table, where the logic is similar to the tidyverse option: we use melt instead of pivot_longer to reshape into 'long' format, group by 'Name', loop over the Subset of Data.table (.SD) with lapply, count the unique values with uniqueN, convert to logical with == 1, and join on the 'Name' column.

library(data.table)
# melt each prefix group into one value column, flag single-valued groups per Name, then join back
setDT(df1)[melt(df1, measure.vars = patterns('^A_\\d+$', '^B_\\d+$', '^C_\\d+$', '^D_\\d+$'),
   value.name = paste0(LETTERS[1:4], '_equal'))[,
   lapply(.SD, function(x) uniqueN(x, na.rm = TRUE) == 1),
   by = .(Name), .SDcols = patterns('equal$')], on = .(Name)]
#    Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
#1:   one   1   1   1   3   5   2   1   1   2   5   5    TRUE   FALSE    TRUE    TRUE
#2:   two  40  40  40   2   6   2   4   4   2   6   6    TRUE    TRUE    TRUE    TRUE
#3: three  20  20  20   5   7   5   6   6   5   8   9    TRUE    TRUE   FALSE    TRUE
#4:  four  30  30  31   1   6   1   2   2   1   6   6   FALSE    TRUE    TRUE    TRUE
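
Since the question starts from grep-ed column groups and mentions millions of rows, a variant that skips the reshape and adds the flag columns by reference may also be worth trying. This is only a sketch and has not been benchmarked; it is not part of the answer above, and the names dt, prefixes, cols and flag are illustrative. It assumes every column other than Name follows the prefix_number naming pattern.

```r
library(data.table)

# Toy data.table with two prefixes (illustrative subset of the example data)
dt <- data.table(Name = c("one", "two", "three", "four"),
                 A_1 = c(1L, 40L, 20L, 30L), A_2 = c(1L, 40L, 20L, 30L),
                 A_3 = c(1L, 40L, 20L, 31L),
                 B_1 = c(3L, 2L, 5L, 1L), B_2 = c(2L, 2L, 5L, 1L))

# Derive the prefixes from the column names instead of typing them by hand
prefixes <- unique(sub("_\\d+$", "", setdiff(names(dt), "Name")))

for (p in prefixes) {
  # Columns belonging to this prefix, e.g. A_1, A_2, A_3
  cols <- grep(paste0("^", p, "_\\d+$"), names(dt), value = TRUE)
  # Compare each remaining column with the first one and AND the results;
  # the all-TRUE start value covers groups with a single column
  flag <- Reduce(`&`,
                 lapply(cols[-1], function(cn) dt[[cn]] == dt[[cols[1]]]),
                 rep(TRUE, nrow(dt)))
  # Add the flag column by reference, without copying the table
  set(dt, j = paste0(p, "_equal"), value = flag)
}
dt
```

As with the rowSums version, a missing value in a group can give NA instead of TRUE or FALSE for that row.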

data

df1 <- structure(list(Name = c("one", "two", "three", "four"), A_1 = c(1L, 
40L, 20L, 30L), A_2 = c(1L, 40L, 20L, 30L), A_3 = c(1L, 40L, 
20L, 31L), B_1 = c(3L, 2L, 5L, 1L), C_1 = c(5L, 6L, 7L, 6L), 
    B_2 = c(2L, 2L, 5L, 1L), D_1 = c(1L, 4L, 6L, 2L), D_2 = c(1L, 
    4L, 6L, 2L), B_3 = c(2L, 2L, 5L, 1L), C_2 = c(5L, 6L, 8L, 
    6L), C_3 = c(5L, 6L, 9L, 6L)), class = "data.frame", row.names = c(NA, 
-4L))
