I have a data table that is similar to the following but has hundreds of columns and millions of rows:
Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3
one 1 1 1 3 5 2 1 1 2 5 5
two 40 40 40 2 6 2 4 4 2 6 6
three 20 20 20 5 7 5 6 6 5 8 9
four 30 30 31 1 6 1 2 2 1 6 6
I want to do a row-wise equality comparison across all variables that share the same prefix, but I am not sure of the best way to do so. So far I have pulled a vector of column indices for each prefix from the column names, using the code below:
combo1<-grep(pattern="^A", x=colnames(data))
combo2<-grep(pattern="^B", x=colnames(data))
combo3<-grep(pattern="^C", x=colnames(data))
combo4<-grep(pattern="^D", x=colnames(data))
I believe the next step would be to iterate through the columns of the data table in each vector and compare the row values to see if they are equal, but I'm not sure of the best way to do this.
I would like the end result to have one added column per prefix that indicates whether all columns with that prefix are equal in each row, as follows:
Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
one 1 1 1 3 5 2 1 1 2 5 5 TRUE FALSE TRUE TRUE
two 40 40 40 2 6 2 4 4 2 6 6 TRUE TRUE TRUE TRUE
three 20 20 20 5 7 5 6 6 5 8 9 TRUE TRUE FALSE TRUE
four 30 30 31 1 6 1 2 2 1 6 6 FALSE TRUE TRUE TRUE
What is the best way to go about this?
Here is an option with split.default, where we split the data.frame into a list of smaller data.frames based on the pattern in the column names (i.e. the substring that is left after removing the _ and the digits that follow it). Then loop over the list of data.frames with lapply, compare every column of each subset with its first column, take the rowSums of that comparison, and check whether it equals the number of columns: if a row holds a single unique value across the group, the result is TRUE, otherwise FALSE.
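# split the columns (all but 'Name') by prefix, then flag rows where
# every column in the group matches the group's first column
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
               function(x) rowSums(x == x[, 1]) == ncol(x))
# add one <prefix>_equal column per group
df1[paste0(names(lst1), "_equal")] <- lst1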
-output
df1
# Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
#1 one 1 1 1 3 5 2 1 1 2 5 5 TRUE FALSE TRUE TRUE
#2 two 40 40 40 2 6 2 4 4 2 6 6 TRUE TRUE TRUE TRUE
#3 three 20 20 20 5 7 5 6 6 5 8 9 TRUE TRUE FALSE TRUE
#4 four 30 30 31 1 6 1 2 2 1 6 6 FALSE TRUE TRUE TRUE
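To make the intermediate steps of the split.default option easier to follow, here is a minimal sketch on a cut-down copy of the data. The names df_demo, key, parts and flags are made up for illustration and are not part of the original code.

```r
# Cut-down, hypothetical copy of the sample data so the sketch runs on its own
df_demo <- data.frame(Name = c("one", "two"),
                      A_1 = c(1L, 40L), A_2 = c(1L, 40L),
                      B_1 = c(3L, 2L),  B_2 = c(2L, 2L))

# Grouping key: column names with the trailing "_<digits>" removed
key <- sub("_\\d+$", "", names(df_demo)[-1])
key
# "A" "A" "B" "B"

# split.default() splits the columns (not the rows) by that key
parts <- split.default(df_demo[-1], key)
names(parts)
# "A" "B"

# One logical vector per prefix: TRUE where every column in the group
# equals the group's first column
flags <- lapply(parts, function(x) rowSums(x == x[[1]]) == ncol(x))
df_demo[paste0(names(flags), "_equal")] <- flags
df_demo
```

Note that if a group contains NA, the comparison returns NA for that row, so the flag will be NA rather than TRUE or FALSE.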
Or an option with tidyverse, where we reshape the data into 'long' format with pivot_longer, then do a group_by on 'Name', check across the reshaped columns whether each one holds a single unique element (n_distinct), and join the output with the original dataset by 'Name'.
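library(dplyr)
library(tidyr)
# reshape so that each prefix (A, B, C, D) becomes one column
pivot_longer(df1, cols = -Name, names_to = c(".value", 'grp'),
             names_sep = "_") %>%
  group_by(Name) %>%
  # TRUE when a prefix has exactly one distinct value for that Name
  summarise(across(A:D, ~ n_distinct(.x, na.rm = TRUE) == 1,
                   .names = '{.col}_equal'),
            .groups = 'drop') %>%
  # join the flags back onto the original wide data
  left_join(df1, ., by = "Name")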
-output
# Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
#1 one 1 1 1 3 5 2 1 1 2 5 5 TRUE FALSE TRUE TRUE
#2 two 40 40 40 2 6 2 4 4 2 6 6 TRUE TRUE TRUE TRUE
#3 three 20 20 20 5 7 5 6 6 5 8 9 TRUE TRUE FALSE TRUE
#4 four 30 30 31 1 6 1 2 2 1 6 6 FALSE TRUE TRUE TRUE
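With hundreds of prefixes, spelling out A:D is not practical. A possible variation, shown here as a sketch on a cut-down copy of the data, selects every reshaped column except the helper column grp. The names df_demo, long, flags and res are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Cut-down, hypothetical copy of the sample data so the sketch runs on its own
df_demo <- data.frame(Name = c("one", "two"),
                      A_1 = c(1L, 40L), A_2 = c(1L, 40L),
                      B_1 = c(3L, 2L),  B_2 = c(2L, 2L))

# Long format: one column per prefix, plus 'grp' holding the numeric suffix
long <- pivot_longer(df_demo, cols = -Name,
                     names_to = c(".value", "grp"), names_sep = "_")

# across(-grp) picks up every prefix column; the grouping column 'Name'
# is excluded from across() automatically
flags <- long %>%
  group_by(Name) %>%
  summarise(across(-grp, ~ n_distinct(.x, na.rm = TRUE) == 1,
                   .names = "{.col}_equal"),
            .groups = "drop")

res <- left_join(df_demo, flags, by = "Name")
res
```

This assumes 'Name' uniquely identifies each row; if it does not, rows sharing a name would be pooled in the group_by step.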
Or with data.table. The logic is similar to the tidyverse option: we use melt instead of pivot_longer to reshape into 'long' format, group by 'Name', loop over the Subset of Data.table (.SD) with lapply, count the unique values with uniqueN, convert that count to a logical with == 1, and join back on the 'Name' column.
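library(data.table)
# melt one value column per prefix, flag groups with a single unique value
# per 'Name', then join the flags back onto the wide data
setDT(df1)[melt(df1, measure.vars = patterns('^A_\\d+$', '^B_\\d+$', '^C_\\d+$', '^D_\\d+$'),
                value.name = paste0(LETTERS[1:4], '_equal'))[,
             lapply(.SD, function(x) uniqueN(x, na.rm = TRUE) == 1),
             .(Name), .SDcols = patterns('equal$')], on = .(Name)]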
# Name A_1 A_2 A_3 B_1 C_1 B_2 D_1 D_2 B_3 C_2 C_3 A_equal B_equal C_equal D_equal
#1: one 1 1 1 3 5 2 1 1 2 5 5 TRUE FALSE TRUE TRUE
#2: two 40 40 40 2 6 2 4 4 2 6 6 TRUE TRUE TRUE TRUE
#3: three 20 20 20 5 7 5 6 6 5 8 9 TRUE TRUE FALSE TRUE
#4: four 30 30 31 1 6 1 2 2 1 6 6 FALSE TRUE TRUE TRUE
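Since the real table has millions of rows, reshaping to 'long' format may be costly in memory. As an alternative sketch (not one of the options above), we could stay in wide format and add each flag by reference with set(). The names dt, prefixes, cols, first and all_equal are made up for illustration, and the prefixes are assumed to contain no regex metacharacters.

```r
library(data.table)

# Cut-down, hypothetical copy of the sample data so the sketch runs on its own
dt <- data.table(Name = c("one", "two"),
                 A_1 = c(1L, 40L), A_2 = c(1L, 40L),
                 B_1 = c(3L, 2L),  B_2 = c(2L, 2L))

# Derive the prefixes from the column names instead of hard-coding them
prefixes <- unique(sub("_\\d+$", "", setdiff(names(dt), "Name")))

for (p in prefixes) {
  # columns belonging to this prefix, e.g. A_1, A_2, ...
  cols <- grep(paste0("^", p, "_\\d+$"), names(dt), value = TRUE)
  first <- dt[[cols[1]]]
  all_equal <- rep(TRUE, nrow(dt))
  # compare every remaining column of the group with the first one
  for (cl in cols[-1]) {
    all_equal <- all_equal & (dt[[cl]] == first)
  }
  # add the flag column by reference, without copying the table
  set(dt, j = paste0(p, "_equal"), value = all_equal)
}
dt
```

Unlike the uniqueN(x, na.rm = TRUE) version, this comparison yields NA for a row when a compared value is missing and no mismatch has been found, so the NA handling would need adjusting if missing values should be ignored.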
Sample data used in the examples above:
df1 <- structure(list(Name = c("one", "two", "three", "four"),
    A_1 = c(1L, 40L, 20L, 30L), A_2 = c(1L, 40L, 20L, 30L),
    A_3 = c(1L, 40L, 20L, 31L), B_1 = c(3L, 2L, 5L, 1L),
    C_1 = c(5L, 6L, 7L, 6L), B_2 = c(2L, 2L, 5L, 1L),
    D_1 = c(1L, 4L, 6L, 2L), D_2 = c(1L, 4L, 6L, 2L),
    B_3 = c(2L, 2L, 5L, 1L), C_2 = c(5L, 6L, 8L, 6L),
    C_3 = c(5L, 6L, 9L, 6L)), class = "data.frame",
    row.names = c(NA, -4L))