Divide rows into groups given the similarity between them

Question

Given this example data frame:

DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0, 0.9))

I am trying to assign each row to a group of adjacent rows, based on their similarity. I would like to use a cutoff of 0.35, meaning that consecutive rows with values c(1, 0.85, 0.7) can be assigned to one group, but c(0, 1, 0) cannot. Differences between columns are not important, i.e. c(1, 1, 1) in one column and c(0, 0, 0) in another could still be assigned to one group. However, if the rows in one column meet the criterion (e.g. c(1, 1, 1)) but the rows in another column do not (e.g. c(1, 0, 1)), the row is invalid.

Here is the desired output for the example I gave above:

[1]  1  1  1  2  2 NA NA

I am currently applying the abs(diff()) function to determine the difference between the values, and then for each row I take the largest value (adding 1 at the beginning to account for the first row):

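# Absolute row-to-row differences per column (named diffs to avoid masking base::diff)
diffs <- apply(DF, MARGIN = 2, function(x) abs(diff(x)))
# Largest difference per row; 1 is prepended as a placeholder for the first row
max_diff <- c(1, apply(diffs, MARGIN = 1, function(x) max(x, na.rm = TRUE)))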

max_diff
[1] 1.00 0.15 0.10 0.90 0.10 0.90 0.90

I am stuck at this point and not sure of the best way to proceed with the group assignment. I initially tried converting max_diff into a logical vector (max_diff < 0.35) and then running a for loop that groups all the TRUEs together. This has a couple of problems:

My dataset has millions of rows, so the for loop takes ages.
I "ignore" the first member of each group, e.g. I would not consider the first row a member of the first group, because its max_diff value of 1 gives FALSE. I don't want to ignore anything.

I will be very grateful for any advice on how to proceed in an efficient way.

PS. The way of determining the difference between sites is not crucial - here it is just a difference of 0.35 but this is very flexible. All I am after is an adjustable method of finding similar rows.

Answer 1

You could run a hierarchical cluster analysis and experiment with different cut heights h.

cl <- hclust(dist(DF))
DF$group <- cutree(cl, h=.5)

DF
#      x    y   z group
# 1 1.00 0.00 0.0     1
# 2 0.85 0.00 0.0     1
# 3 0.90 0.10 0.0     1
# 4 0.00 0.90 0.9     2
# 5 0.00 1.00 0.9     2
# 6 0.90 0.90 0.0     3
# 7 0.95 0.97 0.9     4
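
To compare several cutoffs side by side, cutree also accepts a vector of heights and returns a matrix with one column of group labels per height. A minimal sketch, where the heights 0.2, 0.5 and 1 are arbitrary values chosen for illustration:

```r
# Example data from the question
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0, 0.9))

# Hierarchical clustering on Euclidean distances (the hclust/dist defaults)
cl <- hclust(dist(DF))

# Illustrative heights only; pick values that suit your data
heights <- c(0.2, 0.5, 1)

# One column of group labels per cut height
cuts <- cutree(cl, h = heights)
cuts
```

Each column can then be inspected or tabulated with table() to see how many groups a given height produces.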

A dendrogram helps to determine h.

plot(cl)
abline(h=.5, col=2)
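
Note that hclust does not take row order into account, so rows that are far apart in the data frame can end up in the same cluster. If groups must consist of adjacent rows, as in the question, a loop-free alternative is to start a new run wherever the largest row-to-row difference reaches the cutoff. The following is only a sketch under two assumptions that are mine rather than stated rules: each row is compared only with the row directly before it, and a run containing a single row is not a group and gets NA. The names cutoff, step_diff, run_id and group are made up for this example.

```r
# Example data from the question
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0, 0.9))

cutoff <- 0.35  # adjustable threshold

# Largest absolute change from the previous row across all columns.
# pmax is vectorised, which avoids a row-wise apply() on millions of rows.
step_diff <- do.call(pmax, unname(lapply(DF, function(col) abs(diff(col)))))

# A new run starts at row 1 and wherever the change reaches the cutoff
run_id <- cumsum(c(TRUE, step_diff >= cutoff))

# Assumption: runs of length 1 are not groups and are set to NA
run_len <- tabulate(run_id)
keep <- run_len[run_id] > 1

# Renumber the remaining runs consecutively as 1, 2, ...
group <- rep(NA_integer_, nrow(DF))
group[keep] <- match(run_id[keep], unique(run_id[keep]))

group
# [1]  1  1  1  2  2 NA NA
```

Because the first row always opens a run, it is no longer dropped from the first group. Swapping step_diff for another row-to-row dissimilarity measure leaves the rest of the sketch unchanged.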
