
Divide rows into groups given the similarity between them

Given this example data frame:

DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
             y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97), 
             z = c(0, 0, 0, 0.9, 0.9, 0.0, 0.9 ))

I am trying to assign each row to a group of adjacent rows, based on how similar they are. I would like to use a cutoff of 0.35, meaning that consecutive rows with values c(1, 0.85, 0.7) can be assigned to one group, but c(0, 1, 0) cannot. Differences between columns do not matter, i.e. a column of c(1, 1, 1) and a column of c(0, 0, 0) can still belong to one group. However, if the rows in one column meet the criterion (e.g. c(1, 1, 1)) but the rows in another column do not (e.g. c(1, 0, 1)), the row is invalid.

Here is the desired output for the example I gave above:

[1]  1  1  1  2  2 NA NA

I am currently using abs(diff()) to compute the difference between consecutive values in each column, and then for each row I take the largest value across columns (prepending 1 to account for the first row):

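# Absolute difference between each row and the previous one, per column
diffs <- apply(DF, MARGIN = 2, function(x) abs(diff(x)))
# Largest difference across columns; the leading 1 stands in for the first row
max_diff <- c(1, apply(diffs, MARGIN = 1, function(x) max(x, na.rm = TRUE)))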

max_diff
[1] 1.00 0.15 0.10 0.90 0.10 0.90 0.90

I am stuck at this point and not sure how best to proceed with the group assignment. I initially tried converting max_diff into a logical vector (max_diff < 0.35) and then running a for loop that groups consecutive TRUEs together. This has a couple of problems:

  1. My dataset has millions of rows, so the for loop takes ages.
  2. I "ignore" the first member of each group. For example, the first row would not be considered a member of the first group, because its max_diff value of 1 gives FALSE. I don't want to ignore anything.
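
The loop described above can be avoided by labelling runs with vectorised operations. The following is a minimal sketch rather than a tested solution for the full dataset. It assumes that a new run starts whenever the largest change from the previous row reaches the cutoff, and that runs consisting of a single row are left as NA. The names cutoff, step, run_id and run_len are introduced here for illustration only.

```r
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0.0, 0.9))
cutoff <- 0.35  # assumed threshold, adjust as needed

# Absolute change from the previous row, per column (n - 1 rows)
d <- abs(diff(as.matrix(DF)))
# Largest change across columns for each step, computed column-wise with pmax
step <- Reduce(pmax, as.data.frame(d))

# A new run starts at row 1 and wherever the step reaches the cutoff
run_id <- cumsum(c(TRUE, step >= cutoff))
# Number of rows in the run each row belongs to
run_len <- ave(run_id, run_id, FUN = length)
# Renumber runs of two or more rows as 1, 2, ...; single-row runs become NA
group <- match(run_id, unique(run_id[run_len > 1]))

group
# [1]  1  1  1  2  2 NA NA
```

Because row 1 always opens a run, the first row of each group is kept as a member instead of being dropped by the placeholder value of 1.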

I will be very grateful for any advice on how to proceed in an efficient way.

PS. The way of determining the difference between sites is not crucial - here it is just a difference of 0.35 but this is very flexible. All I am after is an adjustable method of finding similar rows.

You could run a hierarchical cluster analysis and experiment with different cutoff heights h.

cl <- hclust(dist(DF))
DF$group <- cutree(cl, h=.5)

DF
#      x    y   z group
# 1 1.00 0.00 0.0     1
# 2 0.85 0.00 0.0     1
# 3 0.90 0.10 0.0     1
# 4 0.00 0.90 0.9     2
# 5 0.00 1.00 0.9     2
# 6 0.90 0.90 0.0     3
# 7 0.95 0.97 0.9     4
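
To compare several cutoffs side by side, cutree() also accepts a vector for h and then returns one column of group labels per cutoff. This is a small sketch of that idea; the three heights used here are arbitrary values chosen for illustration, not recommendations.

```r
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0.0, 0.9))

# Complete-linkage clustering on Euclidean distances (the hclust/dist defaults)
cl <- hclust(dist(DF))

# Illustrative cutoffs only; one column of labels is returned per height
heights <- c(0.2, 0.5, 1)
groups <- cutree(cl, h = heights)

groups
```

Each column can then be checked against the groups you expect before settling on a single value of h.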

A dendrogram helps to choose h.

plot(cl)
abline(h=.5, col=2)

