
R: identify two populations in a scatterplot

I am comparing two rasters with a simple cell-by-cell scatter plot, and find that I have two seemingly different populations:

(Image: scatter plot of the real data)

Now I am trying to extract the locations of each of these populations (for example, by isolating the row IDs) so I can see where they fall in the rasters and maybe understand why I get this behavior. Here is a reproducible example:
X <- seq(1, 1000, 1)
Z <- runif(1000, 1, 2)   # random multiplier between 1 and 2
A <- 1.2 * X * Z + 100   # upper population
B <- 0.6 * X * Z         # lower population
df <- data.frame(X = c(X, X), Y = c(A, B))
plot(df$X, df$Y)
(Image: sample scatter plot)
Also, my original data has some 1,000,000 rows, so the solution needs to support a large data frame as well. Any ideas on how I can isolate each of these groups?
Thanks

Spectral clustering is useful for identifying clusters of points that have a clear boundary. A great advantage is that it is unsupervised, i.e. it does not rely much on human judgement, although the method is slow and some hyperparameters (e.g. the number of clusters) need to be supplied.

Below is the code for clustering. It takes a few minutes to run in your case.

library(kernlab)
specc_df <- specc(as.matrix(df),centers = 2)
plot(df, col = specc_df)

The resulting plot clearly shows two clusters of points.

(Image: two clearly separated groups of points)

Your data has a linear separating line. You can find it with:

plot(df$X,df$Y)
Pts = locator(2)

Click one point between the two groups down near the origin and another between the groups on the far right. With your data I got:

Pts
$x
[1]   0.8066296 994.9723687
$y
[1]   48.56932 1255.32870

## Slope
(Pts$y[2] - Pts$y[1]) / (Pts$x[2] - Pts$x[1])
[1] 1.213841

## Draw the line to confirm 
abline(48,1.2, col="red")

## use the line to distinguish the groups
Group = rep(1, nrow(df))
Group[df$X*1.2 + 48 < df$Y] = 2
plot(df, pch=20, col=Group)

(Image: resulting plot)
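
To get at the row IDs the question asks for, the same line can be turned into index vectors. The following is a minimal sketch under stated assumptions, not a definitive implementation: it rebuilds the example data (the `set.seed(1)` call is an assumption, added only for repeatability), derives the slope and intercept from the two clicked points shown above instead of rounding them by hand, and splits the row numbers by which side of the line each point falls on. The names `slope`, `intercept`, `rows_by_group`, `upper_rows` and `lower_rows` are illustrative.

```r
# Rebuild the example data from the question.
set.seed(1)  # assumption: any seed works; fixed here only for repeatability
X <- seq(1, 1000, 1)
Z <- runif(1000, 1, 2)
df <- data.frame(X = c(X, X), Y = c(1.2 * X * Z + 100, 0.6 * X * Z))

# The two points clicked with locator(2), copied from the output above.
Pts <- list(x = c(0.8066296, 994.9723687),
            y = c(48.56932, 1255.32870))

# Line through the two points: Y = intercept + slope * X
slope <- diff(Pts$y) / diff(Pts$x)
intercept <- Pts$y[1] - slope * Pts$x[1]

# Vectorized comparison: 2 = above the line, 1 = on or below it.
Group <- ifelse(df$Y > intercept + slope * df$X, 2L, 1L)

# Row numbers of df belonging to each group.
rows_by_group <- split(seq_len(nrow(df)), Group)
upper_rows <- rows_by_group[["2"]]
lower_rows <- rows_by_group[["1"]]
```

Because the classification is a single vectorized comparison, it should scale to a data frame of around a million rows without trouble. If the rows of `df` are in raster cell order, these indices can be used to look up the corresponding cells.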
