
R: Find out how many points lie on a straight line per group

I am trying to find a solution to the following problem:

How many points per group lie on a straight line?

I could not find any solution for this problem in R.

Below is some sample data, followed by a plot to show what it looks like:

data <- structure(list(Group = c(22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 11553L, 11553L, 11553L, 11553L, 11553L, 7059L, 7059L, 
7059L, 7059L, 22782L), x = c(100L, 150L, 250L, 287L, 312L, 387L, 
475L, 550L, 837L, 937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L, 
1662L, 1700L, 1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L, 
4762L, 5362L, 5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L, 
7800L, 7937L, 7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L, 
9312L, 9400L, 9600L, 4637L, 900L, 4187L, 5800L, 7075L, 1125L, 
3400L, 3562L, 3462L, 5412L), y = c(493L, 482L, 479L, 476L, 481L, 
479L, 474L, 480L, 480L, 491L, 489L, 490L, 485L, 485L, 485L, 479L, 
482L, 482L, 482L, 482L, 484L, 489L, 491L, 489L, 496L, 498L, 500L, 
0L, 498L, 500L, 502L, 506L, 497L, 0L, 495L, 506L, 497L, 494L, 
498L, 500L, 496L, 499L, 496L, 495L, 495L, 498L, 825L, 284L, 850L, 
360L, 790L, 861L, 883L, 882L, 881L, 502L)), row.names = c(23L, 
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 
37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 
51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 
64L, 65L, 66L, 67L, 68L, 69L, 281L, 312L, 313L, 315L, 316L, 377L, 
378L, 380L, 511L, 815L), class = "data.frame")

The data consist of a group column (3 groups in this case) and x and y coordinates:

 Group   x   y
22782 100 493
22782 150 482
22782 250 479
22782 287 476
22782 312 481

Below is a plot of group 22782:

[Figure: scatter plot of group 22782]

As you can see, many points lie almost exactly on the same line, and I would like to find out how many points per group satisfy this condition.

The expected output would look like this:

  Group Max Points  
  22782  20

I would appreciate any help or tips! Thanks

Because we do not know which y-values the grid lines in the ggplot sit at, we first need to find out which breaks ggplot sets by default. This has been answered elsewhere, and my code uses that approach (labeling::extended).

The following function reports how many points lie on the lines per group. You can set a tolerance value that determines how far a point may deviate from a line and still count. Note also that points may lie on different lines, as in ggplot(subset(data, Group == 22782), aes(x=x,y=y)) + geom_point(), where the points lie on two different lines (0 and 500).

For this case you can decide whether you want the total number of points on any line, or only the number of points on the single line that gathers the most points (here, the number of points at 500). You choose between the two with any_or_max_line: "max" returns the count for the fullest line, and any other value returns the sum over all lines.
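The difference between the two modes can be seen on a small example. This is only a sketch in base R: the vector y, the line positions in line_values, and the tolerance are made-up illustration values, not the question's data.

```r
# Hypothetical toy values: most points near y = 500, two at y = 0, one stray point
y <- c(493, 482, 0, 498, 0, 502, 250)
line_values <- c(0, 500)   # assumed line positions
tolerance <- 50            # assumed tolerance

# Count, for each line, the points within +/- tolerance of it
points_per_line <- sapply(line_values, function(v) sum(abs(y - v) <= tolerance))
names(points_per_line) <- line_values

any_line <- sum(points_per_line)  # points on any line: 6
max_line <- max(points_per_line)  # points on the fullest line: 4
```

Note that the sum counts a point twice if the tolerance bands of two neighbouring lines overlap, that is, if the tolerance is more than half the spacing between lines.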

The function

points.on.lines <- function(data, tolerance, any_or_max_line) {
  # run the code below once per group
  sapply(unique(data$Group), function(group_i) {
    # select the i-th group
    data_group_i <- subset(data, Group == group_i)
    # find the y-values at which the lines sit
    line_values <-
      with(data_group_i,
           labeling::extended(range(y)[1], range(y)[2], m = 5))
    # count, per line, how many points lie on or around that line
    points_on_lines <- sapply(line_values, function(line_values_i) {
      sum(data_group_i$y >= line_values_i - tolerance &
            data_group_i$y <= line_values_i + tolerance)
    })
    # use either the line with the most points ("max") or all points on any line
    if (any_or_max_line == "max") {
      points_on_lines <- max(points_on_lines)
    } else {
      points_on_lines <- sum(points_on_lines)
    }
    # name the result by group
    names(points_on_lines) <- paste0("Group_", group_i)
    return(points_on_lines)
  })
}

Example

points.on.lines(data= data, tolerance= 50,
                any_or_max_line= "max")
Group_22782 Group_11553  Group_7059 
     45           3           4 

Let's assume you know that only a minority of the points are not on the line. You also mention that you only want to consider horizontal lines.

In that case, you can use the median as a robust estimate of the horizontal line's position. You could use the mean, but it may be swayed by extreme values that are not on the line anyway.
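To illustrate why the median is the safer choice here, consider a minimal base R sketch. The values in y are invented for illustration (five points near 480 to 495 plus two outliers at 0), and the tolerance of 10 is an assumption.

```r
# Hypothetical toy values: a horizontal band near 485 plus two outliers at 0
y <- c(480, 482, 485, 490, 495, 0, 0)
tolerance <- 10  # assumed tolerance

line_median <- median(y)  # 482, stays inside the band
line_mean <- mean(y)      # about 347, dragged down by the two zeros

# Points within the tolerance of each estimate
on_median <- abs(y - line_median) <= tolerance
on_mean <- abs(y - line_mean) <= tolerance

sum(on_median)  # 4 points are counted as on the line
sum(on_mean)    # 0 points: the mean lands where no point lies
```

As long as fewer than half of the points are outliers, the median stays inside the band, which is the assumption this approach relies on.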

The code is self-explanatory:

library(dplyr)

tolerance <- 10

data %>%
  group_by(Group) %>%
  mutate(y_line = median(y), 
         on_line = abs(y - y_line) <= tolerance) %>%
  count(Group, on_line)

Result:

#   Group on_line     n
#   <int> <lgl>   <int>
# 1  7059 FALSE       1
# 2  7059 TRUE        3
# 3 11553 FALSE       4
# 4 11553 TRUE        1
# 5 22782 FALSE      13
# 6 22782 TRUE       34

You can of course pipe that into filter(on_line) to keep only the counts of points that are on the line.
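A possible version of that extended pipeline is sketched below. The data frame toy and the column name max_points are hypothetical, chosen only to keep the example small and to mirror the output format asked for in the question.

```r
library(dplyr)

# Hypothetical toy data: two groups, each with one point far from the rest
toy <- data.frame(
  Group = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  y     = c(480, 482, 485, 0, 880, 882, 300)
)
tolerance <- 10  # assumed tolerance

on_line_counts <- toy %>%
  group_by(Group) %>%
  mutate(on_line = abs(y - median(y)) <= tolerance) %>%
  ungroup() %>%
  filter(on_line) %>%                 # keep only points on the line
  count(Group, name = "max_points")   # one row per group

on_line_counts
```

One caveat of filtering before counting: a group with no point on its line would drop out of the result entirely rather than appear with a count of 0.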

To me this looks like an interval optimisation problem (or, more generally, clustering of one-dimensional data), unless you have fixed breaks or lines. One way I can think of to solve such a problem is Jenks natural breaks optimization, which is already implemented in R in the package BAMMtools.

You basically first fix the lines, and then see which points belong to which line (the closest line).
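The "closest line" step can be sketched on its own in base R, independent of how the lines were obtained. The values of y and line_values below are hypothetical and only serve to illustrate the assignment.

```r
# Hypothetical y-values and already-fixed line positions
y <- c(0, 5, 360, 495, 500, 506, 880)
line_values <- c(0, 360, 506, 883)

# Distance of every point (rows) to every line (columns)
dist_to_lines <- abs(outer(y, line_values, "-"))

# Index of the closest line for each point
line_id <- apply(dist_to_lines, 1, which.min)

# Number of points assigned to each line, including empty lines
counts <- table(factor(line_id, levels = seq_along(line_values)))
counts
max(counts)  # size of the fullest line: 3
```

Unlike a tolerance band, this assigns every point to some line, however far away it is, so outliers still count towards their nearest line.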

One parameter you have to set is the number of lines (or rather clusters), in the function getJenksBreaks.

There might be other methods to cluster those points, but here is the Jenks approach:

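library(BAMMtools)
library(dplyr)
library(ggplot2)

# assumption: mydata is the sample data frame from the question
mydata <- data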
lines <- getJenksBreaks(mydata$y, 5)
lines
# [1]   0   0 360 506 883
mydata <- mydata %>% 
  rowwise() %>% 
  mutate(line_id = as.character(which.min(abs(y-unique(lines))))) 

mydata %>% 
  group_by(Group, line_id) %>% 
  summarise(cnt =n()) %>% 
  group_by(Group) %>% 
  summarise(max_points = max(cnt))
# 
# # A tibble: 3 x 2
#   Group max_points
#   <int>      <dbl>
# 1  7059          4
# 2 11553          3
# 3 22782         45

mydata %>% 
  #filter(Group == 22782) %>% 
  ggplot(aes(x,y, color = line_id)) + 
  geom_point() +
  geom_hline(yintercept = lines, 
             color = 'red', 
             #alpha = 0.5, 
             linetype ='dashed', 
             size = 0.3) +
  facet_grid(.~Group)

