Grouping values with condition

Suppose I have the following sorted data:

[1] 0.06997360 0.09154285 0.10607553 0.10607631 0.10652390 0.10857691
[7] 0.10858633 0.10858633 0.10870369 0.18790912 0.18792473 0.19509030
[13] 0.20040993 0.22548593 0.22550167 0.22593338 0.22893103 0.23196562
[19] 0.25901025 0.30231663 0.30245132 0.30246287 0.34893528 0.34938294
[25] 0.34943865 0.45200544 0.45658957 0.45673815 0.46432602 0.48493226
[31] 0.58318915 0.58618472 0.66311458 0.66311774 0.69777062 0.69782017
[37] 0.70456743 0.70754232 0.71668759 0.71744870 0.71780786 0.77227751
[43] 0.79785928 0.79823416 0.79831632 0.79832545 0.79863813 0.79880086
[49] 0.91610076 0.91611498 0.91611830 0.91612582 0.91612582 0.91614856

Now I want to group them because many of them are almost equal. Say two consecutive values x[i] and x[i+1] belong to the same group if x[i+1] - x[i] < 0.01, so the 3rd through the 9th values would form one group. The values in each group should then all be replaced with (for example) their mean. I don't know how to do this. Does anybody have a good idea for how to approach it, or does a function for this problem already exist?

Here's how to do that. I'm using dplyr to summarise by group. First, I calculate a diff column using a lag of one. Then, I create a condition column that is 1 if diff > 0.01 and 0 otherwise. Then, I use cumsum on that column to assign group numbers; coalesce deals with the NA that lag produces in the first row. Using the groups, you can now summarise.

x <- c(0.06997360, 0.09154285, 0.10607553, 0.10607631, 0.10652390, 0.10857691,
0.10858633,0.10858633,0.10870369,0.18790912,0.18792473,0.19509030,
0.20040993,0.22548593,0.22550167,0.22593338,0.22893103,0.23196562,
0.25901025,0.30231663,0.30245132,0.30246287,0.34893528,0.34938294,
0.34943865,0.45200544,0.45658957,0.45673815,0.46432602,0.48493226,
0.58318915,0.58618472,0.66311458,0.66311774,0.69777062,0.69782017,
0.70456743,0.70754232,0.71668759,0.71744870,0.71780786,0.77227751,
0.79785928,0.79823416,0.79831632,0.79832545,0.79863813,0.79880086,
0.91610076,0.91611498,0.91611830,0.91612582,0.91612582,0.91614856)

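library(dplyr)
as.data.frame(x) %>%
  mutate(diff = x - lag(x),                          # gap to the previous value (NA in row 1)
         condition = (diff > 0.01) * 1,              # 1 where a new group starts, else 0
         group = cumsum(coalesce(condition, 0))) %>% # running count of group starts
  group_by(group) %>%
  summarise(x_mean = mean(x)) %>%
  as.data.frame()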

   group     x_mean
1      0 0.06997360
2      1 0.09154285
3      2 0.10758986
4      3 0.19283352
5      4 0.22756353
6      5 0.25901025
7      6 0.30241027
8      7 0.34925229
9      8 0.45741479
10     9 0.48493226
11    10 0.58468694
12    11 0.66311616
13    12 0.70852067
14    13 0.77227751
15    14 0.79836237
16    15 0.91612237
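
To see what each intermediate column holds, here is a minimal sketch on a made-up six-value vector (toy is not part of the original data). It uses base R stand-ins for lag and coalesce so that it runs without dplyr, and the variable names are illustrative assumptions.

```r
# Made-up sorted values: three close ones, two close ones, one on its own
toy <- c(0.100, 0.105, 0.109, 0.200, 0.201, 0.350)

step_diff <- c(NA, diff(toy))        # plays the role of x - lag(x); first entry is NA
condition <- (step_diff > 0.01) * 1  # 1 where the gap exceeds the threshold
condition[is.na(condition)] <- 0     # plays the role of coalesce(condition, 0)
group <- cumsum(condition)           # 0 0 0 1 1 2

data.frame(toy, step_diff, condition, group)
```

Every gap larger than 0.01 bumps the running total by one, so all values between two large gaps share a group number.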

In base R, you can return a named vector with tapply, constructing the grouping variable with diff and cumsum like this:

tapply(x, cumsum(c(0, diff(x) > 0.01)), mean)

This returns

         0          1          2          3          4          5          6 
0.06997360 0.09154285 0.10758986 0.19283352 0.22756353 0.25901025 0.30241027 
         7          8          9         10         11         12         13 
0.34925229 0.45741479 0.48493226 0.58468694 0.66311616 0.70852067 0.77227751 
        14         15 
0.79836237 0.91612237 
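
If the threshold needs to vary, the grouping expression can be wrapped in a small helper. This is only a sketch: group_close and its tol argument are hypothetical names, not an existing function, and the helper assumes the input is already sorted in increasing order, as in the question.

```r
# Hypothetical helper: assign a group id to each value of a sorted vector,
# starting a new group whenever the gap to the previous value exceeds tol.
group_close <- function(x, tol = 0.01) {
  stopifnot(!is.unsorted(x))   # the diff/cumsum trick relies on sorted input
  cumsum(c(0, diff(x) > tol))  # diff() has length n - 1, so prepend 0 for the first value
}

vals <- c(0.0700, 0.0915, 0.1061, 0.1065, 0.1087, 0.1879)
g <- group_close(vals)         # 0 1 2 2 2 3
means <- tapply(vals, g, mean) # one mean per group, named by group id
```

The leading 0 matters because diff returns one element fewer than its input; without it the grouping vector would be too short for tapply.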

You can put this in a data.frame like this:

data.frame(groupMeans = tapply(x, cumsum(c(0, diff(x) > 0.01)), mean))

As Jaap mentions in the comments, a more direct way to return a data.frame is to use aggregate with the same grouping mechanism.

aggregate(x, list(cumsum(c(0, diff(x) > 0.01))), mean)

This has the nice feature that the grouping vector is included as a variable in the data.frame.
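
The question also asks for each value to be replaced by its group mean, rather than for one mean per group. One way to sketch that in base R is ave, which by default applies mean within each level of the grouping vector and returns a vector as long as its input. The example uses only the first ten values of x to stay short, and the names first_ten and smoothed are illustrative.

```r
# First ten values of the data from the question
first_ten <- c(0.06997360, 0.09154285, 0.10607553, 0.10607631, 0.10652390,
               0.10857691, 0.10858633, 0.10858633, 0.10870369, 0.18790912)

groups <- cumsum(c(0, diff(first_ten) > 0.01))  # 0 1 2 2 2 2 2 2 2 3

# ave() returns one value per input element: the mean of that element's group
smoothed <- ave(first_ten, groups)
```

Here values 3 through 9 all become the same number, while the isolated values are left unchanged because each is the mean of a one-element group.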

Ronak Shah's sapply with split is a more explicit way of splitting the data and calculating the mean of each piece. tapply does this "under the hood."
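
That answer is not reproduced here, so the following is only an assumption about its general shape: a split/sapply version built on the same grouping vector, shown on a short made-up vector (vals, pieces and group_means are illustrative names).

```r
vals <- c(0.0700, 0.0915, 0.1061, 0.1065, 0.1087, 0.1879)
groups <- cumsum(c(0, diff(vals) > 0.01))

pieces <- split(vals, groups)        # named list with one numeric vector per group
group_means <- sapply(pieces, mean)  # named numeric vector of group means
```

The intermediate list makes it easy to inspect which values landed in which group before any summary is computed.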
