
Create new order for existing column values without reordering rows in dataframe - R

I have cluster labels from k-means runs performed separately on different ids (reproducible example below). The problem is that the k-means cluster codes are not ordered consistently across ids, although every id has 3 clusters.

reprex <- data.frame(id = rep(1:2, each = 4),
           v1 = rep(1:4, 2),
           cluster = c(2,2,1,3,3,1,2,2))

reprex
   id v1 cluster
1  1  1       2
2  1  2       2
3  1  3       1
4  1  4       3
5  2  1       3
6  2  2       1
7  2  3       2
8  2  4       2

What I want is for the cluster variable to always start at 1 within each id, with labels numbered in order of first appearance. Note that I don't want to reorder the data frame by cluster; the row order needs to remain the same. So the desired result would be:

reprex_desired<- data.frame(id = rep(1:2, each = 4), 
           v1 = rep(seq(1:4), 2),
           cluster = c(2,2,1,3,3,1,2,2),
           what_iWant = c(1,1,2,3,1,2,3,3))

reprex_desired
  id v1 cluster what_iWant
1  1  1       2          1
2  1  2       2          1
3  1  3       1          2
4  1  4       3          3
5  2  1       3          1
6  2  2       1          2
7  2  3       2          3
8  2  4       2          3

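We can use match after grouping by 'id'. Within each group, unique(cluster) lists the labels in order of first appearance, and match returns each row's position in that list: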

library(dplyr)
reprex <- reprex %>%
     group_by(id) %>% 
     mutate(what_IWant = match(cluster, unique(cluster))) %>%
     ungroup

Output:

reprex
# A tibble: 8 × 4
     id    v1 cluster what_IWant
  <int> <int>   <dbl>      <int>
1     1     1       2          1
2     1     2       2          1
3     1     3       1          2
4     1     4       3          3
5     2     1       3          1
6     2     2       1          2
7     2     3       2          3
8     2     4       2          3
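
For readers who prefer to avoid a dplyr dependency, the same match/unique idea can be sketched in base R with ave(), which applies a function per group and returns the values in the original row order. This is only an illustrative alternative, not part of the answer above; it reuses the question's example data and the column name what_IWant.

```r
# Example data from the question
reprex <- data.frame(id = rep(1:2, each = 4),
                     v1 = rep(1:4, 2),
                     cluster = c(2, 2, 1, 3, 3, 1, 2, 2))

# Within each id, relabel clusters by order of first appearance.
# ave() splits cluster by id, applies FUN to each piece, and puts the
# results back in the original row positions, so no rows are reordered.
reprex$what_IWant <- ave(reprex$cluster, reprex$id,
                         FUN = function(x) match(x, unique(x)))

reprex
```

Because ave() writes the result back into a copy of the numeric cluster vector, the new column is stored as double rather than integer; wrap it in as.integer() if the type matters.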

Here is a version that combines cumsum with lag:

library(dplyr)
reprex %>% 
  group_by(id) %>% 
  mutate(what_i_want = cumsum(cluster != lag(cluster, default = first(cluster))) + 1)

Output:

     id    v1 cluster what_i_want
  <int> <int>   <dbl>       <dbl>
1     1     1       2           1
2     1     2       2           1
3     1     3       1           2
4     1     4       3           3
5     2     1       3           1
6     2     2       1           2
7     2     3       2           3
8     2     4       2           3
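
One point worth checking before choosing between the two answers: the cumsum/lag version counts runs of consecutive equal values, whereas match/unique numbers distinct labels. The two agree on the example data because each cluster's rows are contiguous within an id. The sketch below uses a made-up vector x (not from the question) in which label 2 reappears after a gap, and a base R equivalent of the run-counting step, to show where they diverge.

```r
# Hypothetical labels for a single id: cluster 2 reappears after a gap
x <- c(2, 1, 2, 3)

# First-appearance relabeling: the same original label always maps
# to the same new label
by_match <- match(x, unique(x))

# Run counting (base R equivalent of the cumsum/lag idea): the counter
# increments at every change between neighbouring rows
by_runs <- cumsum(c(TRUE, x[-1] != x[-length(x)]))

by_match  # 1 2 1 3
by_runs   # 1 2 3 4
```

If a cluster label can recur non-contiguously within an id, the match-based version is the one that keeps a consistent label per cluster.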
