Create subgroups within a factor based on the sequencing of another column
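I am trying to create subgroups within a factor based on a particular column. Here is a sample dataset named "test", similar to the one I am working with.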
structure(list(old.id = c("A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C"
), id.number = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2,
3, 4, 1, 2, 3), X = c(0.859207813394842, 0.636238617960869, 0.507899267816508,
0.400124367809121, 0.867246955862074, 0.620089503630128, 0.493032629079145,
0.702937523522877, 0.897875765710176, 0.360667580073056, 0.931321208973492,
0.298666640389948, 0.94444119643156, 0.223731238077921, 0.705733544607941,
0.354808093410256, 0.196606367677969, 0.67764700709383, 0.510474776312792,
0.214473998493235), Y = c(44, 41, 43, 61, 41, 51, 55, 34, 41,
63, 15, 77, 57, 73, 60, 71, 73, 16, 50, 19), Z = c(322, 349,
395, 300, 368, 357, 385, 306, 385, 377, 323, 335, 314, 372, 372,
362, 311, 301, 332, 314), Factor1 = c("Y", "N", "N", "N", "Y",
"N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y",
"N", "N"), Factor2 = c("L", "M", "H", "L", "H", "L", "L", "M",
"H", "H", "H", "M", "L", "H", "H", "H", "L", "H", "L", "M")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
My two main goals:
If it weren't for the sequencing in the added "id.number" column, I could easily anonymize the ids with
library(tidyverse)
new_test=test %>% mutate(new_id=group_indices(.,old.id))
What I can't work out is how to group the results and assign new ids using "id.number". Below is the result I am hoping for.
structure(list(old.id = c("A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C"
), id.number = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2,
3, 4, 1, 2, 3), new.id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3,
3, 4, 4, 4, 4, 5, 5, 5), X = c(0.859207813394842, 0.636238617960869,
0.507899267816508, 0.400124367809121, 0.867246955862074, 0.620089503630128,
0.493032629079145, 0.702937523522877, 0.897875765710176, 0.360667580073056,
0.931321208973492, 0.298666640389948, 0.94444119643156, 0.223731238077921,
0.705733544607941, 0.354808093410256, 0.196606367677969, 0.67764700709383,
0.510474776312792, 0.214473998493235), Y = c(44, 41, 43, 61,
41, 51, 55, 34, 41, 63, 15, 77, 57, 73, 60, 71, 73, 16, 50, 19
), Z = c(322, 349, 395, 300, 368, 357, 385, 306, 385, 377, 323,
335, 314, 372, 372, 362, 311, 301, 332, 314), Factor1 = c("Y",
"N", "N", "N", "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y",
"Y", "Y", "N", "Y", "N", "N"), Factor2 = c("L", "M", "H", "L",
"H", "L", "L", "M", "H", "H", "H", "M", "L", "H", "H", "H", "L",
"H", "L", "M")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
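So, looking at "old.id" = A: whenever the "id.number" field cycles back to 1, that marks the start of a new "chain" of events, which is assigned its own "new.id" number. My actual dataset has 60 columns and about 500,000 rows, and any solution needs to scale to millions of rows. I would prefer a tidyverse solution so I can add it to my existing tidy pipeline, but I would appreciate any approach that works. Thanks.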
alistaire provided a very good solution to my problem in the comments above. Here it is:
df$new.id <- cumsum(df$id.number == 1)
or, in dplyr:
df <- df %>% mutate(new.id = cumsum(id.number == 1))
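For anyone who wants to check this against the sample data, here is a minimal base R sketch. It rebuilds only the "old.id" and "id.number" columns from the question and assumes the rows are already sorted in chain order. The second variant (the `starts` vector and the `new.id.alt` column) is my own hypothetical addition, not part of alistaire's comment: it also opens a new chain when "old.id" changes or "id.number" fails to increase, in case some chains in the real data do not begin at 1.

```r
# Stand-in for the question's "test" data: only the two columns that matter.
test <- data.frame(
  old.id    = rep(c("A", "B", "C"), times = c(7, 10, 3)),
  id.number = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3)
)

# Approach from the comment: every row with id.number == 1 opens a new chain,
# and the running count of such rows is the chain id.
test$new.id <- cumsum(test$id.number == 1)

# Hypothetical defensive variant (an assumption, not from the original answer):
# a chain also starts when old.id changes or id.number does not increase.
n <- nrow(test)
starts <- c(
  TRUE,
  test$old.id[-1] != test$old.id[-n] |
    test$id.number[-1] <= test$id.number[-n]
)
test$new.id.alt <- cumsum(starts)

# Compare both columns against the desired new.id from the question.
expected <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5)
print(all(test$new.id == expected))
print(all(test$new.id.alt == expected))
```

Both versions are a single vectorized pass, so they should scale to millions of rows. In a dplyr pipeline, apply the mutate() to ungrouped data: after group_by(old.id), cumsum() restarts within each group and the ids would no longer be unique across groups.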