Recursively count number of children in a dataframe by unique ids

I have a dataframe (about 4 million rows) where each row has a unique id, a parent id, and its level in the hierarchy:

   comment_id comment_parent comment_lvl
 1 1049997196     1049997055           3
 2 1052635064  2000116664444           0
 3 1053256308     1053255205           2
 4 1053367761     1053366805           1
 5 1054579447  2000117646770           0
 6 1054944680     1054821961           1
 7 1051053522     1051053049           6
 8 1052558482  2000116611974           0
 9 1056095951     1056095543           1
10 1053611186     1053565222           2

I would like to calculate the total number of descendants (direct and indirect children) of each top-level item (comment_lvl == 0). My expected output is an aggregation like this:

   comment_id    comment_lvl    total_replies
 2 1052635064              0              123
 5 1054579447              0               45
 8 1052558482              0                2

I am not sure how to tackle this efficiently because the dataset is large and has a considerable depth (around 150).

Edit:

To provide a working example:

comment_id <- c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', '1', '2', '3', '4', '5', '6', '7', '8')
parent_id <- c('0', 'A', 'B', 'B', 'A', 'E', 'F', 'G', '0', '1', '2', '2', '1', '5', '6', '7')
lvl <- c(0, 1, 2, 2, 1, 2, 3, 4, 0, 1, 2, 2, 1, 2, 3, 4)

df <- data.frame(comment_id, parent_id, lvl)

The data looks like this:

   comment_id parent_id lvl
1           A         0   0
2           B         A   1
3           C         B   2
4           D         B   2
5           E         A   1
6           F         E   2
7           G         F   3
8           H         G   4
9           1         0   0
10          2         1   1
11          3         2   2
12          4         2   2
13          5         1   1
14          6         5   2
15          7         6   3
16          8         7   4

Expected Result:

  comment_id    total_replies
1          A                7
2          1                7

Here is an option with igraph, which treats each thread as one connected component of the reply graph:

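library(igraph)

# Edges run parent -> child; top-level rows (lvl == 0) have no real parent, so drop them
g <- graph_from_data_frame(df[df$lvl != 0, c("parent_id", "comment_id")])

# Compute the connected components once (components() replaces the older clusters())
memb <- components(g)$membership

dfout <- rev(
  stack(
    sapply(
      with(df, unique(comment_id[lvl == 0])),
      # Size of the root's component, minus the root itself
      function(x) sum(memb == memb[x]) - 1
    )
  )
)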

which gives

  ind values
1   A      7
2   1      7
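
Since the question mentions about 4 million rows and a depth of around 150, a base R alternative without igraph may also be worth trying. The following is only a sketch under some assumptions: ids are unique, there are no cycles, and the columns are named as in the working example. The helper name count_replies is made up for illustration. It moves every row "up" one level per iteration until it reaches its top-level ancestor, so the loop runs roughly as many times as the hierarchy is deep, and each pass is vectorised.

```r
# Example data from the question
df <- data.frame(
  comment_id = c("A", "B", "C", "D", "E", "F", "G", "H",
                 "1", "2", "3", "4", "5", "6", "7", "8"),
  parent_id  = c("0", "A", "B", "B", "A", "E", "F", "G",
                 "0", "1", "2", "2", "1", "5", "6", "7"),
  lvl        = c(0, 1, 2, 2, 1, 2, 3, 4, 0, 1, 2, 2, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# count_replies() is a hypothetical helper, not part of any package.
# Assumes unique comment_id values and no cycles in the parent links.
count_replies <- function(df) {
  # Row index of each row's parent; NA for top-level rows
  # and for rows whose parent is not present in the data
  parent_row <- match(df$parent_id, df$comment_id)
  parent_row[df$lvl == 0] <- NA

  # Every row starts by pointing at itself, then each pointer is moved
  # to its parent until it reaches a row without a known parent
  root_row <- seq_len(nrow(df))
  repeat {
    up <- parent_row[root_row]
    movable <- !is.na(up)
    if (!any(movable)) break
    root_row[movable] <- up[movable]
  }

  # Count only chains that ended at a top-level row
  is_top <- df$lvl == 0
  reached_top <- is_top[root_row]
  counts <- tabulate(root_row[reached_top], nbins = nrow(df))

  data.frame(
    comment_id    = df$comment_id[is_top],
    total_replies = counts[is_top] - 1,  # subtract the root itself
    stringsAsFactors = FALSE
  )
}

count_replies(df)
```

On the example data this should return 7 replies for both A and 1, matching the expected result. Rows whose chain of parents never reaches a top-level row (for instance because an intermediate comment is missing from the data) are left out of the counts rather than attributed to any root.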
