I have a dataframe (about 4 million rows) where each row has a unique id, a parent id, and its level in the hierarchy:
   comment_id comment_parent comment_lvl
1  1049997196     1049997055           3
2  1052635064  2000116664444           0
3  1053256308     1053255205           2
4  1053367761     1053366805           1
5  1054579447  2000117646770           0
6  1054944680     1054821961           1
7  1051053522     1051053049           6
8  1052558482  2000116611974           0
9  1056095951     1056095543           1
10 1053611186     1053565222           2
I would like to calculate the total number of descendants (replies at any depth, not only direct children) of each top-level item (comment_lvl == 0). My expected output is an aggregation like this:
  comment_id comment_lvl total_replies
2 1052635064           0           123
5 1054579447           0            45
8 1052558482           0             2
I am not sure how to tackle this efficiently, because the dataset is large and the hierarchy is deep (around 150 levels).
Edit:
To provide a working example:
comment_id <- c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', '1', '2', '3', '4', '5', '6', '7', '8')
parent_id <- c('0', 'A', 'B', 'B', 'A', 'E', 'F', 'G', '0', '1', '2', '2', '1', '5', '6', '7')
lvl <- c(0, 1, 2, 2, 1, 2, 3, 4, 0, 1, 2, 2, 1, 2, 3, 4)
df <- data.frame(comment_id, parent_id, lvl)
The data looks like this:
   comment_id parent_id lvl
1           A         0   0
2           B         A   1
3           C         B   2
4           D         B   2
5           E         A   1
6           F         E   2
7           G         F   3
8           H         G   4
9           1         0   0
10          2         1   1
11          3         2   2
12          4         2   2
13          5         1   1
14          6         5   2
15          7         6   3
16          8         7   4
Expected Result:
  comment_id total_replies
1          A             7
2          1             7
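Here is an option with igraph. Each top-level comment together with all of its replies forms one connected component, so the reply count is the component size minus one (the root itself):
library(igraph)
# Edge list (parent -> child) built from the non-root rows only
edges <- subset(df, lvl != 0, select = c(parent_id, comment_id))
g <- graph_from_data_frame(edges)
# Compute the connected components once instead of inside the loop
comp <- components(g)
roots <- as.character(unique(df$comment_id[df$lvl == 0]))
dfout <- rev(
  stack(
    sapply(
      roots,
      # size of the root's component, minus the root itself
      function(x) comp$csize[comp$membership[x]] - 1
    )
  )
)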
which gives
  ind values
1   A      7
2   1      7
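If building a graph object for roughly 4 million rows turns out to be too heavy, one possible alternative is to resolve the root of every row directly in base R with pointer jumping: each row starts out pointing at its parent, and every pass replaces that pointer with the parent's pointer, so the distance covered doubles each time and a depth of about 150 should need on the order of log2(150) passes. This is only a sketch under assumptions not stated in the question: comment_id is unique, the hierarchy contains no cycles, and rows whose chain of ancestors is broken (a parent missing from the data) are simply left uncounted. The names ptr, is_root, desc and result are illustrative.

```r
# Toy data from the question; ids kept as character
df <- data.frame(
  comment_id = c("A", "B", "C", "D", "E", "F", "G", "H",
                 "1", "2", "3", "4", "5", "6", "7", "8"),
  parent_id  = c("0", "A", "B", "B", "A", "E", "F", "G",
                 "0", "1", "2", "2", "1", "5", "6", "7"),
  lvl        = c(0, 1, 2, 2, 1, 2, 3, 4, 0, 1, 2, 2, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

is_root <- df$lvl == 0

# Row index of each row's parent (NA when the parent is not in the data)
ptr <- match(df$parent_id, df$comment_id)
# Roots point at themselves so that they stay fixed during the passes
ptr[is_root] <- which(is_root)

# Pointer jumping: follow "parent of parent" until nothing changes.
# The pass limit is a safety net in case the no-cycle assumption fails.
for (pass in seq_len(64)) {
  nxt <- ptr[ptr]
  if (identical(nxt, ptr)) break
  ptr <- nxt
}

# Count the non-root rows that resolved to each root
desc <- ptr[!is_root]
desc <- desc[!is.na(desc)]
result <- data.frame(
  comment_id    = df$comment_id[is_root],
  total_replies = tabulate(desc, nbins = nrow(df))[is_root],
  stringsAsFactors = FALSE
)
result
```

On the toy data this should reproduce the expected result (7 replies for both A and 1). Unlike the igraph version, top-level comments without any replies get a count of 0 here rather than NA, because they never need to appear in an edge list.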