
How to consolidate two id columns using R or Python, identifying which rows belong to same set of related IDs

I have 2 ID columns that are created/collected independently. I'm trying to consolidate these two ID columns into one by determining which rows are part of the same related group of ids based on either of the two ID columns. I would consider the rows to be related based on a few rules:

1: If a LOAN has the same value in multiple rows, those rows belong to the same group. In the example (for reference only) I've called it loan_group. No issues here.

2: If a COLLATERAL has the same value in multiple rows, those rows belong to the same group. I've called it collateral_group (same rule as #1). No issues here.

3: Finally, and I'm not sure how to phrase this exactly, but any time there is overlap between values that are part of the same group (across loan and collateral columns), those groups should be further consolidated. For example:

df <- data.frame('LOAN' = c('L1', 'L2', 'L5', 'L2', 'L6', 'L7', 'L8'),
                 'COLLATERAL' = c('C1', 'C1', 'C8', 'C4', 'C8', 'C9', 'C4'))
df$loan_group <- as.numeric(factor(df$LOAN))
df$collateral_group <- as.numeric(factor(df$COLLATERAL))
df$final_grouping <- NA
LOAN  COLLATERAL  loan_group  collateral_group  final_grouping
----  ----------- ----------  ----------------  --------------
L1    C1*         1           1                 **1**
L2**  C1*         2           1                 **1**
L5    C8          3           2                 2
L2**  C4***       2           3                 **1**
L6    C8          4           2                 2
L7    C9          5           4                 3
L8    C4***       6           3                 **1**

*because rows 1 and 2 both have the value C1, they would be assigned to the same final grouping

**because row 2 has the LOAN value L2, this means we can assign row 4 the final grouping of '1' because that row can be linked back to row 1 via the L2/C1 link

***finally, because row 4 includes the COLLATERAL value C4, this means we can include row 7 in the consolidated final grouping. That row can be linked back to row one via the L2/C4 & L2/C1 links

The data set is roughly 15m unique combinations of LOAN + COLLATERAL. In some edge cases a single group will likely span a few thousand (possibly more than 10 thousand) IDs. I ran into resource issues on BQ while testing some solutions, including the suggestions from my original question, which is why I'd like to attempt this in R/Python instead.

If you treat this as a graph problem, you can do something like:

library(igraph)

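# `tab` is assumed to hold the question's data with lowercase column names
# `loan` and `collateral`; each row becomes one vertex of the graph
g <- make_empty_graph(directed = FALSE, n = nrow(tab))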

for (loan_id in unique(tab$loan)) {
    loan_idx = which(tab$loan == loan_id)
    if (length(loan_idx) >= 2) {
        g <- g + path(loan_idx)
    }
}

for (collateral_id in unique(tab$collateral)) {
    collateral_idx = which(tab$collateral == collateral_id)
    if (length(collateral_idx) >= 2) {
        g <- g + path(collateral_idx)
    }
}

tab$grouping = components(g)$membership

That is, each row becomes a vertex, and rows that share a loan or collateral ID are joined by a path of edges; the connected components of the graph are then the final groups. I'm not sure how optimised this is, though, as for loops in R are rarely the right answer.
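Since the question also allows Python, here is a minimal sketch of the same connected-components idea using a union-find (disjoint-set) structure instead of igraph. It is a swapped-in technique, not the method above: the nodes are the ID values themselves rather than the rows, and each row merges its loan with its collateral. The function name `consolidate` and the `("loan", ...)` / `("collateral", ...)` tagging are illustrative assumptions, and the sketch has not been tested at the 15m-row scale.

```python
def consolidate(loans, collaterals):
    """Return one group label per row (hypothetical helper, not from the answer).

    Rows linked directly or transitively by a shared loan or collateral
    value receive the same label. Labels are numbered in order of first
    appearance.
    """
    parent = {}

    def find(x):
        # Walk up to the root, then point every node on the path at it.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Tag each ID with its column so a loan and a collateral that happen
    # to share the same string are never confused.
    for loan, coll in zip(loans, collaterals):
        a, b = ("loan", loan), ("collateral", coll)
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        union(a, b)

    labels = {}
    result = []
    for loan in loans:
        root = find(("loan", loan))
        result.append(labels.setdefault(root, len(labels) + 1))
    return result


loans = ["L1", "L2", "L5", "L2", "L6", "L7", "L8"]
collaterals = ["C1", "C1", "C8", "C4", "C8", "C9", "C4"]
groups = consolidate(loans, collaterals)
print(groups)  # [1, 1, 2, 1, 2, 3, 1]
```

Each row is touched only once during merging and once during labelling, so there is no per-ID scan of the whole table as in the `which()` loops.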

The output matches your expected output:

> tab
  loan collateral loan_group collateral_group final grouping
1   L1         C1          1                1     1        1
2   L2         C1          2                1     1        1
3   L5         C8          3                2     2        2
4   L2         C4          2                3     1        1
5   L6         C8          4                2     2        2
6   L7         C9          5                4     3        3
7   L8         C4          6                3     1        1

 