简体   繁体   中英

Checking the type of relationship between columns in python/pandas? (one-to-one, one-to-many, or many-to-many)

Let's say I have 5 columns.

pd.DataFrame({
'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

Is there a function to know the type of relationship each par of columns has? (one-to-one, one-to-many, many-to-one, many-to-many)

An output like:

Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column3 many-to-many
...
Column4 Column5 one-to-many

This should work for you:

df = pd.DataFrame({
'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

def get_relation(df, col1, col2):        
    first_max = df[[col1, col2]].groupby(col1).count().max()[0]
    second_max = df[[col1, col2]].groupby(col2).count().max()[0]
    if first_max==1:
        if second_max==1:
            return 'one-to-one'
        else:
            return 'one-to-many'
    else:
        if second_max==1:
            return 'many-to-one'
        else:
            return 'many-to-many'

from itertools import product
for col_i, col_j in product(df.columns, df.columns):
    if col_i == col_j:
        continue
    print(col_i, col_j, get_relation(df, col_i, col_j))

output:

Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column1 many-to-one
Column2 Column3 many-to-many
Column2 Column4 many-to-one
Column2 Column5 many-to-many
Column3 Column1 many-to-one
Column3 Column2 many-to-many
Column3 Column4 many-to-one
Column3 Column5 many-to-many
Column4 Column1 one-to-one
Column4 Column2 one-to-many
Column4 Column3 one-to-many
Column4 Column5 one-to-many
Column5 Column1 many-to-one
Column5 Column2 many-to-many
Column5 Column3 many-to-many
Column5 Column4 many-to-one

This may not be a perfect answer, but it should work with some further modification:

a = df.nunique()
is9, is1 = a==9, a==1
one_one = is9[:, None] & is9
one_many = is1[:, None]
many_one = is1[None, :]
many_many = (~is9[:,None]) & (~is9)

pd.DataFrame(np.select([one_one, one_many, many_one],
                       ['one-to-one', 'one-to-many', 'many-to-one'],
                       'many-to-many'),
             df.columns, df.columns)

Output:

              Column1       Column2       Column3       Column4      Column5
Column1    one-to-one  many-to-many  many-to-many    one-to-one  many-to-one
Column2  many-to-many  many-to-many  many-to-many  many-to-many  many-to-one
Column3  many-to-many  many-to-many  many-to-many  many-to-many  many-to-one
Column4    one-to-one  many-to-many  many-to-many    one-to-one  many-to-one
Column5   one-to-many   one-to-many   one-to-many   one-to-many  one-to-many

First we get all the combinations of the columns with itertools.product :

Finally we use pd.merge with validate argument to check for which relationship "passes" the test with try, except :

Notice, we leave out many_to_many since this relationship is not "checked", quoted from docs:

“many_to_many” or “m:m”: allowed, but does not result in checks.

from itertools import product

def check_cardinality(df):

    combinations_lst = list(product(df.columns, df.columns))
    relations = ['one_to_one', 'one_to_many', 'many_to_one']

    output = []
    for col1, col2 in combinations_lst:
        for relation in relations:
            try:
                pd.merge(df[[col1]], df[[col2]], left_on=col1, right_on=col2, validate=relation)
                output.append([col1, col2, relation])
            except:
                continue

    return output

cardinality = (pd.DataFrame(check_cardinality(df), columns=['first_column', 'second_column', 'cardinality'])
               .drop_duplicates(['first_column', 'second_column'])
               .reset_index(drop=True))

Output

   first_column second_column  cardinality
0       Column1       Column1   one_to_one
1       Column1       Column2  one_to_many
2       Column1       Column3  one_to_many
3       Column1       Column4   one_to_one
4       Column1       Column5  one_to_many
5       Column2       Column1  many_to_one
6       Column2       Column4  many_to_one
7       Column3       Column1  many_to_one
8       Column3       Column4  many_to_one
9       Column4       Column1   one_to_one
10      Column4       Column2  one_to_many
11      Column4       Column3  one_to_many
12      Column4       Column4   one_to_one
13      Column4       Column5  one_to_many
14      Column5       Column1  many_to_one
15      Column5       Column4  many_to_one

I tried to use Andrea's answer investigate some huge CSV files and was getting many-to-many for just about everything - even columns I was sure were 1-1. The problem was duplicates.

Here's a slightly modified version with a demo, and with a format that matches database terminology (and a description to remove ambiguity)

Firstly a clearer example

Doctors make many prescriptions which can each have several drugs prescribed, but each drug is made by one producer and each producer only makes one drug.

       doctor  prescription         drug producer
0  Doctor Who             1      aspirin    Bayer
1    Dr Welby             2      aspirin    Bayer
2       Dr Oz             3      aspirin    Bayer
3  Doctor Who             4  paracetamol  Tylenol
4    Dr Welby             5  paracetamol  Tylenol
5       Dr Oz             6  antibiotics    Merck
6  Doctor Who             7      aspirin    Bayer

Correct results from my functions below

Main changes to Andrea's:

  • drop_duplicates on the pairs so that 1-1 does not get seen as many-many
  • I put the results in a dataframe (see report_df in the function) to make it easier to read the results
  • I reversed the logic to match UML terms (I'm staying out of the debate of set vs UML - this is just the way I wanted it)
        column 1      column 2   cardinality                                        description
0         doctor  prescription     1-to-many   each doctor has many prescriptions (some  had 3)
1         doctor          drug  many-to-many  doctors had up to 2 drugs, and drugs up to 3 d...
2         doctor      producer  many-to-many  doctors had up to 2 producers, and producers u...
3   prescription        doctor     many-to-1             many prescriptions (max 3) to 1 doctor
4   prescription          drug     many-to-1               many prescriptions (max 4) to 1 drug
5   prescription      producer     many-to-1           many prescriptions (max 4) to 1 producer
6           drug        doctor  many-to-many  drugs had up to 3 doctors, and doctors up to 2...
7           drug  prescription     1-to-many     each drug has many prescriptions (some  had 4)
8           drug      producer        1-to-1               1 drug has 1 producer and vice versa
9       producer        doctor  many-to-many  producers had up to 3 doctors, and doctors up ...
10      producer  prescription     1-to-many  each producer has many prescriptions (some  ha...
11      producer          drug        1-to-1               1 producer has 1 drug and vice versa

WRONG results without drop-duplicate below

These are based on my modified copy of Andrea's aglo without the drop-duplicates.

You can see how the last row - doctor-to-drug - is many-to-many when it should be 1-1 - that explains my initial results (which are hard to debug with 1000s of records)

           column 1      column 2   cardinality                                        description
0         doctor  prescription     1-to-many   each doctor has many prescriptions (some  had 3)
1         doctor          drug  many-to-many  doctors had up to 3 drugs, and drugs up to 4 d...
2         doctor      producer  many-to-many  doctors had up to 3 producers, and producers u...
3   prescription        doctor     many-to-1             many prescriptions (max 3) to 1 doctor
4   prescription          drug     many-to-1               many prescriptions (max 4) to 1 drug
5   prescription      producer     many-to-1           many prescriptions (max 4) to 1 producer
6           drug        doctor  many-to-many  drugs had up to 4 doctors, and doctors up to 3...
7           drug  prescription     1-to-many     each drug has many prescriptions (some  had 4)
8           drug      producer  many-to-many  drugs had up to 4 producers, and producers up ...
9       producer        doctor  many-to-many  producers had up to 4 doctors, and doctors up ...
10      producer  prescription     1-to-many  each producer has many prescriptions (some  ha...
11      producer          drug  many-to-many  producers had up to 4 drugs, and drugs up to 4...

New functions

from itertools import product
import pandas as pd

def get_relation(df, col1, col2):
    # pair columns, drop duplicates (for proper 1-1), group by each column with 
    # the count of entries from the other column associated with each group 
    first_max = df[[col1, col2]].drop_duplicates().groupby(col1).count().max()[0]
    second_max = df[[col1, col2]].drop_duplicates().groupby(col2).count().max()[0]
    if first_max==1:
        if second_max==1:
            return '1-to-1', f'1 {col1} has 1 {col2} and vice versa'
        else:
            return 'many-to-1',f'many {col1}s (max {second_max}) to 1 {col2}'
    else:
        if second_max==1:
            return '1-to-many', f'each {col1} has many {col2}s (some  had {first_max})'
        else:
            return f'many-to-many', f'{col1}s had up to {first_max} {col2}s, and {col2}s up to {second_max} {col1}s'

def report_relations(df):
    report = [] 
    for col_i, col_j in product(df.columns, df.columns):
        if col_i == col_j:
            continue
        relation = get_relation(df, col_i, col_j)
        report.append([col_i, col_j, *relation])
    report_df = pd.DataFrame(report, columns=["column 1", "column 2", "cardinality", "description"])
    # formating
    pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000)
    # comment one of these two out depending on where you're using it
    display(report_df) # for jupyter
    print(report_df)   # SO

test_df = pd.DataFrame({
    'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz','Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who'],
    'prescription': [1, 2, 3, 4, 5, 6, 7],
    'drug': [ 'aspirin', 'aspirin', 'aspirin', 'paracetemol', 'paracetemol', 'antibiotics', 'aspirin'],
    'producer': [ 'Bayer', 'Bayer', 'Bayer', 'Tylenol', 'Tylenol', 'Merck', 'Bayer']

})

display(test_df)
print(test_df)
report_relations(test_df)

Thank's Andrea - this helped me a lot.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM