Let's say I have 5 columns.
pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})
Is there a function to determine the type of relationship each pair of columns has (one-to-one, one-to-many, many-to-one, many-to-many)?
An output like:
Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column3 many-to-many
...
Column4 Column5 one-to-many
This should work for you:
from itertools import product

import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

def get_relation(df, col1, col2):
    # Largest number of rows that share one value of col1, and of col2
    first_max = df[[col1, col2]].groupby(col1).count().max().iloc[0]
    second_max = df[[col1, col2]].groupby(col2).count().max().iloc[0]
    if first_max == 1:
        if second_max == 1:
            return 'one-to-one'
        else:
            return 'one-to-many'
    else:
        if second_max == 1:
            return 'many-to-one'
        else:
            return 'many-to-many'

# Check every ordered pair of distinct columns
for col_i, col_j in product(df.columns, df.columns):
    if col_i == col_j:
        continue
    print(col_i, col_j, get_relation(df, col_i, col_j))
output:
Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column1 many-to-one
Column2 Column3 many-to-many
Column2 Column4 many-to-one
Column2 Column5 many-to-many
Column3 Column1 many-to-one
Column3 Column2 many-to-many
Column3 Column4 many-to-one
Column3 Column5 many-to-many
Column4 Column1 one-to-one
Column4 Column2 one-to-many
Column4 Column3 one-to-many
Column4 Column5 one-to-many
Column5 Column1 many-to-one
Column5 Column2 many-to-many
Column5 Column3 many-to-many
Column5 Column4 many-to-one
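To see where a label comes from, it can help to look at the intermediate counts for a single pair. The sketch below rebuilds the same frame and inspects Column2 against Column1; the names `per_col2` and `per_col1` are illustrative only and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

pair = df[['Column2', 'Column1']]
# Rows per distinct Column2 value: the values 3 and 4 each occur three times
per_col2 = pair.groupby('Column2').count()['Column1']
# Rows per distinct Column1 value: every value occurs exactly once
per_col1 = pair.groupby('Column1').count()['Column2']

print(per_col2.max(), per_col1.max())  # 3 1
```

A maximum above 1 on the Column2 side together with a maximum of exactly 1 on the Column1 side is what the function reports as many-to-one for this pair.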
This may not be a perfect answer, but it should work with some further modification:
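import numpy as np
import pandas as pd

# Number of distinct values per column, as a NumPy array so it can be broadcast
a = df.nunique().to_numpy()
# is9: every row holds a different value (9 rows here); is1: one constant value
is9, is1 = a == len(df), a == 1
one_one = is9[:, None] & is9
one_many = is1[:, None]
many_one = is1[None, :]
many_many = (~is9[:, None]) & (~is9)  # unused: np.select falls back to 'many-to-many'
pd.DataFrame(np.select([one_one, one_many, many_one],
                       ['one-to-one', 'one-to-many', 'many-to-one'],
                       'many-to-many'),
             df.columns, df.columns)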
Output:
         Column1       Column2       Column3       Column4       Column5
Column1  one-to-one    many-to-many  many-to-many  one-to-one    many-to-one
Column2  many-to-many  many-to-many  many-to-many  many-to-many  many-to-one
Column3  many-to-many  many-to-many  many-to-many  many-to-many  many-to-one
Column4  one-to-one    many-to-many  many-to-many  one-to-one    many-to-one
Column5  one-to-many   one-to-many   one-to-many   one-to-many   one-to-many
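The masks in this approach depend only on how many distinct values each column holds, not on how the values pair up row by row. A quick look at those counts for the sample frame (a sketch; `counts` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

# Distinct values per column, converted to plain ints for printing
counts = {col: int(n) for col, n in df.nunique().items()}
print(counts)
# {'Column1': 9, 'Column2': 5, 'Column3': 4, 'Column4': 9, 'Column5': 1}
```

Only Column1 and Column4 reach the row count of 9, and only Column5 holds a single value, so pairs among Column2 and Column3 (and their pairs with Column1 or Column4) fall through to the default label.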
First we get all the combinations of the columns with itertools.product.
Finally we use pd.merge with the validate argument to check which relationship "passes" the test, using try/except.
Notice that we leave out many_to_many since this relationship is not "checked"; quoted from the docs:
“many_to_many” or “m:m”: allowed, but does not result in checks.
from itertools import product

import pandas as pd

def check_cardinality(df):
    combinations_lst = list(product(df.columns, df.columns))
    relations = ['one_to_one', 'one_to_many', 'many_to_one']
    output = []
    for col1, col2 in combinations_lst:
        for relation in relations:
            try:
                # validate raises MergeError when the keys break the relation
                pd.merge(df[[col1]], df[[col2]], left_on=col1, right_on=col2, validate=relation)
                output.append([col1, col2, relation])
            except pd.errors.MergeError:
                continue
    return output

# Keep only the first relation that passed for each pair of columns
cardinality = (pd.DataFrame(check_cardinality(df), columns=['first_column', 'second_column', 'cardinality'])
               .drop_duplicates(['first_column', 'second_column'])
               .reset_index(drop=True))
Output
first_column second_column cardinality
0 Column1 Column1 one_to_one
1 Column1 Column2 one_to_many
2 Column1 Column3 one_to_many
3 Column1 Column4 one_to_one
4 Column1 Column5 one_to_many
5 Column2 Column1 many_to_one
6 Column2 Column4 many_to_one
7 Column3 Column1 many_to_one
8 Column3 Column4 many_to_one
9 Column4 Column1 one_to_one
10 Column4 Column2 one_to_many
11 Column4 Column3 one_to_many
12 Column4 Column4 one_to_one
13 Column4 Column5 one_to_many
14 Column5 Column1 many_to_one
15 Column5 Column4 many_to_one
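Pairs such as Column2 and Column3 do not appear in this output because none of the three validated relations passes for them. One possible way to list them as well is sketched here, under the assumption that a pair failing all three checks should be labelled many_to_many (the function name `cardinality_table` is hypothetical):

```python
from itertools import product

import pandas as pd

def cardinality_table(df):
    checked = ("one_to_one", "one_to_many", "many_to_one")
    rows = []
    for col1, col2 in product(df.columns, repeat=2):
        label = "many_to_many"  # assumed fallback when every check fails
        for relation in checked:
            try:
                pd.merge(df[[col1]], df[[col2]], left_on=col1,
                         right_on=col2, validate=relation)
            except pd.errors.MergeError:
                continue  # this relation does not hold, try the next one
            label = relation
            break
        rows.append((col1, col2, label))
    return pd.DataFrame(rows, columns=["first_column", "second_column", "cardinality"])

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

table = cardinality_table(df)
# Show only the pairs that the validated merges could not classify
print(table[table["cardinality"] == "many_to_many"])
```

Every pair, including a column paired with itself, gets exactly one row, so the result has one entry per combination rather than only the pairs that passed a check.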
I tried to use Andrea's answer to investigate some huge CSV files and was getting many-to-many for just about everything, even for columns I was sure were 1-1. The problem was duplicates.
Here's a slightly modified version with a demo, in a format that matches database terminology (and with a description to remove ambiguity).
Doctors make many prescriptions which can each have several drugs prescribed, but each drug is made by one producer and each producer only makes one drug.
doctor prescription drug producer
0 Doctor Who 1 aspirin Bayer
1 Dr Welby 2 aspirin Bayer
2 Dr Oz 3 aspirin Bayer
3 Doctor Who 4 paracetamol Tylenol
4 Dr Welby 5 paracetamol Tylenol
5 Dr Oz 6 antibiotics Merck
6 Doctor Who 7 aspirin Bayer
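Main changes to Andrea's: duplicate pairs are dropped before counting (so a repeated 1-1 pair is not reported as many-to-many), and the results are collected in a dataframe (report_df in the function) to make them easier to read:
column 1 column 2 cardinality description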
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 2 drugs, and drugs up to 3 d...
2 doctor producer many-to-many doctors had up to 2 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 3 doctors, and doctors up to 2...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer 1-to-1 1 drug has 1 producer and vice versa
9 producer doctor many-to-many producers had up to 3 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug 1-to-1 1 producer has 1 drug and vice versa
The results below are from my modified copy of Andrea's algorithm without the drop_duplicates.
You can see how the last row, producer-to-drug, is many-to-many when it should be 1-1; that explains my initial results (which are hard to debug with thousands of records).
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 3 drugs, and drugs up to 4 d...
2 doctor producer many-to-many doctors had up to 3 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 4 doctors, and doctors up to 3...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer many-to-many drugs had up to 4 producers, and producers up ...
9 producer doctor many-to-many producers had up to 4 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug many-to-many producers had up to 4 drugs, and drugs up to 4...
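The effect of duplicates can be reproduced on a much smaller frame. This sketch uses made-up rows and compares counting rows per group with counting distinct partners via `nunique`, which is an alternative to a `drop_duplicates` step and not the code used for the tables above:

```python
import pandas as pd

demo = pd.DataFrame({
    'drug': ['aspirin', 'aspirin', 'paracetamol'],
    'producer': ['Bayer', 'Bayer', 'Tylenol'],
})

# Counting rows: the repeated (aspirin, Bayer) pair is counted twice
rows_per_drug = demo.groupby('drug')['producer'].count().max()
# Counting distinct partners: the repeated pair counts only once
producers_per_drug = demo.groupby('drug')['producer'].nunique().max()

print(rows_per_drug, producers_per_drug)  # 2 1
```

With row counts, aspirin appears to have two producers; with distinct counts, each drug has exactly one.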
from itertools import product

import pandas as pd

def get_relation(df, col1, col2):
    # pair columns, drop duplicates (for proper 1-1), group by each column with
    # the count of entries from the other column associated with each group
    pairs = df[[col1, col2]].drop_duplicates()
    first_max = pairs.groupby(col1).count().max().iloc[0]
    second_max = pairs.groupby(col2).count().max().iloc[0]
    if first_max == 1:
        if second_max == 1:
            return '1-to-1', f'1 {col1} has 1 {col2} and vice versa'
        else:
            return 'many-to-1', f'many {col1}s (max {second_max}) to 1 {col2}'
    else:
        if second_max == 1:
            return '1-to-many', f'each {col1} has many {col2}s (some had {first_max})'
        else:
            return 'many-to-many', f'{col1}s had up to {first_max} {col2}s, and {col2}s up to {second_max} {col1}s'

def report_relations(df):
    report = []
    for col_i, col_j in product(df.columns, df.columns):
        if col_i == col_j:
            continue
        relation = get_relation(df, col_i, col_j)
        report.append([col_i, col_j, *relation])
    report_df = pd.DataFrame(report, columns=["column 1", "column 2", "cardinality", "description"])
    # formatting
    pd.set_option('display.max_columns', 1000)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_rows', 1000)
    # use display() in Jupyter and print() elsewhere (e.g. for SO)
    # display(report_df)
    print(report_df)

test_df = pd.DataFrame({
    'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who'],
    'prescription': [1, 2, 3, 4, 5, 6, 7],
    'drug': ['aspirin', 'aspirin', 'aspirin', 'paracetamol', 'paracetamol', 'antibiotics', 'aspirin'],
    'producer': ['Bayer', 'Bayer', 'Bayer', 'Tylenol', 'Tylenol', 'Merck', 'Bayer']
})
# display(test_df)  # for Jupyter
print(test_df)
report_relations(test_df)
Thanks, Andrea - this helped me a lot.