
How to map a many-to-one relationship between two Pandas DataFrames?

I am having trouble mapping a many-to-one relationship across two DataFrames. My best attempts return rows with ambiguous group keys: each id should be associated with exactly one group key, but instead I get several.

Consider my approach:

import numpy as np
import pandas as pd

# generate some data
df1 = pd.DataFrame(
    {
        "df1_key": [45, 46, 47, 48, 49],
        "df1_items": [
            "364740, 369904",
            "369904, 364740",
            "345251, 345254, 345262, 345264",
            "345262",
            "369904, 364740",
        ],
    }
)


df2 = pd.DataFrame(
    {
        "df2_key": [14, 15, 16, 17, 18, 19],
        "df2_items": [364740, 369904, 345251, 345254, 345262, 345264],
    }
)

# get groups of the first df
df1["group_key"] = pd.factorize(df1["df1_items"])[0]

# get a key-value mapping of unique rows to group keys
group_map = dict(zip(df1["df1_items"], df1["group_key"]))

# storage container for (df2_key, df2_items, group_key) rows
results = np.empty((0, 3), int)

# for each unique items-string and its group key
for key, value in group_map.items():
    # split the comma-separated string into its component ids
    current_key = [item.strip() for item in key.split(",")]
    # for each component of the split string
    for i in current_key:
        # look up the value in df2 and retrieve its key and item
        findings = df2.loc[df2["df2_items"] == int(i)][["df2_key", "df2_items"]].values
        # concat the value from the dict to go along with the data above
        findings = np.concatenate(
            (findings, np.repeat(value, len(findings)).reshape(-1, 1)), axis=1
        )
        # store it all in a container
        results = np.append(results, findings, axis=0)
# make a df
df_results = pd.DataFrame(
    {"df2_key": results[:, 0], "id": results[:, 1], "group_key": results[:, 2]}
)
# keys are unfortunately associated with multiple group keys
df_results

The failure:

   df2_key      id  group_key
0       14  364740          0
1       15  369904          0
2       15  369904          1
3       14  364740          1
4       16  345251          2
5       17  345254          2
6       18  345262          2
7       19  345264          2
8       18  345262          3

If I understand correctly what you are trying to achieve, there are a few issues here. First of all, df1_items has items in mixed order (i.e. "364740, 369904" vs "369904, 364740"), and "345262" appears both by itself and inside a larger group.

Then you are doing several steps when basically you just want to explode the items column and then factorize - or list out which groups contain each id you need.

To do this, it's better to first transform the column contents into a list:

    df1['clean_items'] = df1.apply(lambda x: sorted(x['df1_items'].replace(' ', '').split(',')), axis=1)
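
As a side note, the same cleaning can be done with vectorized string methods instead of apply (a minimal sketch, assuming the same comma-and-space format):

    # sketch: vectorized equivalent of the apply above
    df1['clean_items'] = (
        df1['df1_items']
        .str.replace(' ', '', regex=False)  # drop the spaces
        .str.split(',')                     # split into lists of id strings
        .apply(sorted)                      # sort each list into a canonical order
    )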

Approach 1: explode and factorize

You explode the clean_items column and factorize:

    df1 = df1.explode('clean_items')  
    df1["group_key"] = pd.factorize(df1["clean_items"])[0]

and here, instead of going the dict route, you just groupby and merge to df2:

    df1_group = df1[['clean_items', 'group_key']].groupby(['clean_items', 'group_key']).size().reset_index()
    df1_group['clean_items'] = df1_group['clean_items'].astype('int64')
    result = df2.merge(df1_group[['clean_items', 'group_key']], left_on='df2_items', right_on='clean_items', how='left')
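
The groupby here only serves to deduplicate, so drop_duplicates is a leaner equivalent (a sketch of the same step):

    # sketch: unique (item, group) pairs without the groupby/size detour
    df1_group = df1[['clean_items', 'group_key']].drop_duplicates()
    df1_group['clean_items'] = df1_group['clean_items'].astype('int64')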

and your result is

       df2_key  df2_items  clean_items  group_key
    0       14     364740       364740          0
    1       15     369904       369904          1
    2       16     345251       345251          2
    3       17     345254       345254          3
    4       18     345262       345262          4
    5       19     345264       345264          5

Where each id appears only once.
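
If you want a tidier frame, note that after the merge the clean_items column just duplicates df2_items, so you can drop it:

    # optional clean-up: clean_items duplicates df2_items after the merge
    result = result.drop(columns=['clean_items'])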

Approach 2: map each id to multiple groups

The issue you have in df1 is that ids appear in more than one row. So, as an alternative, you can show in which groups each id appears.

Here, rather than factorizing on clean_items, you still factorize on df1_items, but then you also apply a groupby, so you can see which ids belong to multiple groups (and which groups they belong to):

    df1 = df1.explode('clean_items')
    df1["group_key"] = pd.factorize(df1["df1_items"])[0]
    # aggregate with set, so each clean_item gets the set of groups it appears in
    df1b = df1.groupby(['clean_items'])['group_key'].apply(set).reset_index()

    # transform the set into a sorted list for easier, deterministic handling
    df1b['group_key_list'] = df1b.apply(lambda x: sorted(x['group_key']), axis=1)

    # transform `clean_items` into an int and merge to df2
    df1b['clean_items'] = df1b['clean_items'].astype('int64')
    result_b = df2.merge(df1b[['clean_items', 'group_key_list']], left_on='df2_items', right_on='clean_items', how='left')

and your result now is

       df2_key  df2_items  clean_items group_key_list
    0       14     364740       364740         [0, 1]
    1       15     369904       369904         [0, 1]
    2       16     345251       345251            [2]
    3       17     345254       345254            [2]
    4       18     345262       345262         [2, 3]
    5       19     345264       345264            [2]
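
From here you can, for example, filter for the ids that sit in more than one group (a small usage sketch; str.len works element-wise on the lists):

    # ids that appear in more than one group
    multi = result_b[result_b['group_key_list'].str.len() > 1]
    print(multi[['df2_items', 'group_key_list']])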
