I am having trouble mapping a many-to-one relationship across two DataFrames. In my best attempt, unique rows end up associated with ambiguous group keys (each should have exactly one, but instead I get several).
Consider my approach:
import numpy as np
import pandas as pd
# generate some data
df1 = pd.DataFrame(
    {
        "df1_key": [45, 46, 47, 48, 49],
        "df1_items": [
            "364740, 369904",
            "369904, 364740",
            "345251, 345254, 345262, 345264",
            "345262",
            "369904, 364740",
        ],
    }
)
df2 = pd.DataFrame(
    {
        "df2_key": [14, 15, 16, 17, 18, 19],
        "df2_items": [364740, 369904, 345251, 345254, 345262, 345264],
    }
)
# get groups of the first df
df1["group_key"] = pd.factorize(df1["df1_items"])[0]
# get a key-value mapping of unique rows to group keys
group_map = dict(zip(df1["df1_items"], df1["group_key"]))
# storage container for (df2_key, df2_items, group_key) triples
results = np.empty((0, 3), int)
# for each key-value pair in the map
for key, value in group_map.items():
    # split the comma-separated string into individual items
    items = [item.strip() for item in key.split(",")]
    # for each component of the split string
    for i in items:
        # look up the item in df2 and retrieve its key and value
        findings = df2.loc[df2["df2_items"] == int(i)][["df2_key", "df2_items"]].values
        # append the group key from the dict to each matched row
        findings = np.concatenate(
            (findings, np.repeat(value, len(findings)).reshape(-1, 1)), axis=1
        )
        # store it all in the container
        results = np.append(results, findings, axis=0)
# make a df
df_results = pd.DataFrame(
    {"df2_key": results[:, 0], "id": results[:, 1], "group_key": results[:, 2]}
)
# keys are unfortunately associated with multiple group keys
df_results
The failure:
df2_key id group_key
14 364740 0
15 369904 0
15 369904 1
14 364740 1
16 345251 2
17 345254 2
18 345262 2
19 345264 2
18 345262 3
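The root cause can be seen in isolation: pd.factorize operates on the raw strings, so two rows that contain the same ids in a different order get different group keys. A minimal sketch:

```python
import pandas as pd

# the same two ids, listed in a different order, factorize to
# two distinct group keys because the strings differ
codes, uniques = pd.factorize(pd.Series(["364740, 369904", "369904, 364740"]))
print(codes)  # [0 1]
```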
If I understand correctly what you are trying to achieve, there are a few issues here. First of all, df1_items contains the same "items" in mixed order (e.g. "364740, 369904" vs "369904, 364740", or "345262", which appears both on its own and within a group). Then you are doing several steps when basically you just want to explode the items and factorize them - or list out which rows contain each id.
To do this, it is better to transform the column contents into a list:
df1['clean_items'] = df1.apply(lambda x: sorted(x['df1_items'].replace(' ', '').split(',')), axis=1)
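As a side note, the row-wise apply above could also be written with the vectorized .str accessor; this is only a stylistic alternative, sketched here on a small stand-in frame (sorting the strings lexicographically is safe because all ids have the same number of digits):

```python
import pandas as pd

df1 = pd.DataFrame({"df1_items": ["364740, 369904", "369904, 364740", "345262"]})
# strip spaces, split on commas, then sort each resulting list
df1["clean_items"] = df1["df1_items"].str.replace(" ", "").str.split(",").map(sorted)
print(df1["clean_items"].tolist())
# [['364740', '369904'], ['364740', '369904'], ['345262']]
```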
Then you explode the clean_items column and factorize:
df1 = df1.explode('clean_items')
df1["group_key"] = pd.factorize(df1["clean_items"])[0]
and here, instead of going the dict route, you just group by and merge with df2:
df1_group = df1[['clean_items', 'group_key']].groupby(['clean_items', 'group_key']).size().reset_index()
df1_group['clean_items'] = df1_group['clean_items'].astype('int64')
result = df2.merge(df1_group[['clean_items', 'group_key']], left_on='df2_items', right_on='clean_items', how='left')
and your result is
df2_key df2_items clean_items group_key
0 14 364740 364740 0
1 15 369904 369904 1
2 16 345251 345251 2
3 17 345254 345254 3
4 18 345262 345262 4
5 19 345264 345264 5
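Putting the whole first approach together, an end-to-end runnable version (with a sanity check that every df2_key now maps to exactly one group_key) might look like this:

```python
import pandas as pd

df1 = pd.DataFrame({
    "df1_key": [45, 46, 47, 48, 49],
    "df1_items": [
        "364740, 369904",
        "369904, 364740",
        "345251, 345254, 345262, 345264",
        "345262",
        "369904, 364740",
    ],
})
df2 = pd.DataFrame({
    "df2_key": [14, 15, 16, 17, 18, 19],
    "df2_items": [364740, 369904, 345251, 345254, 345262, 345264],
})

# normalize, explode, factorize on the individual items
df1["clean_items"] = df1["df1_items"].str.replace(" ", "").str.split(",").map(sorted)
df1 = df1.explode("clean_items")
df1["group_key"] = pd.factorize(df1["clean_items"])[0]

# one row per unique (item, group_key) pair, then merge into df2
df1_group = (
    df1[["clean_items", "group_key"]]
    .drop_duplicates()
    .astype({"clean_items": "int64"})
)
result = df2.merge(
    df1_group, left_on="df2_items", right_on="clean_items", how="left"
)

# every df2_key now maps to exactly one group_key
assert result.groupby("df2_key")["group_key"].nunique().eq(1).all()
print(result[["df2_key", "df2_items", "group_key"]])
```

The drop_duplicates here plays the same role as the groupby/size step above: both just reduce the exploded frame to one row per unique item.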
where each id appears only once.
The issue you have in df1 is that ids appear in more than one row. So, as an alternative, you can show in which groups each id appears. Here, rather than factorizing on clean_items, you still factorize on df1_items, but then you also apply a groupby, so you can see which ids belong to multiple groups (and which groups they belong to):
# starting again from df1 with the clean_items lists (before exploding)
df1 = df1.explode('clean_items')
# group keys now come from the original (unsorted) strings
df1["group_key"] = pd.factorize(df1["df1_items"])[0]
# aggregate into a set, so each df1_item collects every group it appears in
df1b = df1.groupby(['clean_items'])['group_key'].apply(set).reset_index()
# transform the set into a sorted list for easier handling
df1b['group_key_list'] = df1b.apply(lambda x: sorted(x['group_key']), axis=1)
# transform `clean_items` into an int and merge to df2
df1b['clean_items'] = df1b['clean_items'].astype('int64')
result_b = df2.merge(df1b[['clean_items', 'group_key_list']], left_on='df2_items', right_on='clean_items', how='left')
and your result now is
df2_key df2_items clean_items group_key_list
0 14 364740 364740 [0, 1]
1 15 369904 369904 [0, 1]
2 16 345251 345251 [2]
3 17 345254 345254 [2]
4 18 345262 345262 [2, 3]
5 19 345264 345264 [2]
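If it helps, the set/list steps of this second approach can be collapsed into a single agg call; this is only a stylistic variant of the code above, again shown end to end so it runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({
    "df1_key": [45, 46, 47, 48, 49],
    "df1_items": [
        "364740, 369904",
        "369904, 364740",
        "345251, 345254, 345262, 345264",
        "345262",
        "369904, 364740",
    ],
})
df2 = pd.DataFrame({
    "df2_key": [14, 15, 16, 17, 18, 19],
    "df2_items": [364740, 369904, 345251, 345254, 345262, 345264],
})

df1["clean_items"] = df1["df1_items"].str.replace(" ", "").str.split(",")
df1 = df1.explode("clean_items")
# group keys still come from the original (unsorted) strings
df1["group_key"] = pd.factorize(df1["df1_items"])[0]

# one agg replaces the apply(set) + apply(list) pair
df1b = (
    df1.groupby("clean_items")["group_key"]
    .agg(lambda s: sorted(set(s)))
    .reset_index(name="group_key_list")
)
df1b["clean_items"] = df1b["clean_items"].astype("int64")
result_b = df2.merge(
    df1b, left_on="df2_items", right_on="clean_items", how="left"
)
print(result_b[["df2_key", "df2_items", "group_key_list"]])
```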