I have a data frame df1 with two columns 'ids' and 'names' -
ids names
fhj56 abc
ty67s pqr
yu34o xyz
I have another data frame df2 which has some of the columns being -
user values
1 ['fhj56','fg7uy8']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
My result should give me those users from df2 whose values contains at least one of the ids from df1 and also tell which ids are responsible to put them into resultant table. Result should look like -
user values_responsible names
1 ['fhj56'] ['abc']
3 ['fhj56','ty67s'] ['abc','pqr']
User 2 doesn't come in resultant table because none of its values exist in df1.
I was trying to do it as follows -
df2.query('values in @df1.ids')
But this doesn't seem to work well.
You can iterate through the rows and then use .loc
together with isin
to find the matching rows from df2
. I converted this filtered dataframe into a dictionary
ids = []
names = []
users = []
for _, row in df2.iterrows():
result = df1.loc[df1['ids'].isin(row['values'])]
if not result.empty:
ids.append(result['ids'].tolist())
names.append(result['names'].tolist())
users.append(row['user'])
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 [fhj56] [abc]
1 3 [fhj56, ty67s] [abc, pqr]
Or, for tidy data:
ids = []
names = []
users = []
for _, row in df2.iterrows():
result = df1.loc[df1['ids'].isin(row['values'])]
if not result.empty:
ids.extend(result['ids'].tolist())
names.extend(result['names'].tolist())
users.extend([row['user']] * len(result['ids']))
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']])
user values_responsible names
0 1 fhj56 abc
1 3 fhj56 abc
2 3 ty67s pqr
Try this , using the idea of unnest a list cell.
Temp_unnest = pd.DataFrame([[i, x]
for i, y in df['values'].apply(list).iteritems()
for x in y], columns=list('IV'))
Temp_unnest['user']=Temp_unnest.I.map(df.user)
df1.index=df1.ids
Temp_unnest.assign(names=Temp_unnest.V.map(df1.names)).dropna().groupby('user')['V','names'].agg({(lambda x: list(x))})
Out[942]:
V names
<lambda> <lambda>
user
1 [fhj56] [abc]
3 [fhj56, ty67s] [abc, pqr]
I would refactor your second dataframe (essentially, normalizing your database). Something like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
Then, all you have to do is merge the first and second dataframe on the id column.
r = df2.merge(df1, left_on='id', right_on='ids', how='left')
You can exclude any gids for which some of the ids don't have a matching name.
r[~r[gid].isin( r[r['names'] == None][gid].unique() )]
where r[r['names'] == None][gid].unique()
finds all the gids that have no name and then r[~r[gid].isin( ... )]
grabs only entries that aren't in the list argument for isin
.
If you had more id groups, the second table might look like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
1 2 '1asdf3'
1 2 '7ada2a'
1 2 'asd341'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
which would be equivalent to
user values
1 ['fhj56','fg7uy8']
1 ['1asdf3', '7ada2a', 'asd341']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.