Extracting specific rows from a data frame
I have a data frame df1 with two columns, 'ids' and 'names':
ids names
fhj56 abc
ty67s pqr
yu34o xyz
I have another data frame df2, some of whose columns are:
user values
1 ['fhj56','fg7uy8']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
My result should give me the users from df2 whose values contain at least one of the ids from df1, and also tell me which ids are responsible for putting them into the resulting table. The result should look like:
user values_responsible names
1 ['fhj56'] ['abc']
3 ['fhj56','ty67s'] ['abc','pqr']
User 2 doesn't appear in the resulting table because none of its values exist in df1.
I was trying to do it as follows:
df2.query('values in @df1.ids')
But this doesn't seem to work well.
You can iterate through the rows of df2 and use .loc together with isin to find the rows of df1 whose ids appear in that row's values. I collect the matches in lists and then build the result data frame from a dictionary.
import pandas as pd

ids = []
names = []
users = []
for _, row in df2.iterrows():
    # rows of df1 whose id occurs in this user's list of values
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.append(result['ids'].tolist())
        names.append(result['names'].tolist())
        users.append(row['user'])
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 [fhj56] [abc]
1 3 [fhj56, ty67s] [abc, pqr]
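To try the loop above end to end, here is a minimal sketch with the sample data filled in. It assumes the values column holds real Python lists (not their string representation), and the helper name match_users is only for illustration.

```python
import pandas as pd

# Sample data from the question; 'values' is assumed to hold Python lists.
df1 = pd.DataFrame({'ids': ['fhj56', 'ty67s', 'yu34o'],
                    'names': ['abc', 'pqr', 'xyz']})
df2 = pd.DataFrame({'user': [1, 2, 3],
                    'values': [['fhj56', 'fg7uy8'],
                               ['glao0', 'rt56yu', 're23u'],
                               ['fhj56', 'ty67s', 'hgjl09']]})


def match_users(lookup, users):
    """Hypothetical helper: one output row per user with at least one known id."""
    rows = []
    for _, row in users.iterrows():
        matched = lookup.loc[lookup['ids'].isin(row['values'])]
        if not matched.empty:
            rows.append({'user': row['user'],
                         'values_responsible': matched['ids'].tolist(),
                         'names': matched['names'].tolist()})
    return pd.DataFrame(rows, columns=['user', 'values_responsible', 'names'])


result = match_users(df1, df2)
```

Note that the matched ids come back in df1's order, not in the order they appear inside each list.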
Or, for tidy data:
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.extend(result['ids'].tolist())
        names.extend(result['names'].tolist())
        # repeat the user once per matched id
        users.extend([row['user']] * len(result['ids']))
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 fhj56 abc
1 3 fhj56 abc
2 3 ty67s pqr
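If a loop-free version is preferred, the same tidy table can probably be produced with DataFrame.explode followed by an inner merge. This is a different technique from the loop above, sketched under the assumption that values holds Python lists and that pandas is 1.1 or newer (for ignore_index). The explicit sort is there only to make the row order deterministic.

```python
import pandas as pd

df1 = pd.DataFrame({'ids': ['fhj56', 'ty67s', 'yu34o'],
                    'names': ['abc', 'pqr', 'xyz']})
df2 = pd.DataFrame({'user': [1, 2, 3],
                    'values': [['fhj56', 'fg7uy8'],
                               ['glao0', 'rt56yu', 're23u'],
                               ['fhj56', 'ty67s', 'hgjl09']]})

tidy = (df2.explode('values', ignore_index=True)        # one row per (user, id)
           .rename(columns={'values': 'values_responsible'})
           # the inner merge drops ids that are not present in df1
           .merge(df1, left_on='values_responsible', right_on='ids', how='inner')
           .sort_values(['user', 'values_responsible'])
           .reset_index(drop=True)
           [['user', 'values_responsible', 'names']])
```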
Try this, using the idea of unnesting a list cell:
# one row per (original row label I, single value V)
Temp_unnest = pd.DataFrame([[i, x]
                            for i, y in df2['values'].apply(list).items()
                            for x in y], columns=list('IV'))
Temp_unnest['user'] = Temp_unnest.I.map(df2.user)
df1.index = df1.ids
Temp_unnest.assign(names=Temp_unnest.V.map(df1.names)).dropna().groupby('user')[['V', 'names']].agg(list)
Out[942]:
                   V       names
user
1            [fhj56]       [abc]
3     [fhj56, ty67s]  [abc, pqr]
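On recent pandas the manual unnesting step can likely be replaced by DataFrame.explode. A minimal sketch of that variant, assuming pandas 1.1 or newer and list-valued cells; the names flat and out are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'ids': ['fhj56', 'ty67s', 'yu34o'],
                    'names': ['abc', 'pqr', 'xyz']})
df2 = pd.DataFrame({'user': [1, 2, 3],
                    'values': [['fhj56', 'fg7uy8'],
                               ['glao0', 'rt56yu', 're23u'],
                               ['fhj56', 'ty67s', 'hgjl09']]})

# Unnest: one row per (user, value)
flat = df2.explode('values', ignore_index=True)
# Look each value up in df1; values unknown to df1 become NaN
flat['names'] = flat['values'].map(df1.set_index('ids')['names'])

out = (flat.dropna(subset=['names'])
           .groupby('user')[['values', 'names']]
           .agg(list)                                   # collect back into lists
           .rename(columns={'values': 'values_responsible'})
           .reset_index())
```

Unlike the version above, this does not overwrite df1's index in place.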
I would refactor your second dataframe (essentially, normalizing your database). Something like:
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
Then, all you have to do is merge the first and second dataframes on the id column.
r = df2.merge(df1, left_on='id', right_on='ids', how='left')
You can exclude any gids for which some of the ids don't have a matching name.
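r[~r['gid'].isin(r[r['names'].isna()]['gid'].unique())]
where r[r['names'].isna()]['gid'].unique() finds all the gids that contain an id with no name, and r[~r['gid'].isin( ... )] keeps only the rows whose gid is not in the list passed to isin. A comparison with == None never matches the NaN that the left merge writes for a missing name, hence isna(). Note that this treats gid as a global group key; where gid restarts at 1 for each user, as in the tables here, the (user, gid) pair has to be compared instead.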
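As a rough illustration of the merge and of the two possible filters, here is a sketch on the normalized sample table. It assumes groups are identified by the (user, gid) pair, and the variable names has_name, matched, and fully_matched are mine. Keeping only rows that found a name gives what the question asks for (users with at least one known id), while requiring every id in a group to match is stricter and leaves nothing on this sample data.

```python
import pandas as pd

df1 = pd.DataFrame({'ids': ['fhj56', 'ty67s', 'yu34o'],
                    'names': ['abc', 'pqr', 'xyz']})
# Normalized second table: one row per (user, gid, id)
df2 = pd.DataFrame({
    'user': [1, 1, 2, 2, 2, 3, 3, 3],
    'gid': [1, 1, 1, 1, 1, 1, 1, 1],
    'id': ['fhj56', 'fg7uy8', 'glao0', 'rt56yu', 're23u',
           'fhj56', 'ty67s', 'hgjl09'],
})

# Left merge: ids unknown to df1 get NaN in 'ids' and 'names'
r = df2.merge(df1, left_on='id', right_on='ids', how='left')
has_name = r['names'].notna()

# Rows whose id was found in df1 (at least one match per surviving user)
matched = r[has_name]

# Stricter: keep a (user, gid) group only if every id in it has a name
all_match = has_name.groupby([r['user'], r['gid']]).transform('all')
fully_matched = r[all_match]
```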
If you had more id groups, the second table might look like:
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
1 2 '1asdf3'
1 2 '7ada2a'
1 2 'asd341'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
which would be equivalent to
user values
1 ['fhj56','fg7uy8']
1 ['1asdf3', '7ada2a', 'asd341']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
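Going from the list-valued form to the normalized form could be sketched as follows. It assumes gid simply numbers each user's lists in their order of appearance, which matches the tables above but is my reading rather than something stated; wide and normalized are illustrative names.

```python
import pandas as pd

# List-valued form: user 1 has two separate id groups
wide = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'values': [['fhj56', 'fg7uy8'],
               ['1asdf3', '7ada2a', 'asd341'],
               ['glao0', 'rt56yu', 're23u'],
               ['fhj56', 'ty67s', 'hgjl09']],
})

# Number the groups within each user: 1, 2, ... in order of appearance
wide['gid'] = wide.groupby('user').cumcount() + 1

# One row per id, carrying its user and group number along
normalized = (wide.explode('values', ignore_index=True)
                  .rename(columns={'values': 'id'})
                  [['user', 'gid', 'id']])
```

Going the other way, normalized.groupby(['user', 'gid'])['id'].agg(list) should rebuild the lists.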