How to check if all the elements of a list in one pandas column are present in another pandas column
I have lists in one column of a dataframe df1, and I want to check, for each row, whether all elements of that list are present in another column that is in a second dataframe df2.
The two dataframes are something like this:
df1                        df2
 id | members               num   | available
 1  | ['a','b']             one   | ['a','b','c','d','e']
 2  | ['b']                 two   | ['a','b']
 3  | ['a','b','c']         three | ['b','d','e']
I am trying to come up with a method that can tell me, for each row in df1, which rows in df2 have all the elements of members. Maybe something that yields this:
 id | members       | which_cols
 1  | ['a','b']     | ['one','two']
 2  | ['b']         | ['one','two','three']
 3  | ['a','b','c'] | ['one']
I thought converting them into dictionaries like {k: list(v) for k,v in df1.groupby("id")["members"]} and {i: list(j) for i,j in df2.groupby("num")["available"]} might make it more flexible to achieve the desired output, but I still haven't found a method that gets me what I'm looking for.
df2 will have about 300 rows, with the length of available being as large as 25,000. And df1 can be as big as 1M rows, with list length in members up to 15. So I think efficiency will also be important.
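For concreteness, here is a brute-force sketch of the output I'm after, using plain Python sets for the subset check (frame and column names as above; at 1M rows this is presumably too slow, which is why I'm asking):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3],
                    "members": [["a", "b"], ["b"], ["a", "b", "c"]]})
df2 = pd.DataFrame({"num": ["one", "two", "three"],
                    "available": [["a", "b", "c", "d", "e"],
                                  ["a", "b"],
                                  ["b", "d", "e"]]})

# Precompute each available list as a set, then subset-test every
# members list against every row of df2.
avail = list(zip(df2["num"], df2["available"].map(set)))
df1["which_cols"] = df1["members"].map(
    lambda m: [num for num, s in avail if set(m) <= s])

print(df1["which_cols"].tolist())
# [['one', 'two'], ['one', 'two', 'three'], ['one']]
```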
The core of the problem lies in your data setup. If you do a bit of preprocessing, you can avoid tediously iterating through every list multiple times over.
import pandas as pd

df1 = pd.Series([['a', 'b'], ['b'], ['a', 'b', 'c']], name='members').to_frame()
df2 = pd.Series([['a', 'b', 'c', 'd', 'e'],
                 ['a', 'b'],
                 ['b', 'd', 'e']], name='available').to_frame()
df2.index = ['one', 'two', 'three']
>>> df1
members
0 ['a', 'b']
1 ['b']
2 ['a', 'b', 'c']
>>> df2
available
one    ['a', 'b', 'c', 'd', 'e']
two ['a', 'b']
three ['b', 'd', 'e']
If you one-hot encode your data before working with it, you put yourself at a great advantage for doing subset checks:
# You can do this many ways, but sklearn makes this very easy with:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = df1.join(pd.DataFrame(mlb.fit_transform(df1.pop('members')),
                            columns=mlb.classes_, index=df1.index))

mlb = MultiLabelBinarizer()
df2 = df2.join(pd.DataFrame(mlb.fit_transform(df2.pop('available')),
                            columns=mlb.classes_, index=df2.index))
>>> df1
a b c
0 1 1 0
1 0 1 0
2 1 1 1
>>> df2
a b c d e
one 1 1 1 1 1
two 1 1 0 0 0
three 0 1 0 1 1
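One caveat worth flagging (not in the original answer): fitting two separate binarizers only works out if every label in members also occurs in available; otherwise df1 ends up with columns that df2 lacks, and the df2[df1.columns] lookup further down raises a KeyError. A defensive variant, sketched here with the same data, fits a single binarizer on the union of labels so both frames share identical columns:

```python
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

members = pd.Series([['a', 'b'], ['b'], ['a', 'b', 'c']])
available = pd.Series([['a', 'b', 'c', 'd', 'e'], ['a', 'b'], ['b', 'd', 'e']],
                      index=['one', 'two', 'three'])

# Fit once on all label lists so both encodings share the same columns.
mlb = MultiLabelBinarizer()
mlb.fit(list(members) + list(available))
enc1 = pd.DataFrame(mlb.transform(members), columns=mlb.classes_)
enc2 = pd.DataFrame(mlb.transform(available), columns=mlb.classes_,
                    index=available.index)
```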
The clever thing about this data format is that now you can subtract df1 from df2, and if none of the resulting values is -1 (a -1 would indicate an element missing from df2), then you add that index to the list. Think of this as overlaying the two dataframes (aligning each resource) and then subtracting. And of course, this can be vectorized:
>>> df1.apply(lambda row: df2.index[((df2[df1.columns] - row) >= 0).all(axis = 1)], axis = 1)
0 Index(['one', 'two'], dtype='object')
1 Index(['one', 'two', 'three'], dtype='object')
2 Index(['one'], dtype='object')
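As a side note, if the row-wise apply above proves too slow at 1M rows, the same subset test can be pushed fully into NumPy: row i of df1 is a subset of row j of df2 exactly when the dot product of df1's row with the complement of df2's row is zero. This is a sketch under the assumption that df1's columns are a subset of df2's (as in the example):

```python
import numpy as np
import pandas as pd

# The one-hot frames as produced above.
df1 = pd.DataFrame([[1, 1, 0], [0, 1, 0], [1, 1, 1]], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[1, 1, 1, 1, 1], [1, 1, 0, 0, 0], [0, 1, 0, 1, 1]],
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['one', 'two', 'three'])

m1 = df1.to_numpy()                 # shape (n1, k)
m2 = df2[df1.columns].to_numpy()    # shape (n2, k), restricted to df1's labels
# A positive entry in m1 @ (1 - m2).T means that df1 row needs a
# label that the corresponding df2 row lacks.
ok = (m1 @ (1 - m2).T) == 0         # (n1, n2) boolean subset matrix
result = [df2.index[row].tolist() for row in ok]
print(result)  # [['one', 'two'], ['one', 'two', 'three'], ['one']]
```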