How to check if all the elements in list are present in pandas column
I have a dataframe and a list:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8],
                   'char':[['a','b'],['a','b','c'],['a','c'],['b','c'],[],['c','a','d'],['c','d'],['a']]})
names = ['a','c']
I want to get a row only if both a and c are present in its char column. (Order does not matter here.)
Expected output:
char id
1 [a, b, c] 2
2 [a, c] 3
5 [c, a, d] 6
My attempt:
true_indices = []
for idx, row in df.iterrows():
    if all(name in row['char'] for name in names):
        true_indices.append(idx)
ids = df[df.index.isin(true_indices)]
This gives me the correct output, but it is too slow for large datasets, so I am looking for a more efficient solution.
Use pd.Series.apply (df['char'] is a Series):
df[df['char'].apply(lambda x: set(names).issubset(x))]
Output:
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
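You can build a set from the list of names for faster lookup, and use set.issubset to check whether every element of the set is contained in each list of the column: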
names = set(['a','c'])
df[df['char'].map(names.issubset)]
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
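One detail worth spelling out: set.issubset accepts any iterable, so the lists in the column do not need to be converted to sets, and the bound method can be handed straight to Series.map. A minimal sketch reusing the data from the question; the helper name rows_containing_all is hypothetical and not part of pandas:

```python
import pandas as pd

def rows_containing_all(df, column, required):
    """Return the rows of df whose list in `column` contains every item of `required`.

    Hypothetical helper for illustration only.
    """
    required = set(required)  # build the set once, not once per row
    # set.issubset accepts any iterable, so each list is passed as-is
    mask = df[column].map(required.issubset)
    return df[mask]

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                   'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'],
                            [], ['c', 'a', 'd'], ['c', 'd'], ['a']]})
result = rows_containing_all(df, 'char', ['a', 'c'])
print(result)
```

Because the filter returns a new frame, the original df is left untouched and can be filtered again with a different list of names.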
Use a list comprehension with issubset:
mask = [set(names).issubset(x) for x in df['char']]
df = df[mask]
print (df)
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Another solution with Series.map:
df = df[df['char'].map(set(names).issubset)]
print (df)
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Performance depends on the number of rows and the number of matching values:
df = pd.concat([df] * 10000, ignore_index=True)
In [270]: %timeit df[df['char'].apply(lambda x: set(names).issubset(x))]
45.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [271]: %%timeit
...: names = set(['a','c'])
...: [names.issubset(set(row)) for _,row in df.char.iteritems()]
...:
46.7 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [272]: %%timeit
...: df[[set(names).issubset(x) for x in df['char']]]
...:
45.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [273]: %%timeit
...: df[df['char'].map(set(names).issubset)]
...:
18.3 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [274]: %%timeit
...: n = set(names)
...: df[df['char'].map(n.issubset)]
...:
16.6 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [279]: %%timeit
...: names = set(['a','c'])
...: m = [names.issubset(i) for i in df.char.values.tolist()]
...:
19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
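The timings above come from an IPython session and will vary with hardware and pandas version. To repeat the comparison outside IPython, something like the following sketch with the standard timeit module could be used; the labels, variable names, and repeat counts are arbitrary choices for illustration, and no particular result is implied:

```python
import timeit
import pandas as pd

small = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                      'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'],
                               [], ['c', 'a', 'd'], ['c', 'd'], ['a']]})
big = pd.concat([small] * 10000, ignore_index=True)  # 80,000 rows, as in the answer
names = ['a', 'c']
required = set(names)

# Three of the approaches from the answers, wrapped as zero-argument callables
candidates = {
    'apply + lambda': lambda: big[big['char'].apply(lambda x: set(names).issubset(x))],
    'list comprehension': lambda: big[[required.issubset(x) for x in big['char']]],
    'map + bound method': lambda: big[big['char'].map(required.issubset)],
}

expected = candidates['map + bound method']()
timings = {}
for label, func in candidates.items():
    # Check that every approach selects the same rows before timing it
    assert func().equals(expected)
    # Best of 3 repeats, 5 calls each, reported per call
    timings[label] = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f'{label}: {timings[label] * 1000:.1f} ms per call')
```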
Try this. Note that it overwrites the char column of df:
import numpy as np
df['char'] = df['char'].apply(lambda x: x if ("a" in x and "c" in x) else np.nan)
print(df.dropna())
Output:
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
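The snippet above hard-codes "a" and "c" and replaces non-matching lists in df with NaN. If the original frame should stay intact, one possible variation (a sketch, not the answerer's code; the names mask and filtered are illustrative) builds a boolean mask from the names list instead:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                   'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'],
                            [], ['c', 'a', 'd'], ['c', 'd'], ['a']]})
names = ['a', 'c']

# True where every name occurs in the row's list; df itself is not modified
mask = df['char'].apply(lambda chars: all(name in chars for name in names))
filtered = df[mask]
print(filtered)
```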