![](/img/trans.png)
[英]Python DataFrame - Select dataframe rows based on values in another dataframe
[英]Select rows of a dataframe based on another dataframe in Python
我有以下數據框:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df1)
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
我希望通過df2
選擇df1
中的行,如下所示:
df2 = pd.DataFrame({'A': 'foo bar'.split(),
'B': 'one two'.split()
})
print(df2)
A B
0 foo one
1 bar two
這是我在 Python 中嘗試過的方法,但我只是想知道是否還有另一種方法。 謝謝。
df = df1.merge(df2, on=['A','B'])
print(df)
這是預期的輸出。
A B C D
0 foo one 0 0
1 bar two 5 10
2 foo one 6 12
最簡單的是將merge
與內部連接一起使用。
過濾的另一種解決方案:
arr = [np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('r')]
df = df1[np.array(arr).any(axis=0)]
print(df)
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
或者使用Index.isin
創建MultiIndex
和過濾器:
df = df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
print(df)
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
方法#4。 .apply
+ key
功能:
>>> key = lambda row: (row.A, row.B)
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
方法#5。 .join
:
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
A B C D
0 foo one 0 0
6 foo one 6 12
5 bar two 5 10
已經提到的方法:
.merge
by @ahbon.to_dict('records')
過濾(最快)Index.isin
的Index.isin性能比較(最快到最慢):
>>> %%timeit
>>> df1[np.array([np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]).any(axis=0)]
1.62 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> key = lambda row: (row.A, row.B)
>>> %%timeit
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
2.96 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1.merge(df2, on=['A','B'])
3.15 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
3.97 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
6.55 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The .merge method performs an inner join by default.
# The resulting dataframe will only have rows where the
# merge column value exists in both dataframes
x = df_only_english.merge(train_orders.assign(id=train_orders.id))
x
Unnamed: 0 language score id iso_language_name is_en cell_order
0 0 en 0.999998 00015c83e2717b English English 2e94bd7a 3e99dee9 b5e286ea da4f7550 c417225b 51e3cd89 2600b4eb 75b65993 cf195f8b 25699d02 72b3201a f2c750d3 de148b56...
1 1 en 0.999999 0001bdd4021779 English English 3fdc37be 073782ca 8ea7263c 80543cd8 38310c80 073e27e5 015d52a4 ad7679ef 7fde4f04 07c52510 0a1a7a39 0bcd3fef 58bf360b
2 2 en 0.999996 0001daf4c2c76d English English 97266564 a898e555 86605076 76cc2642 ef279279 df6c939f 2476da96 00f87d0a ae93e8e6 58aadb1d d20b0094 986fd4f1 b4ff1015...
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.