Using a pandas DataFrame as a lookup table

Question

Given a single row from DataFrame X, what is the most efficient way to retrieve all rows from DataFrame Y that completely match the query row?

Example: querying row [0,1,0,1] from

[
 [0,1,0,1, 1.0],
 [0,1,0,1, 2.0],
 [0,1,0,0, 3.0],
 [1,1,0,0, 0.5],
]

should return

[
 [0,1,0,1, 1.0],
 [0,1,0,1, 2.0],
]

X and Y are assumed to have the same schema, except that Y has an additional target value column. There may be zero, one, or many matches. The solution should be efficient even with thousands of columns.

Answer 1

Use boolean indexing:

import pandas as pd

L = [
 [0,1,0,1, 1.0],
 [0,1,0,1, 2.0],
 [0,1,0,0, 3.0],
 [1,1,0,0, 0.5],
]
df = pd.DataFrame(L)

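# Query row: the values of one row of X (no target column)
Y = [0,1,0,1]

# Keep only the rows whose first len(Y) columns all equal the query
print(df[df.iloc[:, :len(Y)].eq(Y).all(axis=1)])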

   0  1  2  3    4
0  0  1  0  1  1.0
1  0  1  0  1  2.0

Explanation:

First, select the first N columns, where N is the length of the query sequence:

print (df.iloc[:, :len(Y)])
   0  1  2  3
0  0  1  0  1
1  0  1  0  1
2  0  1  0  0
3  1  1  0  0

Then compare every row of that selection with the query sequence, element by element, using eq:

print (df.iloc[:, :len(Y)].eq(Y))
       0     1     2      3
0   True  True  True   True
1   True  True  True   True
2   True  True  True  False
3  False  True  True  False

Finally, use DataFrame.all with axis=1 to check whether every value in a row is True. The resulting boolean Series is the mask used for indexing:

print (df.iloc[:, :len(Y)].eq(Y).all(axis=1))
0     True
1     True
2    False
3    False
dtype: bool
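The steps above can be wrapped in a small helper that takes the query as an actual row of X and matches by column label instead of by position. This is only a sketch: the function name lookup_rows, the column labels a to d, and the name target are made up for illustration and do not come from the question.

```python
import pandas as pd

def lookup_rows(table: pd.DataFrame, query: pd.Series) -> pd.DataFrame:
    """Return the rows of `table` that equal `query` on every label in `query`."""
    # Compare only the columns that the query row actually has.
    key_cols = list(query.index)
    # eq aligns the Series index with the column labels;
    # all(axis=1) is True only where every key column matches.
    mask = table[key_cols].eq(query).all(axis=1)
    return table[mask]

# Hypothetical labels: X holds the key columns, Y adds a target column.
X = pd.DataFrame([[0, 1, 0, 1]], columns=list("abcd"))
Y = pd.DataFrame(
    [[0, 1, 0, 1, 1.0],
     [0, 1, 0, 1, 2.0],
     [0, 1, 0, 0, 3.0],
     [1, 1, 0, 0, 0.5]],
    columns=list("abcd") + ["target"],
)

matches = lookup_rows(Y, X.iloc[0])
print(matches)
```

With this data the helper selects the rows labeled 0 and 1. If nothing matches, the mask is all False and the result is an empty DataFrame with the same columns.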

Answer 2

I'd go with merge:

import pandas as pd

y = pd.DataFrame({'A': [1, 1, 3],
                  'B': list('aac'),
                  'C': list('ddf'),
                  'D': [4, 5, 6]})

x = pd.DataFrame([[1, 'a', 'd']],
                 columns=list('ABC'))

match = x.merge(y, on=x.columns.tolist())

match
#   A  B  C  D
#0  1  a  d  4
#1  1  a  d  5
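Two properties of the merge approach are worth checking before relying on it: what happens when nothing matches, and what happens to the row labels of y. The sketch below reuses the y frame from this answer; the frames x_miss and x_hit are made-up queries for illustration.

```python
import pandas as pd

y = pd.DataFrame({'A': [1, 1, 3],
                  'B': list('aac'),
                  'C': list('ddf'),
                  'D': [4, 5, 6]})

# A query row with no counterpart in y: the default inner join
# returns an empty frame rather than raising.
x_miss = pd.DataFrame([[2, 'a', 'd']], columns=list('ABC'))
no_match = x_miss.merge(y, on=x_miss.columns.tolist())
print(no_match.empty)

# merge builds a new index for its result, so y's original row labels
# are lost. reset_index() copies them into a column named 'index'
# so that they survive the join.
x_hit = pd.DataFrame([[3, 'c', 'f']], columns=list('ABC'))
hit = x_hit.merge(y.reset_index(), on=x_hit.columns.tolist())
print(hit['index'].tolist())
```

Carrying the labels along this way is one option when the matched rows need to be traced back to their positions in y; if only the values matter, the plain merge is enough.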

Answer 3

One efficient way is to drop down to numpy and query individual columns:

Data from @jezrael.

import pandas as pd, numpy as np

df = pd.DataFrame({'A':list('abadef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,7,4,2,3],
                   'D':[1,3,1,7,1,0],
                   'E':[5,3,5,9,2,4],
                   'F':list('aaabbb')})

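vals = df.values
arr = [4, 7, 1, 5]

# Column 0 ('A') is not part of the query, so arr[i] is compared with column i+1.
# logical_and.reduce combines the per-column masks into a single row mask.
mask = np.logical_and.reduce([vals[:, i+1]==arr[i] for i in range(len(arr))])
res = df.iloc[np.where(mask)[0]]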

print(res)

#    A  B  C  D  E  F
# 0  a  4  7  1  5  a
# 2  a  4  7  1  5  a
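As a variation on the same idea, the per-column loop can be replaced by one broadcast comparison when the key columns are all numeric. This is a sketch, not a benchmarked improvement: the list key_cols is introduced here for illustration, and whether it is faster than the loop depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abadef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 7, 4, 2, 3],
                   'D': [1, 3, 1, 7, 1, 0],
                   'E': [5, 3, 5, 9, 2, 4],
                   'F': list('aaabbb')})

arr = np.array([4, 7, 1, 5])
key_cols = ['B', 'C', 'D', 'E']  # assumed: the columns the query refers to

# Selecting only the numeric key columns gives an integer array,
# whereas df.values on the mixed-type frame is an object array.
keys = df[key_cols].to_numpy()

# arr has shape (4,) and broadcasts against every row of the (6, 4) array.
mask = (keys == arr).all(axis=1)
res = df[mask]
print(res)
```

For this data the mask selects the rows labeled 0 and 2, the same rows as the loop-based version.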

3 answers

Answer 1: 2, 2018-04-24 13:07:43
Answer 2: 1, accepted, 2018-04-24 13:12:24
Answer 3: 1, 2018-04-24 13:19:09
