Pandas and NumPy Fancy Indexing
I am having a hard time with the following. I have a pandas N x D dataframe called p with some missing (NaN) values. I have another corresponding array a, indexed by D x K x T. I want to map every entry (n, d) of the data frame to a[d][k][p[n,d]] for all possible k, resulting in an N x D x K matrix. Can I have some help as to how to do this most efficiently with the pandas and NumPy libraries?
I actually then take the product of that final matrix along the D columns, skipping the missing entries, which leaves an N x K matrix. The final output can be (slowly) reproduced by the following:
def generate_entry(i, j):
    # Product over features s of alpha[s][j][p[i, s]], skipping missing entries
    result = np.prod([alpha[s][j][int(p.loc[i][s])]
                      for s in range(num_features)
                      if not np.isnan(p.loc[i][s])])
    return result

vgenerate_entry = np.vectorize(generate_entry)
result = np.fromfunction(vgenerate_entry, shape=(len(p), k), dtype=int)
I think some use of pandas.get_dummies would be helpful for turning this into a matrix multiplication, but I can't quite figure that out.
The following is much faster:
r = None
for i in range(num_features):
    # One-hot encode feature i, then pick out the matching probabilities
    rel_data = pd.get_dummies(data.iloc[:, i])
    rel_probs = alpha[i].T
    prod = rel_data.dot(rel_probs)
    # Rows with a missing value come out as all zeros; make them neutral
    prod[prod == 0] = 1
    if r is None:
        r = prod
    else:
        r = r.multiply(prod)
r = r.to_numpy()
r = r * pi
posteriers = r / np.sum(r, axis=1)[:, np.newaxis]
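One thing to watch in the loop above: the dot product only lines up if the one-hot matrix has exactly T columns, and `prod[prod == 0] = 1` would also overwrite genuine zero probabilities. The following is a minimal sketch of one way around both, not the poster's method. The names `alpha_i` and `col` and all sizes are made up for illustration; it assumes the feature values are integer codes 0..T-1 stored as floats.

```python
import numpy as np
import pandas as pd

T, K = 4, 3                                  # hypothetical sizes
rng = np.random.default_rng(0)
alpha_i = rng.random((K, T))                 # stand-in for alpha[i], shape (K, T)
col = pd.Series([2.0, np.nan, 0.0, 2.0])     # stand-in for one feature column

# Declare all T categories up front so the one-hot matrix always has
# T columns, even for values that never occur in the data.
cat = pd.Series(pd.Categorical(col, categories=np.arange(T, dtype=float)))
onehot = pd.get_dummies(cat).to_numpy(dtype=float)   # (4, T); the NaN row is all zeros

prod = onehot @ alpha_i.T                    # (4, K); row n is alpha_i[:, col[n]]

# Reset only the rows that were actually missing, using the NaN mask
# rather than testing the product for zeros.
prod[col.isna().to_numpy()] = 1.0
```

Using the NaN mask keeps a true probability of 0 intact, which matters if `alpha` can contain zeros.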
Here's one approach to index into the NumPy array a with the pandas dataframe p that has NaNs. The NaN positions cannot be used as indices, so we index with a placeholder there and then fill in some value fillval at those places -
def fancy_indexing_avoid_NaNs(p, a, fillval=1):
    # Sizes of the first two axes of 'a', which has shape (D, K, T)
    D, K = a.shape[:2]
    # Extract values from p and get NaN mask
    pv = p.values
    mask = np.isnan(pv)
    # Get int version, replacing NaNs with some number, say 0
    p_idx = np.where(mask, 0, pv).astype(int)
    # FANCY-INDEX into array 'a' with those indices from p; the index
    # arrays broadcast to shape (K, N, D)
    a_indexed_vals = a[np.arange(D), np.arange(K)[:,None,None], p_idx]
    # FANCY-INDEX once more to overwrite the values at NaN positions with
    # fillval, so that in the prod-reduction later on they have no effect
    a_indexed_vals[np.arange(K)[:,None,None], mask] = fillval
    return a_indexed_vals
That fillval would be application dependent. In this case, we are using prod, so fillval=1 makes sense, because multiplying by 1 doesn't affect the results.
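To illustrate why 1 is the neutral fill value for a product, here is a small sketch with made-up numbers. It also shows an alternative that is not part of the approach above: `np.prod` accepts a `where` argument (in reasonably recent NumPy versions) that skips the masked entries directly.

```python
import numpy as np

vals = np.array([[2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0]])
mask = np.array([[False, True, False],     # True marks a "missing" entry
                 [False, False, True]])

# Option 1: overwrite masked entries with 1, then reduce.
filled = np.where(mask, 1.0, vals).prod(axis=1)

# Option 2: leave the values alone and tell prod which entries to use.
skipped = np.prod(vals, axis=1, where=~mask)

print(filled)    # products 2*4 and 5*6
print(skipped)   # same values as above
```

For a sum-reduction the same reasoning would suggest a fill value of 0 instead.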
Original approach posted by OP -
def generate_entry(i, j):
    result = np.prod([a[s][j][int(p.loc[i][s])] for s in range(D)
                      if not np.isnan(p.loc[i][s])])
    return result

vgenerate_entry = np.vectorize(generate_entry)
Sample run -
In [154]: N,D,K,T = 3,4,5,6
...: a = np.random.randint(0,5,(D,K,T))
...:
...: p = pd.DataFrame(np.random.randint(0,T,(N,D)).astype(float))
...: p.iloc[2,3] = np.nan
...: p.iloc[1,2] = np.nan
...:
In [155]: result = np.fromfunction(vgenerate_entry, shape=(len(p), K), dtype=int)
In [156]: a_indexed_vals = fancy_indexing_avoid_NaNs(p, a)
In [157]: out = a_indexed_vals.prod(2).T
In [158]: np.allclose(out, result)
Out[158]: True
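To see why the fancy-indexing step inside fancy_indexing_avoid_NaNs produces a K x N x D array, it may help to look at the shapes of the three index arrays. This is a minimal sketch with made-up sizes and a NaN-free integer index array `p_idx`; the explicit loop is only there as a reference to compare against.

```python
import numpy as np

N, D, K, T = 3, 4, 5, 6                      # hypothetical sizes
a = np.arange(D * K * T).reshape(D, K, T)
rng = np.random.default_rng(0)
p_idx = rng.integers(0, T, size=(N, D))      # integer indices, no NaNs here

d_idx = np.arange(D)                         # shape (D,)      -> indexes axis 0 of a
k_idx = np.arange(K)[:, None, None]          # shape (K, 1, 1) -> indexes axis 1 of a
# p_idx has shape (N, D) and indexes axis 2 of a.
# The three index arrays broadcast together to shape (K, N, D).
out = a[d_idx, k_idx, p_idx]

# Reference: out[k, n, d] should equal a[d, k, p_idx[n, d]]
expected = np.empty((K, N, D), dtype=a.dtype)
for k in range(K):
    for n in range(N):
        for d in range(D):
            expected[k, n, d] = a[d, k, p_idx[n, d]]

print(out.shape, np.array_equal(out, expected))
```

Because the result is laid out as (K, N, D), the sample run reduces over axis 2 and transposes to get the N x K output.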