Pandas and Numpy Fancy Indexing

I am having a hard time with the following. I have a pandas N x D dataframe called p with some missing (NaN) values. I have another corresponding array a indexed by D x K x T. I want to map every entry n,d of the data frame to a[d][k][p[n,d]] for all possible k, resulting in an N x D x K matrix. Can I have some help as to how to do this most efficiently with the Pandas and Numpy libraries?

I then take that N x D x K result and take the product along the D axis, leaving an N x K matrix. The final output can be (slowly) reproduced by the following:

    import numpy as np

    def generate_entry(i, j):
        # Product over features s of alpha[s][j][p[i, s]], skipping NaN entries
        return np.prod([alpha[s][j][int(p.iloc[i, s])]
                        for s in range(num_features)
                        if not np.isnan(p.iloc[i, s])])

    vgenerate_entry = np.vectorize(generate_entry)
    result = np.fromfunction(vgenerate_entry, shape=(len(p), k), dtype=int)

I think some use of pandas.get_dummies would be helpful for turning this into a matrix multiplication, but I can't quite figure that out.

The following is much faster:

    r = None
    for i in range(num_features):
        # One-hot encode feature i; rows with NaN become all zeros
        rel_data = pd.get_dummies(data.iloc[:, i])
        rel_probs = alpha[i].T              # shape (T, K)
        prod = rel_data.dot(rel_probs)      # shape (N, K)
        prod[prod == 0] = 1                 # neutral factor for the product
        if r is None:
            r = prod
        else:
            r = r.multiply(prod)

    r = r.to_numpy()
    r = r * pi
    posteriers = r / np.sum(r, axis=1)[:, np.newaxis]
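
One caveat with the get_dummies loop above: get_dummies only creates a column for values that actually occur in the data, so the dot product fails (or misaligns) when some value in 0..T-1 never appears in a column. Also, `prod == 0` cannot tell a missing entry apart from a genuine zero in alpha. A minimal sketch of one way around both, assuming p holds float codes in 0..T-1 with NaN for missing entries and alpha has shape (D, K, T); the helper name `onehot_product` is hypothetical and not from the original post:

```python
import numpy as np
import pandas as pd

def onehot_product(p, alpha):
    """Return an (N, K) array of prod_d alpha[d, k, p[n, d]], skipping NaNs.

    Hypothetical helper: p is an (N, D) DataFrame of float codes in 0..T-1
    (NaN = missing), alpha is a (D, K, T) array.
    """
    D, K, T = alpha.shape
    # Fix the categories to 0..T-1 so every dummy matrix has exactly T columns,
    # even when some value never occurs in a given column.
    cat_dtype = pd.CategoricalDtype(categories=np.arange(T, dtype=float))
    out = np.ones((len(p), K))
    for d in range(D):
        col = p.iloc[:, d]
        onehot = pd.get_dummies(col.astype(cat_dtype)).to_numpy(dtype=float)  # (N, T)
        picked = onehot @ alpha[d].T                                          # (N, K)
        # NaN rows have an all-zero one-hot row; reset them via the NaN mask
        # rather than `picked == 0`, so real zeros in alpha are preserved.
        picked[col.isna().to_numpy()] = 1.0
        out *= picked
    return out
```

The returned array plays the role of r before the multiplication by pi in the loop above.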

Here's one approach to index into the NumPy array a with the pandas dataframe p that has NaNs. The NaN positions must not be used as indices, so we fill some value fillval in those places of the result:

    def fancy_indexing_avoid_NaNs(p, a, fillval=1):
        # Sizes come from a, which has shape (D, K, T)
        D, K = a.shape[:2]

        # Extract values from p and get NaN mask
        pv = p.values
        mask = np.isnan(pv)

        # Get int version, replacing NaNs with some number, say 0
        p_idx = np.where(mask, 0, pv).astype(int)

        # FANCY-INDEX into array 'a' with those indices from p;
        # the result has shape (K, N, D)
        a_indexed_vals = a[np.arange(D), np.arange(K)[:,None,None], p_idx]

        # FANCY-INDEX once more to overwrite the values picked at NaN
        # positions with fillval, so that in the prod-reduction later on
        # they would have no effect
        a_indexed_vals[np.arange(K)[:,None,None], mask] = fillval
        return a_indexed_vals

That fillval would be application dependent. In this case we are reducing with prod, so fillval=1 makes sense: multiplying by 1 does not affect the result.
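
To see why the index arrays in the function above line up, it can help to check the broadcasting on a tiny case. The sketch below uses small, arbitrary sizes and a hand-written mask purely for illustration (none of these values come from the original post): index arrays of shape (D,), (K, 1, 1) and (N, D) broadcast to (K, N, D), and filling the masked positions with 1 simply drops those factors from the product.

```python
import numpy as np

D, K, T, N = 2, 3, 4, 2                 # arbitrary small sizes for illustration
a = np.arange(D * K * T).reshape(D, K, T)
p_idx = np.array([[0, 3],
                  [2, 1]])              # (N, D) integer indices into the T axis

# Index shapes (D,), (K, 1, 1) and (N, D) broadcast to (K, N, D).
vals = a[np.arange(D), np.arange(K)[:, None, None], p_idx]

# Every entry satisfies vals[k, n, d] == a[d, k, p_idx[n, d]].
for k in range(K):
    for n in range(N):
        for d in range(D):
            assert vals[k, n, d] == a[d, k, p_idx[n, d]]

# Pretend p[0, 1] was NaN: fill that position with 1 for every k,
# then reduce over the D axis and transpose to (N, K).
mask = np.array([[False, True],
                 [False, False]])
filled = vals.astype(float)
filled[:, mask] = 1.0
out = filled.prod(axis=2).T
```

Here `filled[:, mask] = 1.0` is an equivalent, shorter spelling of the second fancy-indexing step: the boolean mask covers the last two axes and the slice covers every k.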

Original approach posted by OP:

    def generate_entry(i, j):
        # Product over d of a[d][j][p[i, d]], skipping NaN entries
        result = np.prod([a[s][j][int(p.loc[i][s])] for s in range(D)
                          if not np.isnan(p.loc[i][s])])
        return result

    vgenerate_entry = np.vectorize(generate_entry)

Sample run:

    In [154]: N,D,K,T = 3,4,5,6
         ...: a = np.random.randint(0,5,(D,K,T))
         ...:
         ...: p = pd.DataFrame(np.random.randint(0,T,(N,D)).astype(float))
         ...: p.iloc[2,3] = np.nan
         ...: p.iloc[1,2] = np.nan
         ...:

    In [155]: result = np.fromfunction(vgenerate_entry, shape=(len(p), K), dtype=int)

    In [156]: a_indexed_vals = fancy_indexing_avoid_NaNs(p, a)

    In [157]: out = a_indexed_vals.prod(2).T

    In [158]: np.allclose(out, result)
    Out[158]: True
