
Pandas Dataframe with Symmetrical Column and Index (row) Labels

Problem Statement:

I have a variety of pandas dataframes that I would like to make symmetrical. Sometimes the row index labels will outnumber the column labels, or vice versa. In either case, both the row and column labels of the resulting dataframe should be the sorted union of all the labels. Any missing data would be filled in with np.nan.

My solution works, but it involves making 3 copies of the dataframe: the original df, the df with col labels filled out, and a df with row labels filled out. Any other solution I've tried results in an incompletely symmetrical matrix. I am looking for help to make my solution simpler and more 'pythonic'.

Setup Asymmetrical Dataframe:

import numpy as np
import pandas as pd

n = np.nan  # shorthand for missing values

asym = pd.DataFrame.from_dict(
         {'row': ['a','b','c','x','y','z','!'],
            'a': [ n, -.8,-.6,-.3, .8, .01,n],
            'b': [-.8,  n, .5, .7,-.9, .01,n],
            'c': [-.6, .5,  n, .3, .1, .01,n],
            'q': [-.3, .7, .3,  n, .2, .01,n],
            'r': [ .8,-.9, .1, .2,  n, .01,n],
            's': [ .01, .01, .01, .01,  .01, n,n],
       }).set_index('row')

Asymmetrical dataframe:

[image: asymmetrical df]

Notice the column labels are missing "x","y","z","!" and the row labels are missing "q","r","s".
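The mismatch between the two axes can be listed directly with Index.difference. This is a minimal sketch, assuming a stand-in frame that carries only the same row and column labels as asym (the values are irrelevant here); the names labels_only, missing_cols and missing_rows are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Stand-in for `asym`: same labels, all-NaN values (assumption for brevity).
labels_only = pd.DataFrame(np.nan,
                           index=list('abcxyz!'),
                           columns=list('abcqrs'))

# Row labels that have no matching column, and vice versa.
missing_cols = labels_only.index.difference(labels_only.columns)
missing_rows = labels_only.columns.difference(labels_only.index)

print(list(missing_cols))  # ['!', 'x', 'y', 'z']
print(list(missing_rows))  # ['q', 'r', 's']
```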

Attempt to make symmetrical:

df = asym
c = df.columns
r = df.index
label_union = set(c).union(set(r))

# fill rows with unique labels
df_1 = df.reindex(index=label_union.difference(r).union(set(r)), fill_value=n)
# fill cols with unique labels
df_2 = df_1.reindex(columns=label_union.difference(c).union(set(c)), fill_value=n)

# sort labels
df_2.sort_index(axis=0, inplace=True)
df_2.sort_index(axis=1, inplace=True)

The result below is right, but making three df copies seems unpythonic. I would also like to perform the above operations "inplace", as the dataframes I work with are large and numerous. Help me find a solution that gives the correct result below without all the df copies.

Symmetrical dataframe:

[image: symmetrical df]

Note on use of "symmetrical": The resulting dataframe is not strictly symmetrical, meaning the matrix is not equal to its transpose. I am using "symmetrical" to refer specifically to the row and column labels. The matrix in this toy example emulates a genetic interaction matrix, where rows and columns are genes, and the corresponding value is a score depicting that interaction. To be truly symmetrical, the matrix would imply transitivity, which is not generally the case in genetic interactions.
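The distinction between matching labels and a truly symmetric matrix can be checked in a couple of lines. This is a small sketch on a made-up 2x2 frame (the data and the names labels_match and values_symmetric are illustrative assumptions, not part of the question).

```python
import pandas as pd

# Toy frame: identical row/column labels, but values differ across the diagonal.
df = pd.DataFrame([[0.0, 1.0],
                   [2.0, 0.0]],
                  index=['a', 'b'], columns=['a', 'b'])

# "Symmetrical" in the sense used here: the two label sets are the same.
labels_match = df.index.equals(df.columns)

# Strict symmetry: the frame equals its own transpose.
values_symmetric = df.equals(df.T)

print(labels_match, values_symmetric)  # True False
```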

You can reindex both axes simultaneously with reindex:

label_union = asym.index.union(asym.columns)
asym = asym.reindex(index=label_union, columns=label_union)

The resulting output:

    !     a     b     c     q     r     s   x   y   z
! NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
a NaN   NaN -0.80 -0.60 -0.30  0.80  0.01 NaN NaN NaN
b NaN -0.80   NaN  0.50  0.70 -0.90  0.01 NaN NaN NaN
c NaN -0.60  0.50   NaN  0.30  0.10  0.01 NaN NaN NaN
q NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
r NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
s NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
x NaN -0.30  0.70  0.30   NaN  0.20  0.01 NaN NaN NaN
y NaN  0.80 -0.90  0.10  0.20   NaN  0.01 NaN NaN NaN
z NaN  0.01  0.01  0.01  0.01  0.01   NaN NaN NaN NaN
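Since the question mentions many dataframes, the two lines above can be wrapped in a small helper. This is a sketch only: the function name symmetrize_labels and the toy frame are assumptions for illustration. Note that reindex returns a new object rather than modifying the frame in place, so the result has to be assigned.

```python
import pandas as pd

def symmetrize_labels(df):
    """Return df reindexed so rows and columns share the sorted label union."""
    labels = df.index.union(df.columns)  # sorted union of both axes
    return df.reindex(index=labels, columns=labels)

# Toy input: rows 'b','a' and columns 'a','c'.
small = pd.DataFrame([[1.0, 2.0],
                      [3.0, 4.0]],
                     index=['b', 'a'], columns=['a', 'c'])

result = symmetrize_labels(small)
print(result)
```

Labels present on only one axis (here 'b' as a column and 'c' as a row) come back filled with NaN.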

Here's a NumPy approach with np.ix_, which eases creation of a 2D grid of valid indices; the rest is just initializing with NaNs and assigning:

c = df.columns
r = df.index

L = np.union1d(c,r)
cols = np.searchsorted( L, c)
rows = np.searchsorted( L, r)
out = np.full((len(L),len(L)), np.nan)
out[np.ix_(rows, cols)] = df.values
df_out = pd.DataFrame(out, columns=L, index=L)

In terms of memory requirements, out would be a view into the output dataframe and as such won't occupy any additional memory, provided pandas does not copy the array when constructing the dataframe (this can depend on the pandas version and its copy settings).

Sample output:

In [556]: df_out
Out[556]: 
    !     a     b     c     q     r     s   x   y   z
! NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
a NaN   NaN -0.80 -0.60 -0.30  0.80  0.01 NaN NaN NaN
b NaN -0.80   NaN  0.50  0.70 -0.90  0.01 NaN NaN NaN
c NaN -0.60  0.50   NaN  0.30  0.10  0.01 NaN NaN NaN
q NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
r NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
s NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
x NaN -0.30  0.70  0.30   NaN  0.20  0.01 NaN NaN NaN
y NaN  0.80 -0.90  0.10  0.20   NaN  0.01 NaN NaN NaN
z NaN  0.01  0.01  0.01  0.01  0.01   NaN NaN NaN NaN
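To see what searchsorted and np.ix_ contribute, here is the same idea on a tiny made-up example. The label lists and values below are assumptions for illustration; only the NumPy calls mirror the answer.

```python
import numpy as np

row_labels = np.array(['b', 'a'])
col_labels = np.array(['a', 'c'])
values = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

L = np.union1d(col_labels, row_labels)   # sorted union: ['a', 'b', 'c']
rows = np.searchsorted(L, row_labels)    # positions of row labels in L: [1, 0]
cols = np.searchsorted(L, col_labels)    # positions of col labels in L: [0, 2]

out = np.full((len(L), len(L)), np.nan)
# np.ix_ builds an open mesh so that out[rows[i], cols[j]] = values[i, j].
out[np.ix_(rows, cols)] = values
print(out)
```

Every cell not addressed by the mesh keeps its initial NaN, which is what produces the all-NaN rows and columns in the sample output.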

Get the union of the two indexes as you do today, then reindex the dataframe twice with 2 chained transpositions:

full_idx = asym.index.union(asym.columns)

asym.reindex(full_idx).T.reindex(full_idx).T
Out[116]: 
    !     a     b     c     q     r     s   x   y   z
! NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
a NaN   NaN -0.80 -0.60 -0.30  0.80  0.01 NaN NaN NaN
b NaN -0.80   NaN  0.50  0.70 -0.90  0.01 NaN NaN NaN
c NaN -0.60  0.50   NaN  0.30  0.10  0.01 NaN NaN NaN
q NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
r NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
s NaN   NaN   NaN   NaN   NaN   NaN   NaN NaN NaN NaN
x NaN -0.30  0.70  0.30   NaN  0.20  0.01 NaN NaN NaN
y NaN  0.80 -0.90  0.10  0.20   NaN  0.01 NaN NaN NaN
z NaN  0.01  0.01  0.01  0.01  0.01   NaN NaN NaN NaN
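As a quick sanity check, the chained-transpose form can be compared against a single reindex call over both axes. This is a sketch on a made-up all-float frame (the frame and the variable names are illustrative assumptions); with mixed column dtypes, transposing may upcast to object, so the comparison is only claimed for homogeneous data like this.

```python
import pandas as pd

small = pd.DataFrame([[1.0, 2.0],
                      [3.0, 4.0]],
                     index=['b', 'a'], columns=['a', 'c'])

full_idx = small.index.union(small.columns)

# Reindex rows, transpose, reindex rows again (the former columns), transpose back.
via_transpose = small.reindex(full_idx).T.reindex(full_idx).T

# Reindex both axes in one call.
via_both_axes = small.reindex(index=full_idx, columns=full_idx)

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_transpose, via_both_axes)
```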
