MultiIndexing rows vs. columns in a pandas DataFrame

Question

I am working with a MultiIndexed DataFrame in pandas and am wondering whether I should MultiIndex the rows or the columns.

My data looks something like this: [image: data table]

Code:

import numpy as np
import pandas as pd

# Build the column MultiIndex as the Cartesian product of the three levels.
# (pd.tools.util.cartesian_product is not available in current pandas;
# MultiIndex.from_product builds the same index directly.)
colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
# Rows are indexed by time.
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)

Here I chose to MultiIndex the columns, on the rationale that a pandas DataFrame consists of Series, and my data is ultimately a bunch of time series (hence row-indexed by time here).
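To illustrate that rationale, here is a minimal sketch that rebuilds the same frame and pulls individual time series out of it. The variable names `series` and `patient_block` are only illustrative, and the values are random.

```python
import numpy as np
import pandas as pd

colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)

# A full column key (one label per level) returns a single time series.
series = data[('condition1', 'patient1', 'measure1')]

# A partial key returns the sub-frame of all measures for that
# condition/patient pair; the matched levels are dropped from the columns.
patient_block = data[('condition1', 'patient1')]
```

With this layout each column is one Series indexed by time, which is the property the column-MultiIndexed design relies on.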

I ask because there seems to be some asymmetry between rows and columns when it comes to MultiIndexing. For example, the documentation shows how query works for a row-MultiIndexed DataFrame, but if the DataFrame is column-MultiIndexed, the command in the documentation has to be replaced by something like df.T.query('color == "red"').T.
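A small sketch of that workaround, using a reduced version of the frame above rather than the color example from the documentation. The boolean-mask and xs lines are alternatives added here for comparison, not something the documentation passage prescribes.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'], ['patient1', 'patient2']],
    names=['condition', 'patient'])
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=pd.Index([0, 1, 2], name='time'), columns=cols)

# query filters rows, so to filter on a column level we transpose
# (the level becomes a row level), query, and transpose back.
via_query = df.T.query('condition == "condition1"').T

# The same selection without transposing: a boolean mask on the level values.
mask = df.columns.get_level_values('condition') == 'condition1'
via_mask = df.loc[:, mask]

# xs can also select on a column level if axis=1 is given;
# by default it drops the selected level from the result.
via_xs = df.xs('condition1', level='condition', axis=1)
```

The transpose round trip works, but it is an extra step that a row-MultiIndexed frame would not need.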

My question might seem a bit silly, but I'd like to know whether there is any difference in convenience between MultiIndexing rows and MultiIndexing columns of a DataFrame (such as in the query case above).

Thanks.

Answer 1

A rough personal summary of what I call the row/column propensity of some common DataFrame operations:

[] : column-first
get : column-only
attribute access as indexing : column-only
query : row-only
loc, iloc, ix : row-first
xs : row-first
sortlevel : row-first
groupby : row-first

"row-first" means the operation expects the row index as its first argument; to operate on the column index one needs to use [:, ] or specify axis=1.
"row-only" means the operation only works on the row index, and one has to do something like transposing the DataFrame to operate on the column index.

Based on this, it seems that MultiIndexing the rows is slightly more convenient.
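A few of these propensities can be sketched on a small column-MultiIndexed frame. This is only an illustration with made-up integer data; it leaves out ix and sortlevel, which are not available in current pandas releases.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'], ['patient1', 'patient2']],
    names=['condition', 'patient'])
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=pd.Index([0, 1, 2], name='time'), columns=cols)

# [] looks the key up in the columns first (here, the outer column level).
by_brackets = df['condition1']

# loc takes the row label first...
row_by_loc = df.loc[0]
# ...so selecting columns needs the [:, ] form.
cols_by_loc = df.loc[:, 'condition1']

# xs works on rows by default; a column level needs axis=1.
cols_by_xs = df.xs('patient1', level='patient', axis=1)
```

Every column selection here is possible, but apart from [] each one needs the extra [:, ] or axis=1.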

A natural follow-up question of mine: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing DataFrames, yet one slices columns and the others slice rows, which seems a bit odd.

Answer 1
0 2014-02-28 02:46:17
