
MultiIndexing rows vs. columns in pandas DataFrame

I am working with a MultiIndexed DataFrame in pandas and am wondering whether I should MultiIndex the rows or the columns.

My data looks something like this (the code below builds the same table):

Code:

import numpy as np
import pandas as pd

# pd.tools.util.cartesian_product is not available in current pandas;
# MultiIndex.from_product builds the same cartesian-product index directly.
colidxs = pd.MultiIndex.from_product([['condition1', 'condition2'],
                                      ['patient1', 'patient2'],
                                      ['measure1', 'measure2', 'measure3']],
                                     names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
# 4 time points (rows) x 12 condition/patient/measure combinations (columns)
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)

Here I chose to MultiIndex the columns, on the rationale that a pandas DataFrame consists of Series, and my data is ultimately a collection of time series (hence the rows are indexed by time).
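To make the two layouts concrete, here is a minimal sketch using the same data as above. It assumes that stack() is an acceptable way to move column levels into the row index; the variable names series and long are only illustrative, and depending on the pandas version stack() may emit a FutureWarning.

```python
import numpy as np
import pandas as pd

colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'], ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)

# Column layout: a full key picks out one time series as a Series.
series = data[('condition1', 'patient1', 'measure1')]

# Row layout: move 'condition' and 'patient' from the columns into the index,
# leaving only 'measure' on the columns.
long = data.stack(['condition', 'patient'])
```

With the column layout each column is one time series indexed by time; with the row layout each row is one (time, condition, patient) observation.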

I have this question because there seems to be some asymmetry between rows and columns when it comes to MultiIndexing. For example, the documentation shows how query works for a row-MultiIndexed DataFrame, but if the DataFrame is column-MultiIndexed, the command in the documentation has to be replaced by something like df.T.query('color == "red"').T.
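A small sketch of that transpose workaround, using level names from my own data rather than the color example in the documentation. The boolean-mask alternative is just one option I am aware of, not necessarily the recommended one.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'], ['patient1', 'patient2']],
    names=['condition', 'patient'])
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=pd.Index([0, 1, 2], name='time'), columns=cols)

# query() filters rows, so transpose, filter on the level name, transpose back.
via_query = df.T.query('condition == "condition1"').T

# Alternative without transposing: build a boolean mask over the columns.
mask = df.columns.get_level_values('condition') == 'condition1'
via_mask = df.loc[:, mask]
```

Both results keep only the condition1 columns; the mask version avoids the double transpose.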

My question might seem a bit silly, but I'd like to know whether there is any difference in convenience between MultiIndexing rows and MultiIndexing columns of a DataFrame (as in the query case above).

Thanks.

A rough personal summary of what I call the row/column propensity of some common DataFrame operations:

  • [] : column-first
  • get : column-only
  • attribute access as indexing: column-only
  • query : row-only
  • loc, iloc, ix : row-first
  • xs : row-first
  • sortlevel : row-first
  • groupby : row-first

"row-first" means the operation expects the row index as its first argument, and to operate on the column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation works only on the row index, and one has to do something like transposing the DataFrame to operate on the column index.

Based on this, it seems that MultiIndexing the rows is slightly more convenient.
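Here is how a few of the operations in the list look when the MultiIndex is on the columns. This is only a sketch against the data layout above. Note that ix and sortlevel have been removed from later pandas releases, so sort_index stands in for sortlevel, and since groupby(axis=1) is deprecated in recent releases the sketch transposes instead.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'], ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
data = pd.DataFrame(np.arange(48).reshape(4, 12),
                    index=pd.Index([0, 1, 2, 3], name='time'), columns=cols)

# [] is column-first: a key selects on the outermost column level.
by_bracket = data['condition1']                       # 4 rows x 6 columns

# loc is row-first: the column key goes in the second position.
by_loc = data.loc[:, ('condition1', 'patient2')]      # 4 rows x 3 columns

# xs is row-first: axis=1 switches it to the column index.
by_xs = data.xs('measure1', level='measure', axis=1)  # 4 rows x 4 columns

# sort_index (successor to sortlevel) also needs axis=1 for columns.
by_sort = data.sort_index(axis=1, level='measure')

# groupby on column levels: transpose, group on the row index, transpose back.
by_group = data.T.groupby(level='condition').mean().T  # 4 rows x 2 columns
```

Every call except [] needs either a leading [:, ...], an explicit axis=1, or a transpose, which is the extra typing that makes the column layout feel less convenient.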

A natural follow-up question: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, it seems a bit odd that [] and loc/iloc/ix are the two most common ways of indexing a DataFrame, yet one slices columns and the others slice rows.
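The asymmetry is visible even without a MultiIndex. A toy sketch follows; the frame and its column names a and b are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]})

# [] with a label selects a column ...
col = df['a']

# ... but [] with a slice selects rows by position (end excluded).
rows_pos = df[1:3]

# loc takes rows first; label slices include the end label.
rows_lab = df.loc[1:3]

# To reach a column through loc, the row position must be filled with ':'.
col_loc = df.loc[:, 'a']
```

So [] itself is only column-first for single keys and lists of keys; with a slice it falls back to rows, which adds to the inconsistency described above.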
