MultiIndexing rows vs. columns in pandas DataFrame
I am working with a MultiIndexed DataFrame in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd
# pd.tools.util.cartesian_product is not available in current pandas;
# MultiIndex.from_product builds the same Cartesian product of labels.
colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
# Rows are indexed by time; columns carry the three-level MultiIndex.
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I chose to multiindex the columns, on the rationale that a pandas DataFrame consists of Series, and my data is ultimately a collection of time series (hence the rows are indexed by time).
I ask because there seems to be some asymmetry between rows and columns when it comes to multiindexing. For example, the pandas documentation shows how query works on a row-multiindexed DataFrame, but if the DataFrame is column-multiindexed, the command in the documentation has to be replaced by something like df.T.query('color == "red"').T.
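To make the transpose workaround concrete, here is a minimal sketch applied to the example frame from this question rather than the documentation's `color` example. It assumes a pandas version that provides `MultiIndex.from_product`; the variable names `via_query`, `mask`, and `via_mask` are made up for illustration. The second half shows one way to get the same selection without transposing.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (values are random).
colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)

# query() filters rows, so transpose, filter on the level name, transpose back.
via_query = data.T.query('condition == "condition1"').T

# Same selection without transposing: a boolean mask over one column level.
mask = data.columns.get_level_values('condition') == 'condition1'
via_mask = data.loc[:, mask]
```

Both results keep all three column levels. The double transpose is cheap here because every column is float; on a frame with mixed dtypes it may upcast the data, so the mask form is probably the safer habit.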
My question might seem a bit silly, but I'd like to know whether there is any difference in convenience between multiindexing the rows and multiindexing the columns of a DataFrame (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column propensity of some common DataFrame operations:
[]: column-first
get: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects the row index as the first argument, and to operate on the column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works on the row index, and one has to do something like transposing the DataFrame to operate on the column index.
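The "row-first" cases in the summary above can be sketched on the example frame as follows. This is an illustration under assumptions, not a complete survey: it uses `sort_index` in place of `sortlevel` and leaves out `ix`, since neither `sortlevel` nor `ix` exists in current pandas releases, and it groups over the transposed frame instead of relying on an `axis` argument to `groupby`. All result names are hypothetical.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
data = pd.DataFrame(np.random.randn(4, len(cols)),
                    index=pd.Index([0, 1, 2, 3], name='time'), columns=cols)

# loc: rows come first, so column labels go after the comma.
by_loc = data.loc[:, ('condition1', 'patient2')]

# xs: targets the row index by default; axis=1 plus level names targets columns.
by_xs = data.xs(('condition1', 'patient2'), axis=1,
                level=['condition', 'patient'])

# sort_index: axis=1 sorts the column labels, here by the 'measure' level first.
sorted_cols = data.sort_index(axis=1, level='measure')

# groupby: group the transposed frame by column levels, then transpose back.
# This averages over patients for each (condition, measure) pair.
mean_over_patients = data.T.groupby(level=['condition', 'measure']).mean().T
```

In each case the column-side version needs one extra piece (a leading `:`, an `axis=1`, or a transpose), which is the asymmetry the summary describes.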
Based on this, multiindexing the rows seems slightly more convenient.
A natural follow-up question: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing a DataFrame, yet one selects columns while the others select rows, which seems a bit odd.
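A small sketch of that contrast on the example frame, assuming the same setup as in the question (the result names are hypothetical). It also shows one wrinkle in the "column-first" label: with a slice, `[]` acts on rows rather than columns.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
data = pd.DataFrame(np.random.randn(4, len(cols)),
                    index=pd.Index([0, 1, 2, 3], name='time'), columns=cols)

# [] with a label looks in the columns: every column under 'condition1'.
cols_by_brackets = data['condition1']

# loc with a single label looks in the rows: the row at time 0.
row_by_loc = data.loc[0]

# loc with two arguments: row label first, column label second.
row_and_cols = data.loc[0, 'condition1']

# [] with a slice selects rows, not columns.
rows_by_slice = data[1:3]
```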