Benefits of pandas' MultiIndex?
So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do subsampling/cross-sections.
So what is a MultiIndex good for, apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as "multi-level" indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (a DataFrame, for example).
Imagine constructing a DataFrame using a MultiIndex like this:
import pandas as pd
import numpy as np
arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2), index=pd.MultiIndex.from_tuples(list(zip(*arrays))), columns=['A', 'B'])
df # This is the dataframe we have generated
              A         B
one 1 -0.732470 -0.313871
    2 -0.031109 -2.068794
    3  1.520652  0.471764
two 1 -0.101713 -1.204458
    2  0.958008 -0.455419
    3 -0.191702 -0.915983
This df is simply a two-dimensional data structure:
df.ndim
2
But looking at the output, we can imagine it as a 3-dimensional data structure:
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
Aka: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure".
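To make the "three coordinates in a two-dimensional table" idea a bit more concrete, here is a minimal sketch. The level names `group` and `item` and the fixed values are my own assumptions for illustration; the example above uses unnamed levels and random data.

```python
import numpy as np
import pandas as pd

# Hypothetical level names; fixed values instead of random ones so the result is predictable.
index = pd.MultiIndex.from_product([["one", "two"], [1, 2, 3]], names=["group", "item"])
values = np.arange(12, dtype=float).reshape(6, 2)
demo = pd.DataFrame(values, index=index, columns=["A", "B"])

print(demo.ndim)           # the frame itself is still 2-dimensional
print(demo.index.nlevels)  # but the row axis carries two levels

# Three coordinates (group, item, column) address a single cell.
cell = demo.loc[("two", 1), "B"]
print(cell)
```

The frame stays 2-dimensional; the extra "dimension" lives entirely in the levels of the row index.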
This is not just a "pretty display". It has the benefit of easy retrieval of data, since we now have a hierarchical index.
For example:
In [44]: df.loc["one"]
Out[44]:
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:
In [45]: df.loc["one"].loc[1]
Out[45]:
A   -0.732470
B   -0.313871
Name: 1
And of course, if we want a specific value, here's an example:
In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more index levels (besides the 2 shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.
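As a rough sketch of that drill-down with a third level, partial tuple keys passed to `.loc` narrow the selection one level at a time. The level names `outer`, `middle`, `inner` and the values are invented for this illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["one", "two"], [1, 2], ["x", "y"]],
    names=["outer", "middle", "inner"],  # hypothetical names
)
deep = pd.DataFrame({"A": np.arange(8), "B": np.arange(8) * 10}, index=index)

# Each extra key in the tuple narrows the selection by one level.
block = deep.loc["one"]                 # all rows under outer == "one"
sub_block = deep.loc[("one", 2), :]     # rows under outer == "one", middle == 2
value = deep.loc[("one", 2, "y"), "A"]  # a single cell

print(block)
print(sub_block)
print(value)
```

Passing the column selector explicitly (`:` or `"A"`) avoids any ambiguity about whether the tuple refers to index levels or to a row/column pair.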
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:
In [47]: df.xs('one')
Out[47]:
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764
By columns:
In [48]: df.xs('B', axis=1)
Out[48]:
one  1   -0.313871
     2   -2.068794
     3    0.471764
two  1   -1.204458
     2   -0.455419
     3   -0.915983
Name: B
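One more thing `xs` can do, sketched here with made-up data and hypothetical level names (`group`, `item`): it can take a cross-section on an inner level via the `level` argument, which plain label lookup on the outer level cannot.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["one", "two"], [1, 2, 3]], names=["group", "item"])
demo = pd.DataFrame({"A": [0, 1, 2, 3, 4, 5], "B": [10, 11, 12, 13, 14, 15]}, index=index)

# Cross-section on the inner level: every row whose "item" is 2.
item_two = demo.xs(2, level="item")

# Same selection, but keep the "item" level in the result.
item_two_kept = demo.xs(2, level="item", drop_level=False)

print(item_two)
print(item_two_kept)
```

By default the selected level is dropped from the result; `drop_level=False` keeps it if the full index is needed later.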
Great post by @Calvin Cheng, but thought I'd take a stab at this as well.
When to use a MultiIndex:
Why (your core question) - at least these are the biggest benefits IMO:
Example:
                                                      Dollars  Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024    154.77      7
           Store 2 Alcohol  Liqour      80480280024     82.08      4
           Store 3 Alcohol  Liqour      80480280024    259.38      9
           Store 1 Alcohol  Liquor      80432400630    477.68     14
                                        674545000001   139.68      4
           Store 2 Alcohol  Liquor      80432400630    203.88      6
                                        674545000001   377.13     13
           Store 3 Alcohol  Liquor      80432400630    239.19      7
                                        674545000001   432.32     14
           Store 1 Beer     Ales        94922755711     65.17      7
                                        702770082018   174.44     14
                                        736920111112    50.70      5
           Store 2 Beer     Ales        94922755711    129.60     12
                                        702770082018   107.40     10
                                        736920111112    59.65      5
           Store 3 Beer     Ales        94922755711    154.00     14
                                        702770082018   137.40     10
                                        736920111112   107.88     12
           Store 1 Beer     Lagers      702770081011   156.24     12
           Store 2 Beer     Lagers      702770081011   137.06     11
           Store 3 Beer     Lagers      702770081011   119.52      8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
                                              Dollars                    Units
Store                                         Store 1  Store 2  Store 3  Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024    154.77    82.08   259.38        7        4        9
                    Liquor      80432400630    477.68   203.88   239.19       14        6        7
                                674545000001   139.68   377.13   432.32        4       13       14
           Beer     Ales        94922755711     65.17   129.60   154.00        7       12       14
                                702770082018   174.44   107.40   137.40       14       10       10
                                736920111112    50.70    59.65   107.88        5        5       12
                    Lagers      702770081011   156.24   137.06   119.52       12       11        8
2) We can also easily do math on multiple columns. For example, on the unstacked frame, df['Dollars'] / df['Units'] divides each store's dollars by its units for every store in a single operation:
Store                                         Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024     22.11    20.52    28.82
                    Liquor      80432400630     34.12    33.98    34.17
                                674545000001    34.92    29.01    30.88
           Beer     Ales        94922755711      9.31    10.80    11.00
                                702770082018    12.46    10.74    13.74
                                736920111112    10.14    11.93     8.99
                    Lagers      702770081011    13.02    12.46    14.94
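Points 1) and 2) can be reproduced on a much smaller frame. This is only a sketch: the `sales` frame and its numbers are invented here and are not the data shown above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Store 1", "Store 2"], ["Ales", "Lagers"]],
    names=["Store", "Subcategory"],
)
sales = pd.DataFrame(
    {"Dollars": [10.0, 30.0, 20.0, 60.0], "Units": [2, 3, 5, 6]},
    index=index,
)

# Move the Store level from the rows to the columns.
wide = sales.unstack("Store")

# Both operands share the same Store columns, so one division covers every store.
price = wide["Dollars"] / wide["Units"]

print(wide)
print(price)
```

After `unstack`, the columns themselves form a MultiIndex (`Dollars`/`Units` on top, `Store` underneath), which is why `wide["Dollars"]` returns a whole sub-frame with one column per store.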
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]
format, we can instead use .xs or .query (yes, these work for regular dfs, but they're not very useful there). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
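Here is a small side-by-side of the filtering styles, as a sketch. The `flat` frame is invented for illustration, and I sort the index first, which I'm assuming is good practice here rather than something strictly required.

```python
import pandas as pd

flat = pd.DataFrame({
    "Store": ["Store 1", "Store 1", "Store 2", "Store 2"],
    "Category": ["Beer", "Alcohol", "Beer", "Alcohol"],
    "Dollars": [65.17, 154.77, 129.60, 82.08],
})
indexed = flat.set_index(["Store", "Category"]).sort_index()

# Boolean-mask form on plain columns: element-wise `&`, not Python's `and`.
by_mask = flat[(flat["Store"] == "Store 2") & (flat["Category"] == "Beer")]

# The same selection on the MultiIndex, by level name.
by_xs = indexed.xs(("Store 2", "Beer"), level=("Store", "Category"))
by_query = indexed.query("Store == 'Store 2' and Category == 'Beer'")

print(by_mask)
print(by_xs)
print(by_query)
```

All three return the same single row; the MultiIndex versions just name the levels instead of repeating the frame in every comparison.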
More examples can be found in this tutorial notebook I put together.
The alternative to using a MultiIndex is to store your data using multiple columns of a dataframe. One would expect a MultiIndex to provide a performance boost over naive column storage, but as of pandas v1.1.4, that appears not to be the case.
import numpy as np
import pandas as pd
np.random.seed(2020)
inv = pd.DataFrame({
'store_id': np.random.choice(10000, size=10**7),
'product_id': np.random.choice(1000, size=10**7),
'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
The benefits from using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack(), as mentioned in @ZaxR's answer. Performance is not a benefit, which seems like a real missed opportunity.
Based on the comment on this answer, it seems the experiment was flawed. Here is my attempt at a correct experiment.
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
      timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
      timeit(
          lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
          number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
MultiIndex single-value retrieval is over 7 times faster than conventional dataframe single-value retrieval.
The syntax for MultiIndex retrieval is much cleaner.