
Benefits of pandas' MultiIndex?

So I learned that I can use DataFrame.groupby without having a MultiIndex to do sub-sampling/cross-sections.

On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.

So what is a MultiIndex good for, apart from the quite helpful and pretty display of hierarchies when printing?

Hierarchical indexing (also referred to as "multi-level" indexing) was introduced in the pandas 0.4 release.

This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher-dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high-dimensional data in a 2-dimensional tabular structure (a DataFrame).

Imagine constructing a DataFrame using a MultiIndex like this:

import pandas as pd
import numpy as np

# Use a plain list of lists; don't assign to np.arrays, which shadows an
# attribute on the numpy module
arrays = [['one','one','one','two','two','two'], [1,2,3,1,2,3]]

df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])

df  # This is the dataframe we have generated

          A         B
one 1 -0.732470 -0.313871
    2 -0.031109 -2.068794
    3  1.520652  0.471764
two 1 -0.101713 -1.204458
    2  0.958008 -0.455419
    3 -0.191702 -0.915983

This df is simply a data structure of two dimensions:

df.ndim

2

But looking at the output, we can imagine it as a 3-dimensional data structure:

  • one with 1 with data -0.732470 -0.313871
  • one with 2 with data -0.031109 -2.068794
  • one with 3 with data 1.520652 0.471764

Aka: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure".

This is not just a "pretty display". It has the benefit of easy retrieval of data, since we now have a hierarchical index.

For example:

In [44]: df.loc["one"]
Out[44]: 
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

will give us a new DataFrame containing only the group of data belonging to "one".

And we can narrow down our data selection further by doing this:

In [45]: df.loc["one"].loc[1]
Out[45]: 
A   -0.732470
B   -0.313871
Name: 1

And of course, if we want a specific value, here's an example:

In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727

So if we have even more index levels (besides the two shown in the example above), we can essentially drill down and select the data set we are really interested in without needing groupby.
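As a sketch of that drill-down (hypothetical data, with a third index level added purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical three-level index: (group, subgroup, item) -- not the
# frame from above, just a sketch of the same idea with one more level.
idx = pd.MultiIndex.from_product(
    [["one", "two"], ["x", "y"], [1, 2]],
    names=["group", "subgroup", "item"],
)
df3 = pd.DataFrame({"A": np.arange(8.0)}, index=idx)

sub1 = df3.loc["one"]                 # all rows under group "one"
sub2 = df3.loc[("one", "x")]          # narrower: group "one", subgroup "x"
val = df3.loc[("one", "x", 2), "A"]   # drill all the way down to one value
print(val)
```

Each partial key peels off one index level, so no groupby is needed at any step.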

We can even grab a cross-section (either rows or columns) from our DataFrame...

By rows:

In [47]: df.xs('one')
Out[47]: 
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

By columns:

In [48]: df.xs('B', axis=1)
Out[48]: 
one  1   -0.313871
     2   -2.068794
     3    0.471764
two  1   -1.204458
     2   -0.455419
     3   -0.915983
Name: B
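A related convenience not shown above (my sketch, on a freshly built frame of the same shape): pd.IndexSlice gives label-based slicing across levels, provided the index is sorted:

```python
import numpy as np
import pandas as pd

# Rebuild the same index shape with deterministic values.
arrays = [["one", "one", "one", "two", "two", "two"], [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(
    np.arange(12.0).reshape(6, 2),
    index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
    columns=["A", "B"],
)

# Keep every first-level group, but only second-level labels 1 through 2.
# This requires a lexsorted index (which from_tuples gives us here).
idx = pd.IndexSlice
subset = df.loc[idx[:, 1:2], :]
print(subset)
```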

Great post by @Calvin Cheng, but I thought I'd take a stab at this as well.

When to use a MultiIndex:

  1. When a single column's value isn't enough to uniquely identify a row.
  2. When data is logically hierarchical - meaning that it has multiple dimensions or "levels."

Why (your core question) - at least these are the biggest benefits, IMO:

  1. Easy manipulation via stack() and unstack()
  2. Easy math when there are multiple column levels
  3. Syntactic sugar for slicing/filtering
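A tiny sketch of benefit 1 on hypothetical data: unstack() pivots an index level out into columns (and stack() folds it back):

```python
import pandas as pd

# Hypothetical (month, item) sales series.
idx = pd.MultiIndex.from_product(
    [["2024-01", "2024-02"], ["A", "B"]], names=["month", "item"]
)
sales = pd.Series([1, 2, 3, 4], index=idx, name="sales")

# The "item" index level becomes the columns of a wide frame.
wide = sales.unstack("item")
print(wide)
```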

Example:

                                                       Dollars  Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024    154.77      7
           Store 2 Alcohol  Liqour      80480280024     82.08      4
           Store 3 Alcohol  Liqour      80480280024    259.38      9
           Store 1 Alcohol  Liquor      80432400630    477.68     14
                                        674545000001   139.68      4
           Store 2 Alcohol  Liquor      80432400630    203.88      6
                                        674545000001   377.13     13
           Store 3 Alcohol  Liquor      80432400630    239.19      7
                                        674545000001   432.32     14
           Store 1 Beer     Ales        94922755711     65.17      7
                                        702770082018   174.44     14
                                        736920111112    50.70      5
           Store 2 Beer     Ales        94922755711    129.60     12
                                        702770082018   107.40     10
                                        736920111112    59.65      5
           Store 3 Beer     Ales        94922755711    154.00     14
                                        702770082018   137.40     10
                                        736920111112   107.88     12
           Store 1 Beer     Lagers      702770081011   156.24     12
           Store 2 Beer     Lagers      702770081011   137.06     11
           Store 3 Beer     Lagers      702770081011   119.52      8    

1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side by side:

                                             Dollars                   Units
Store                                        Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024   154.77   82.08  259.38       7       4       9
                    Liquor      80432400630   477.68  203.88  239.19      14       6       7
                                674545000001  139.68  377.13  432.32       4      13      14
           Beer     Ales        94922755711    65.17  129.60  154.00       7      12      14
                                702770082018  174.44  107.40  137.40      14      10      10
                                736920111112   50.70   59.65  107.88       5       5      12
                    Lagers      702770081011  156.24  137.06  119.52      12      11       8

2) We can also easily do math on multiple column levels. For example, df['Dollars'] / df['Units'] divides each store's dollars by its units, for every store, in a single operation:

Store                                         Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024     22.11    20.52    28.82
                    Liquor      80432400630     34.12    33.98    34.17
                                674545000001    34.92    29.01    30.88
           Beer     Ales        94922755711      9.31    10.80    11.00
                                702770082018    12.46    10.74    13.74
                                736920111112    10.14    11.93     8.99
                    Lagers      702770081011    13.02    12.46    14.94
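A minimal sketch of why that works, with hypothetical numbers rather than the table above: selecting 'Dollars' drops the top column level, leaving two frames that align on the remaining Store level:

```python
import pandas as pd

# Two column levels, (metric, store), built by hand for illustration.
cols = pd.MultiIndex.from_product([["Dollars", "Units"], ["Store 1", "Store 2"]])
df = pd.DataFrame([[10.0, 20.0, 2, 4],
                   [30.0, 40.0, 3, 8]], columns=cols)

# df["Dollars"] and df["Units"] are each plain frames with "Store 1" and
# "Store 2" columns, so the division pairs them up store by store.
price = df["Dollars"] / df["Units"]
print(price)
```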

3) If we then want to filter down to just specific rows, instead of using the

df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]

format, we can instead use .xs or .query (yes, these work for regular DataFrames too, but there they're not very useful). The syntax would instead be:

df.xs((val1, val2, val3), level=(col1, col2, col3))
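For instance, on a small hypothetical frame keyed by (Store, Category):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Store 1", "Beer"), ("Store 1", "Wine"), ("Store 2", "Beer")],
    names=["Store", "Category"],
)
df = pd.DataFrame({"Units": [7, 3, 12]}, index=idx)

# Filter on a named level, no boolean masks; the matched level is
# dropped from the result by default.
beer = df.xs("Beer", level="Category")
print(beer)
```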

More examples can be found in this tutorial notebook I put together.

The alternative to using a MultiIndex is to store your data using multiple columns of a DataFrame. One would expect a MultiIndex to provide a performance boost over naive column storage, but as of pandas 1.1.4, that appears not to be the case.

Timings

import numpy as np
import pandas as pd

np.random.seed(2020)
inv = pd.DataFrame({
    'store_id': np.random.choice(10000, size=10**7),
    'product_id': np.random.choice(1000, size=10**7),
    'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
                     stock
store_id product_id       
0        2              48
         4              18
         5              58
         7             149
         8             158
...                    ...
9999     992           132
         995           121
         996           105
         998            99
         999            16

[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
         store_id  product_id  stock
0               0           2     48
1               0           4     18
2               0           5     58
3               0           7    149
4               0           8    158
...           ...         ...    ...
6321864      9999         992    132
6321865      9999         995    121
6321866      9999         996    105
6321867      9999         998     99
6321868      9999         999     16

[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop

%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop

%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop

%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop

%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop

%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
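One caveat these timings don't isolate (my note, not part of the experiment above): MultiIndex lookups are fastest when the index is lexsorted, and the groupby used to build inv_multi happens to produce one. On an index built some other way, it is worth calling sort_index() first; a sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "store_id": rng.integers(0, 100, size=100_000),
    "product_id": rng.integers(0, 100, size=100_000),
    "stock": rng.integers(0, 100, size=100_000),
}).set_index(["store_id", "product_id"])

# An unsorted MultiIndex forces slower lookups (and pandas raises
# UnsortedIndexError for some partial slices); sorting lets .loc/.xs
# use binary search over the lexsorted labels.
df = df.sort_index()
sub = df.xs(50, level="store_id")
print(len(sub))
```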

Conclusion

The benefits of using a MultiIndex are syntactic sugar, self-documenting data, and small conveniences from functions like unstack(), as mentioned in @ZaxR's answer; performance is not among them, which seems like a real missed opportunity.

Based on the comment on this answer, it seems the experiment was flawed. Here is my attempt at a correct experiment.

Timings

import pandas as pd
import numpy as np
from timeit import timeit


random_data = np.random.randn(16, 4)

multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)

dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")

dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))

df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")

print("dfm_selection timeit:",
      timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
      timeit(
          lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
          number=int(1e6)))
dfm:
             0         1         2         3
A 1 -1.055128 -0.845019 -2.853027  0.521738
  2  0.397804  0.385045 -0.121294 -0.696215
  3 -0.551836 -0.666953 -0.956578  1.929732
  4 -0.154780  1.778150  0.183104 -0.013989
B 1 -0.315476  0.564419  0.492496 -1.052432
  2 -0.695300  0.085265  0.701724 -0.974168
  3 -0.879915 -0.206499  1.597701  1.294885
  4  0.653261  0.279641 -0.800613  1.050241
C 1  1.004199 -1.377520 -0.672913  1.491793
  2 -0.453452  0.367264 -0.002362  0.411193
  3  2.271958  0.240864 -0.923934 -0.572957
  4  0.737893 -0.523488  0.485497 -2.371977
D 1  1.133661 -0.584973 -0.713320 -0.656315
  2 -1.173231 -0.490667  0.634677  1.711015
  3 -0.050371 -0.175644  0.124797  0.703672
  4  1.349595  0.122202 -1.498178  0.013391

df
    level_0  level_1         0         1         2         3
0        A        1 -1.055128 -0.845019 -2.853027  0.521738
1        A        2  0.397804  0.385045 -0.121294 -0.696215
2        A        3 -0.551836 -0.666953 -0.956578  1.929732
3        A        4 -0.154780  1.778150  0.183104 -0.013989
4        B        1 -0.315476  0.564419  0.492496 -1.052432
5        B        2 -0.695300  0.085265  0.701724 -0.974168
6        B        3 -0.879915 -0.206499  1.597701  1.294885
7        B        4  0.653261  0.279641 -0.800613  1.050241
8        C        1  1.004199 -1.377520 -0.672913  1.491793
9        C        2 -0.453452  0.367264 -0.002362  0.411193
10       C        3  2.271958  0.240864 -0.923934 -0.572957
11       C        4  0.737893 -0.523488  0.485497 -2.371977
12       D        1  1.133661 -0.584973 -0.713320 -0.656315
13       D        2 -1.173231 -0.490667  0.634677  1.711015
14       D        3 -0.050371 -0.175644  0.124797  0.703672
15       D        4  1.349595  0.122202 -1.498178  0.013391 

dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection:  1.0502406808918188 <class 'numpy.float64'> 

dfm_selection timeit: 63.92458086000079
df_selection timeit:  450.4555013199997

Conclusion

MultiIndex single-value retrieval is over 7 times faster than conventional DataFrame single-value retrieval.

The syntax for MultiIndex retrieval is much cleaner.
