Apply a function to a MultiIndex pandas.DataFrame column

Question

I have a MultiIndex pandas DataFrame in which I want to apply a function to one of its columns and assign the result to that same column.

In [1]:
    import numpy as np
    import pandas as pd
    cols = ['One', 'Two', 'Three', 'Four', 'Five']
    df = pd.DataFrame(np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3,5), index = list('ABC'), columns=cols)
    df.to_hdf('/tmp/test.h5', 'df')
    df = pd.read_hdf('/tmp/test.h5', 'df')
    df
Out[1]:
         One     Two     Three  Four    Five
    A    A       B       C      D       E
    B    F       G       H      I       J
    C    K       L       M      N       O
    3 rows × 5 columns

In [2]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df['L']['Five'] = df['L']['Five'].apply(lambda x: x.lower())
    df
-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead 
Out[2]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [3]:
    df.columns = ['One', 'Two', 'Three', 'Four', 'Five']
    df    
Out[3]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [4]:
    df['Five'] = df['Five'].apply(lambda x: x.upper())
    df
Out[4]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

As you can see, the function is not applied to the column, I guess because of this warning:

-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

What is strange is that this warning only appears sometimes, and I haven't been able to work out when it appears and when it does not.

I managed to apply the function by slicing the DataFrame with .loc, as the warning recommended:

In [5]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df.loc[:,('L','Five')] = df.loc[:,('L','Five')].apply(lambda x: x.lower())
    df

Out[5]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       e
    B    F      G       H       I       j
    C    K      L       M       N       o
    3 rows × 5 columns

but I would like to understand why this behavior happens with dict-like slicing (e.g. df['L']['Five']) and not with .loc slicing.

NOTE: The DataFrame comes from an HDF file that was not multi-indexed. Is this perhaps the cause of the strange behavior?

EDIT: I'm using pandas v0.13.1 and NumPy v1.8.0.

Answer 1

df['L']['Five'] first selects level 0 with the value 'L' and returns a DataFrame; the column 'Five' is then selected from that DataFrame, which returns the Series you asked for.

The __getitem__ accessor for a DataFrame (the []) will try to do the right thing and give you the correct column. However, this is chained indexing; see the pandas documentation on chained indexing.

To access a multi-index, use the tuple notation ('a','b') together with .loc, which is unambiguous, e.g. df.loc[:,('a','b')]. Furthermore, this allows indexing on several axes at the same time (e.g. rows AND columns).
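As a small illustration of the tuple notation, here is a sketch that rebuilds the frame from the question (without the HDF round trip, which is my simplification) and selects rows and a column in a single .loc call:

```python
import numpy as np
import pandas as pd

# Same data as in the question, built directly instead of read from HDF.
values = np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5)
columns = pd.MultiIndex.from_arrays(
    [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
)
df = pd.DataFrame(values, index=list('ABC'), columns=columns)

# One column, addressed by a (level 0, level 1) tuple.
col = df.loc[:, ('L', 'Five')]

# Rows and a column selected in the same call.
cell = df.loc['B', ('L', 'Five')]
subset = df.loc[['A', 'C'], ('U', 'Two')]

print(list(col))     # ['E', 'J', 'O']
print(cell)          # J
print(list(subset))  # ['B', 'L']
```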

So why does this not work when you combine chained indexing with assignment, e.g. df['L']['Five'] = value?

df['L'] returns a DataFrame that is singly indexed. Then another Python operation, df_with_L['Five'], selects the Series indexed by 'Five' (I use a separate variable name here to make the second step explicit). Because pandas sees these as separate events (separate calls to __getitem__), it has to treat them as linear operations: they happen one after another.

Contrast this with df.loc[:,('L','Five')], which passes a nested tuple (:,('L','Five')) to a single call to __getitem__. This allows pandas to deal with the request as a single entity (and, incidentally, it is quite a bit faster because it can index directly into the frame).
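To make the "two calls versus one call" point concrete, here is a toy sketch in plain Python. CallLogger is a hypothetical stand-in, not a pandas class; it only records the keys that reach __getitem__:

```python
class CallLogger:
    """Hypothetical stand-in for a DataFrame: records each key it is indexed with."""

    def __init__(self):
        self.keys = []

    def __getitem__(self, key):
        self.keys.append(key)
        return self  # return self so that a second [] can be chained


chained = CallLogger()
chained['L']['Five']      # two separate __getitem__ calls

single = CallLogger()
single[:, ('L', 'Five')]  # one call that receives a nested tuple

print(chained.keys)  # ['L', 'Five']
print(single.keys)   # [(slice(None, None, None), ('L', 'Five'))]
```

In the chained case the object never sees both keys together, which is why it cannot know that the two steps belong to one logical selection.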

Why does this matter? Since chained indexing is two calls, either call may return a copy of the data, depending on how it is sliced. When you then assign, you are actually setting values on a copy and not on the original frame. It is impossible for pandas to figure this out, because there are two separate Python operations that are not connected.
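The effect of assigning to a copy can be sketched as follows. Whether df['L'] really returns a copy depends on the pandas version and the data layout, so this sketch forces the issue with an explicit .copy(); that call is my assumption to keep the outcome deterministic, not what pandas does internally:

```python
import numpy as np
import pandas as pd

values = np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5)
columns = pd.MultiIndex.from_arrays(
    [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
)
df = pd.DataFrame(values, index=list('ABC'), columns=columns)

# Step 1 of the chain: the intermediate frame (made a copy on purpose here).
df_with_L = df['L'].copy()

# Step 2 of the chain: the assignment only touches the intermediate frame.
df_with_L['Five'] = df_with_L['Five'].str.lower()

print(list(df_with_L['Five']))         # ['e', 'j', 'o']
print(list(df.loc[:, ('L', 'Five')]))  # ['E', 'J', 'O']
```

The original frame is untouched because nothing ever writes the modified intermediate back into it.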

The SettingWithCopy warning is a heuristic to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is much more complicated.

The .loc operation is a single Python operation. It can therefore select a slice (which may still be a copy), but it lets pandas assign that slice back into the frame after it is modified, so the values are set as you would expect.
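Because the whole request arrives in one call, .loc can also assign to a subset of rows and a MultiIndex column at once. A minimal sketch on the question's data (the choice of rows 'A' and 'C' is just for illustration):

```python
import numpy as np
import pandas as pd

values = np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5)
columns = pd.MultiIndex.from_arrays(
    [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
)
df = pd.DataFrame(values, index=list('ABC'), columns=columns)

rows = ['A', 'C']
# Read and write through .loc, so the result lands in the original frame.
df.loc[rows, ('L', 'Five')] = df.loc[rows, ('L', 'Five')].str.lower()

print(list(df.loc[:, ('L', 'Five')]))  # ['e', 'J', 'o']
```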

The reason for the warning is this. Sometimes when you slice an array you simply get a view back, which means you can set values on it with no problem. However, even a single-dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has, say, float and object data) will almost always yield a copy. Whether a view is created depends on the memory layout of the array.
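The view-versus-copy distinction is easiest to see on a plain NumPy array. This sketch uses a single-dtyped integer array; the array and index values are arbitrary:

```python
import numpy as np

arr = np.arange(10)

basic = arr[2:5]        # basic slicing returns a view
fancy = arr[[2, 3, 4]]  # integer-array indexing returns a copy

basic[0] = -1  # writes through to arr[2]
fancy[1] = -2  # only changes the copy; arr[3] is unaffected

print(np.shares_memory(arr, basic))  # True
print(np.shares_memory(arr, fancy))  # False
print(arr[2], arr[3])                # -1 3
```

Both selections pick the same three elements, yet only one of them is connected to the original data.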

Note: this doesn't have anything to do with the source of the data.
