How do I assign a new column to a slice of a pandas DataFrame with a MultiIndex?

Question

I have a pandas DataFrame with a MultiIndex like this:

import pandas as pd
import numpy as np

arr = [1]*3 + [2]*3
arr2 = list(range(3)) + list(range(3))
mux = pd.MultiIndex.from_arrays([
    arr,
    arr2
], names=['one', 'two'])

df = pd.DataFrame({'a': np.arange(len(mux))}, mux)
df

I have a function that takes a slice of the DataFrame and needs to assign a new column to the rows in that slice:

def work(df):
    b = df.copy()

    #do some work on the slice and create values for a new column of the slice
    b['b'] = b['a']*2

    #assign the new values back to the slice in a new column
    df['b'] = b['b']

#pass in a slice of the df with only records that have the last value for 'two'
work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])

But calling the function raises an error:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()

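How can I create a new column 'b' in the original DataFrame and assign its values only for the rows passed to the function, leaving the rest of the rows as NaN?

The desired output is: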

        a   b
one two 
1   0   0   nan
1   1   1   nan
1   2   2   4
2   0   3   nan
2   1   4   nan
2   2   5   10

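Note: in the real work function I am performing a series of complex operations, including calling other functions to generate the values for the new column, so I don't think a simple one-line expression will work. Multiplying by 2 in my example is for illustration only.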

Answer 1

You don't actually have an error, only a warning. Try this:

def work(df):
    b = df.copy()

    #do some work on the slice and create values for a new column of the slice
    b['b'] = b['a']*2

    #assign the new values back to the slice in a new column
    df['b'] = b['b']
    return df

#pass in a slice of the df with only records that have the last value for 'two'
new_df = work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])

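Then merge the result back into the original frame. Since new_df still carries one and two in its index, the merge joins on the shared column a: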

df.reset_index().merge(new_df, how="left").set_index(["one","two"])

Output:

          a     b
one two     
1   0      0    NaN
    1      1    NaN
    2      2    4.0
2   0      3    NaN
    1      4    NaN
    2      5    10.0
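
As an alternative to the merge, the assignment can rely on index alignment. The following is a minimal sketch, assuming work is rewritten to return the computed column instead of writing to the slice it receives (this variant of work and the names part, last_two and mask are illustrative, not part of the code above):

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
mux = pd.MultiIndex.from_arrays(
    [[1] * 3 + [2] * 3, list(range(3)) * 2], names=["one", "two"]
)
df = pd.DataFrame({"a": np.arange(len(mux))}, index=mux)


def work(part):
    # Hypothetical variant: compute on a copy and return only the new values,
    # so nothing is ever written to the slice itself.
    out = part.copy()
    out["b"] = out["a"] * 2
    return out["b"]


# Rows whose 'two' level equals the last value of that level.
last_two = df.index.get_level_values("two")[-1:]
mask = df.index.isin(last_two, level="two")

# Assigning a Series aligns on the MultiIndex; rows missing from it become NaN.
df["b"] = work(df.loc[mask])
print(df)
```

Because the function only touches its own copy, this version should not trigger SettingWithCopyWarning, and the original index is kept without a reset_index/set_index round trip.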

Answer 2

I don't think you need a separate function at all. Try this...

df['b'] = df['a'].where(df.index.isin(df.index.get_level_values('two')[-1:], level=1))*2

Here, Series.where() called on df['a'] should return a Series whose values are NaN for the rows that your query does not select.
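
If the real computation is too involved for a single expression, the same boolean mask can drive a .loc assignment instead. This is only a sketch: compute is a hypothetical stand-in for the complex logic, and kept is there purely to show what Series.where() returns.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
mux = pd.MultiIndex.from_arrays(
    [[1] * 3 + [2] * 3, list(range(3)) * 2], names=["one", "two"]
)
df = pd.DataFrame({"a": np.arange(len(mux))}, index=mux)

# True for rows whose 'two' level equals the last value of that level.
mask = df.index.isin(df.index.get_level_values("two")[-1:], level="two")

# Series.where keeps values where the condition is True and puts NaN elsewhere.
kept = df["a"].where(mask)


def compute(part):
    # Hypothetical placeholder for arbitrarily complex per-slice logic.
    return part["a"] * 2


# .loc with a row mask and a new column label creates the column and
# leaves the unselected rows as NaN.
df.loc[mask, "b"] = compute(df.loc[mask])
print(df)
```

Writing through df.loc on the original frame, rather than on a slice of it, follows the form the warning message itself suggests.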

2 answers

Answer 1
1 accepted 2020-06-11 17:04:36

Answer 2
0 2020-06-11 17:04:02
