
How can I modify a Pandas DataFrame by reference?

I'm trying to write a Python function that does One-Hot encoding in-place, but I'm having trouble finding a way to do the concat operation in-place at the end. concat appears to return a copy of my DataFrame, and I am unable to assign this to the DataFrame that I passed by reference.

How can this be done?

def one_hot_encode(df, col: str):
     """One-Hot encode inplace. Includes NAN.

     Keyword arguments:
     df (DataFrame) -- the DataFrame object to modify
     col (str) -- the column name to encode
     """

     insert_loc = df.columns.get_loc(col)
     insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)

     df.drop(col, axis=1, inplace=True)
     df[:] = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1) # Doesn't take effect outside function

To make the change take effect outside the function, we have to change the object that was passed in rather than rebind its name (inside the function) to a new object.
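The difference between rebinding a name and mutating the object can be seen in a minimal sketch. The helper names `rebind` and `mutate` and the toy column names are made up for illustration; they are not part of the question's code.

```python
import pandas as pd

def rebind(df):
    # Rebinds the local name only; the caller's DataFrame is untouched.
    df = pd.concat([df, df.add_suffix("_copy")], axis=1)

def mutate(df):
    # Changes the object that was passed in; the caller sees the new column.
    df["b"] = df["a"] * 2

frame = pd.DataFrame({"a": [1, 2, 3]})

rebind(frame)
cols_after_rebind = list(frame.columns)

mutate(frame)
cols_after_mutate = list(frame.columns)
```

Only the second call changes `frame`, because only `mutate` operates on the object the caller still holds.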

To assign the new columns, you can use

df[insert_data.columns] = insert_data

instead of the concat.
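As a rough illustration of that assignment on a toy frame (the column names and values here are invented, and a reasonably recent pandas is assumed), the new columns land on the same DataFrame object rather than on a copy:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "color": ["D", "E", None]})
same_object = df  # second name for the same DataFrame

insert_data = pd.get_dummies(df["color"], prefix="color", dummy_na=True)

# Column assignment mutates df itself, so same_object sees the new columns too.
df[insert_data.columns] = insert_data

n_cols = same_object.shape[1]
has_all_dummies = set(insert_data.columns) <= set(same_object.columns)
```

The new columns are appended at the end, which is why the ordering needs separate handling.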

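That doesn't take advantage of your careful insert order though. reindex can produce that order, but it returns a new DataFrame instead of modifying df, so calling it inside the function would not change the caller's object. To retain your order in place, we can move the columns one at a time: popping a column and assigning it back appends it at the end, so doing this for every column in the target order leaves df in that order.

for c in cols:
    df[c] = df.pop(c)

where cols is the combined list of columns in order, with col itself left out:

cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc + 1:]

Putting it all together,

import pandas as pd

def one_hot_encode(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """

    cols = list(df.columns)
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)

    # Target order: columns before col, the dummy columns, columns after col.
    cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc + 1:]
    df[insert_data.columns] = insert_data
    df.drop(col, axis=1, inplace=True)

    # Reorder in place: each pop removes a column, each assignment appends it.
    for c in cols:
        df[c] = df.pop(c)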


import seaborn

diamonds = seaborn.load_dataset("diamonds")
one_hot_encode(diamonds, "color")

assert "color" not in diamonds.columns
assert len([c for c in diamonds.columns if c.startswith("color")]) == 8
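The asserts above only count columns; they do not check the order. As a small self-contained check on a toy frame (column names invented for illustration), the sketch below shows that `reindex` returns a new DataFrame and leaves the original order alone, whereas popping and re-assigning each column reorders the same object.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
target = ["c", "a", "b"]

# reindex builds a new DataFrame; df keeps its original column order.
reordered = df.reindex(columns=target)
order_after_reindex = list(df.columns)

# pop removes a column and assignment appends it at the end,
# so walking the target order rearranges df itself.
for name in target:
    df[name] = df.pop(name)
order_after_pop = list(df.columns)
```

This pop-and-reassign loop is one possible way to reorder in place; it touches every column once, so it may be slow on very wide frames.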

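I don't think you can pass function arguments by reference in Python in the sense of rebinding the caller's variable (see: How do I pass a variable by reference?). The function receives a reference to the same object, so mutating that object is visible to the caller, but assigning a new object to the parameter name is not.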

Instead, you can return the modified df from your function and assign the result to the original df:

def one_hot_encode(df, col: str):
    ...
    return df

...
df=one_hot_encode(df, col)

df.insert works in place, but it can only insert one column at a time. It might not be worth the reorder.

import pandas as pd

def one_hot_encode2(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """

    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)

    # Insert each dummy column (as a Series) just before the original column.
    for offset, newcol in enumerate(insert_data.columns):
        df.insert(loc=insert_loc + offset, column=newcol, value=insert_data[newcol])

    df.drop(col, axis=1, inplace=True)


import seaborn

diamonds = seaborn.load_dataset("diamonds")
one_hot_encode2(diamonds, "color")

assert "color" not in diamonds.columns
assert len([c for c in diamonds.columns if c.startswith("color")]) == 8

# The first dummy column sits where "color" used to be.
assert [i for i, c in enumerate(diamonds.columns) if c.startswith("color")][0] == 2

A function's local variables exist only inside that function, so rebinding df there does not affect the caller. Simply include a return statement at the end of the function to get your modified dataframe as output. Calling this function will now return your modified dataframe. Also, when assigning the result of the concat, use df instead of df[:], because the concat changes the dimensions of the original dataframe.

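import pandas as pd

def one_hot_encode(df, col: str):
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    # Drop without inplace=True so the caller's DataFrame is not modified as a side effect.
    df = df.drop(col, axis=1)
    # Columns before col, then the dummy columns, then the columns after col.
    df = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1)
    return df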

Now, to see the modified dataframe, call the function and assign the result to a new or existing dataframe as below:

df=one_hot_encode(df,'<any column name>')
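A toy run of the return-based approach might look like the following sketch. The function name `one_hot_encode_copy`, the sample data, and the plain `prefix=col` are assumptions made for this illustration, not the code from the answers above.

```python
import pandas as pd

def one_hot_encode_copy(df, col):  # hypothetical name for this sketch
    loc = df.columns.get_loc(col)
    dummies = pd.get_dummies(df[col], prefix=col, dummy_na=True)
    rest = df.drop(columns=col)  # new object; the input is left as it was
    # Columns before col, the dummy columns, then the columns after col.
    return pd.concat([rest.iloc[:, :loc], dummies, rest.iloc[:, loc:]], axis=1)

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "color": ["D", "E", None],
    "price": [10, 20, 30],
})

encoded = one_hot_encode_copy(toy, "color")
encoded_cols = list(encoded.columns)
```

Here the caller decides whether to keep the original: `toy` still has its "color" column, and `encoded` holds the dummy columns in its place.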
