What is the Pythonic way of creating a pandas variable from multiple other variables?

I am an R programmer currently learning Python and pandas. I am trying to work out how to create a new variable, clearly and cleanly, from a function that uses multiple existing variables.

Note that the function in my example isn't complex, but I am trying to generalise to an arbitrary function that could be significantly more complex or take more variables. In other words, I want to avoid solutions optimised for this specific function and am looking for how to handle the general scenario.

For reference, this is how I would do it in R:

library(tidyverse)

df <- data_frame(
    num = c(15, 52 , 24 , 29),
    cls = c("a" , "b" , "b", "a")
)

attempt1 <- function( num , cls){
    if ( cls == "a") return( num + 10)
    if ( cls == "b") return( num - 10)
}

## Example 1
df %>% 
    mutate( num2 = map2_dbl( num , cls , attempt1))

## Example 2
df %>% 
    mutate( num = ifelse( num <= 25 , num + 10 , num)) %>% 
    mutate( num2 = map2_dbl( num , cls , attempt1))

Reading the pandas documentation and various SO posts, I have found multiple ways of achieving this in Python, but none of them sit well with me. For reference, my three current solutions are below:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "num" : [14, 52 , 24 , 29],
    "cls" : ["a" , "b" , "b" ,"a"]
})

### Example 1

def attempt1( num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

df.assign( num2 = df.apply( lambda x: attempt1(x["num"] , x["cls"]) , axis = 1))


def attempt2( df):
    if df["cls"] == "a":
        return df["num"] + 10
    if df["cls"] == "b":
        return df["num"] - 10

df.assign( num2 = df.apply(attempt2, axis=1))



def attempt3(df):
    df["num2"] = attempt1(df["num"], df["cls"])
    return df

df.apply( attempt3 , axis = 1)



### Example 2

df.assign( num = np.where( df["num"] <= 25 , df["num"] + 10 , df["num"]))\
    .apply( attempt3 , axis = 1)

My issue with attempt 1 is that it is horribly verbose. In addition, you need to refer back to your starting dataset, which means that if you wanted to chain multiple derivations together you would have to write your dataset out to intermediate variables even if you had no intention of keeping them.

Attempt 2 has significantly cleaner syntax but still suffers from the intermediate-variable problem. Another issue is that the function expects a dataframe, which makes it harder to unit test, less flexible, and less clear about what its inputs should be.

Attempt 3 seems the best to me in terms of functionality, as it gives you a clear, testable function and doesn't require saving intermediate datasets. The major downside is that you now need two functions, which feels like redundant code.

Any help or advice would be greatly appreciated.

You can rely on Series.where to do the job: create a column that contains 10, then change it to -10 depending on the value of cls. You can then use that column in whatever arithmetic operation you want.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.where.html

Step by step (verbose) example:

df['what_to_add'] = 10
df['what_to_add'] = df['what_to_add'].where(df['cls'] == 'a', -10)
df['num'] = df['num'] + df['what_to_add']
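The same idea can be condensed so that no helper column is left behind and the original frame is not modified. This is only a sketch using the question's data; the names `offset` and `result` are mine, and writing to `num2` instead of overwriting `num` is an assumption about what the asker wants.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

# Start from a Series of 10s; keep 10 where cls == "a", otherwise use -10.
offset = pd.Series(10, index=df.index).where(df["cls"] == "a", -10)

# assign returns a new frame, so df itself is left untouched.
result = df.assign(num2=df["num"] + offset)
print(result)
```

Because `assign` returns a copy, this form can sit in the middle of a method chain.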

Another possibility, given that your two numbers are opposites, is to define a column for the sign of the operand:

# +1 where cls == 'a', -1 otherwise, so that 'a' adds 10 and 'b' subtracts 10
df['sign'] = 2 * (df['cls'] == 'a').astype(int) - 1
df['num'] = df['num'] + df['sign'] * 10

A third way is to use replace, so that "a" is replaced by 10 and "b" by -10:

df['what_to_add'] = df['cls'].replace(['a', 'b'], [10, -10])
df['num'] = df['num'] + df['what_to_add']

Edit: Or, as proposed by JPP ( https://stackoverflow.com/a/49748695/4582949 ), using map:

df['num2'] = df['num'] + df['cls'].map({'a': 10, 'b': -10})

One efficient method is to use pd.Series.map:

df['num2'] = df['num'] + df['cls'].map({'a': 10, 'b': -10})

This uses a dictionary to map each value of cls to either 10 or -10.
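To address the intermediate-variable concern from the question, the dictionary lookup can be combined with `assign` and a callable, which receives the frame as it stands at that point in the chain. The following is a sketch of the question's "Example 2" under that approach; `offsets` and `result` are illustrative names, not part of the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

offsets = {"a": 10, "b": -10}

result = (
    df
    # Step 1: bump small values of num; the lambda sees the current frame.
    .assign(num=lambda d: np.where(d["num"] <= 25, d["num"] + 10, d["num"]))
    # Step 2: derive num2 from the updated num and the mapped offset.
    .assign(num2=lambda d: d["num"] + d["cls"].map(offsets))
)
print(result)
```

One caveat: any value of cls that is missing from the dictionary maps to NaN, so the resulting column becomes float and contains missing values for those rows.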

There are many other methods (see @Guybrush's answer), but the dictionary-based method is extensible and efficient for larger dataframes. In my opinion, it is also readable.
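For the arbitrary-function case raised in the question, where a dictionary lookup is not enough, one option is to keep the scalar function exactly as written and call it row by row inside `assign`. This is a sketch rather than a recommendation from either answer: it reuses `attempt1` from the question, and the list comprehension over `zip` is my own choice. It is not vectorised, so it will likely be slower than `map` on large frames.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

def attempt1(num, cls):
    # Plain scalar function: easy to unit test without any dataframe.
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

result = df.assign(
    # The lambda receives the current frame, so this step can be chained.
    num2=lambda d: [attempt1(n, c) for n, c in zip(d["num"], d["cls"])]
)
print(result)
```

This keeps a single testable function with explicit arguments and avoids both the row-wise `apply` wrapper and the intermediate variable.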

Relevant: Replace values in a pandas series via dictionary efficiently
