Applying similar functions across multiple columns in python/pandas

Problem: Given the dataframe below, I'm trying to come up with code that will apply a function to three distinct columns without having to write three separate function calls.

The code for the data:

import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    'days': [365, 365, 213, 318, 71],
    'spend_30day': [22, 241.5, 0, 27321.05, 345],
    'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
    'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54]}

df = pd.DataFrame(data)
cols = df.columns.tolist()
cols = ['name', 'days', 'spend_30day', 'spend_90day', 'spend_365day']
df = df[cols]
df

The function below essentially annualizes spend: if someone has fewer than, say, 365 days in the "days" column, it tells me what the spend would have been if they had 365 days:

def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']

Then I apply the function to the particular column:

df.spend_365day = df.apply(annualize_spend_365, axis=1).round(2)
df
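As a quick sanity check of the arithmetic, here is a minimal sketch that runs the same function on two rows from the data above. Building the rows by hand as `pd.Series` objects is only an assumption for illustration; inside `df.apply(..., axis=1)` pandas supplies each row itself.

```python
import pandas as pd

def annualize_spend_365(row):
    # Scale spend up to a full year when fewer than 365 days were observed.
    if row['days'] / float(365) < 1:
        return row['spend_365day'] / (row['days'] / float(365))
    return row['spend_365day']

# Hand-built rows (illustrative): Tina has 213 days, Jason has a full 365.
tina = pd.Series({'days': 213, 'spend_365day': 211.65})
jason = pd.Series({'days': 365, 'spend_365day': 854.56})

tina_annualized = round(annualize_spend_365(tina), 2)    # 211.65 * 365 / 213
jason_annualized = round(annualize_spend_365(jason), 2)  # left unchanged

print(tina_annualized, jason_annualized)
```

Tina's spend is scaled by 365/213, while Jason's is returned as-is because his ratio is not below 1.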

This works exactly as I want it to for that one column. However, I don't want to have to rewrite this for each of the three different "spend" columns (30, 90, 365). I want to be able to write code that will generalize and apply this function to multiple columns in one pass.

I thought I could create lists of the columns and their respective days, use the "zip" function, and nest the function in a for loop, but my attempt below ultimately fails:

spend_cols = [df.spend_30day, df.spend_90day, df.spend_365day]
days_list = [30, 90, 365]

for col, day in zip(spend_cols, days_list):
    def annualize_spend(row):
        if row.days/float(day) < 1:
            return (row.col)/((row.days)/float(day))
        else:
            return row.col
    col = df.apply(annualize_spend, axis = 1)

The error:

AttributeError: ("'Series' object has no attribute 'col'")

I'm not sure why the loop approach is failing. Regardless, I'm hoping for guidance on how to generalize function application in pandas. Thanks in advance!

Look at your two function definitions:

def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']

and

#col in [df.spend_30day, df.spend_90day, df.spend_365day]
def annualize_spend(row):
    if row.days/float(day) < 1:
        return (row.col)/((row.days)/float(day))
    else:
        return row.col

See the difference? In the first case you access the fields with explicit field names, and it works. In the second case you try to access row.col, which fails: the row has no field named "col", and your loop variable col holds the values of the corresponding column of df rather than a column name. Instead, try

spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']

before your loop. Also, attribute syntax only works with literal names: in df.days the field name really is "days", but df.col looks for a field literally named "col", not for the field whose name is stored in the string col. So you will want to use row[col] in the latter case as well. And in any case, I'm not sure how wise it is to reuse col as the output variable inside your loop over col: assigning to it only rebinds the loop variable and writes nothing back to df.
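The difference between the two lookup styles can be seen on a single row. This is a minimal sketch; the hand-built `row` is a stand-in for what `df.apply(..., axis=1)` would pass in.

```python
import pandas as pd

# Illustrative row with two of the labels from the question's dataframe.
row = pd.Series({'days': 213, 'spend_30day': 0.0})
col = 'spend_30day'  # a variable that holds a column name

by_name = row[col]          # bracket access uses the string stored in `col`
by_attr = row.spend_30day   # attribute access needs the literal label

try:
    row.col                 # looks for a label literally called "col"
    attr_failed = False
except AttributeError:
    attr_failed = True      # same error as reported in the question

print(by_name, by_attr, attr_failed)
```

Bracket access is the form that generalizes, because the label can come from a variable.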


I'm unfamiliar with pandas.DataFrame.apply, but it should be possible to use a single function definition that takes the number of days and the field of interest as input variables:

def annualize_spend(col,day,row):
    if row['days']/float(day) < 1:
        return (row[col])/((row['days'])/float(day))
    else:
        return row[col]

spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]

for col, day in zip(spend_cols, days_list):
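    # Write the result back into the dataframe; assigning to col would only rebind the loop variable.
    df[col] = df.apply(lambda row, col=col, day=day: annualize_spend(col, day, row), axis=1)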

The lambda ensures that only one input parameter of your function (row) is left free when it gets applied.
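As an alternative to the lambda, `DataFrame.apply` accepts an `args` tuple of extra positional arguments that it passes to the function after the row. The sketch below is one way this could look, not the original answer's method. It assumes the function is reordered so that `row` comes first, and the names `windows` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    'days': [365, 365, 213, 318, 71],
    'spend_30day': [22, 241.5, 0, 27321.05, 345],
    'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
    'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54],
})

def annualize_spend(row, col, day):
    # `row` must be first: apply passes it before anything in `args`.
    if row['days'] / float(day) < 1:
        return row[col] / (row['days'] / float(day))
    return row[col]

# Hypothetical mapping from column name to the length of its window in days.
windows = {'spend_30day': 30, 'spend_90day': 90, 'spend_365day': 365}

result = df.copy()  # leave the original dataframe untouched
for col, day in windows.items():
    result[col] = df.apply(annualize_spend, axis=1, args=(col, day)).round(2)

print(result)
```

Keeping the column names and window lengths together in one mapping avoids maintaining two parallel lists.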
