python dataframe as function input and get another dataframe with new name as output

Question

I have a dataframe df with lots of processing on different rows and columns. Eventually I'd like to get a new df called, e.g., processed_df. This is what I have done:

import pandas as pd
import numpy as np

def foofunc(df):
    name =[x for x in globals() if globals()[x] is df][0] # get df name as string
    output_df='processed_'+str(name)
    
    output_df=df.head(2) # e.g as process, in reality is ~ 50 operations
    print(f'output dataframe name is: {str(output_df)})') #expect to get: processed_df
    return output_df

testdf = pd.DataFrame(np.random.randint(0,100,size=(5, 2)), columns=list('AB'))
foofunc(testdf) # expect to get processed_testdf

processed_df

Then, on the last line, I get the error:

NameError: name 'processed_df' is not defined

To be clearer, this is part of a pipeline, so I'd like to pass in a df and get back the processed one under a desired name. In general, is my approach good practice for such operations on dataframes?

Thank you!

Answer 1

I don't see a good reason to have a function auto-generate a name and put its result into the global namespace when Python already binds function results to names. After that name has been generated, how would another piece of code know what it is called? And what if the input df isn't in the function's global namespace, so that its global name (or one of its global names, if it has multiple references) can't be found?
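To make those two failure modes concrete, here is a minimal sketch. The helper names_bound_to and the explicit namespace dict are hypothetical stand-ins for the globals() lookup, and a plain list stands in for the dataframe, since the lookup relies only on object identity:

```python
def names_bound_to(obj, namespace):
    """Return every name in `namespace` whose value is the very same object as `obj`."""
    return [name for name, value in namespace.items() if value is obj]

data = [1, 2, 3]                              # stand-in for a dataframe
namespace = {"testdf": data, "alias": data}   # stand-in for globals()

# Two names refer to the same object, so "the" name is ambiguous.
two_names = names_bound_to(data, namespace)

# An equal but distinct object is bound to no name at all.
no_names = names_bound_to([1, 2, 3], namespace)
```

Taking element [0] of the first result silently picks one of several names, and taking it from the second raises IndexError.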

There are many ways to write a pipeline, the easiest being

df = do_thing_1(df)
df = do_thing_2(df)
...

This has the advantage that the caller gets to decide the name. It also gets rid of intermediate dataframes that would otherwise consume memory.
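As a rough sketch of that style, assuming two made-up steps (keep_first_rows and add_total are hypothetical and only stand in for the real processing), each step returns a dataframe and the caller binds the result to whatever name it wants. DataFrame.pipe gives an equivalent chained form:

```python
import pandas as pd

def keep_first_rows(df, n=2):
    """Hypothetical step: return the first n rows as a new dataframe."""
    return df.head(n)

def add_total(df):
    """Hypothetical step: add a column holding the row-wise sum of A and B."""
    return df.assign(total=df["A"] + df["B"])

testdf = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]})

# The caller picks the output name; the steps never touch globals().
processed_testdf = add_total(keep_first_rows(testdf))

# The same pipeline written as a chain with DataFrame.pipe.
processed_again = testdf.pipe(keep_first_rows, n=2).pipe(add_total)
```

Neither step modifies testdf, so the original stays available under its own name.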

That said, your problem is that you don't assign the result back to the global namespace, and you use the wrong name for the generated dataframe (which brings us back to the "how do you know what the name is" problem). A solution is

import pandas as pd
import numpy as np

def foofunc(df):
    name =[x for x in globals() if globals()[x] is df][0] # get df name as string
    output_df_name='processed_'+str(name)
    
    output_df=df.head(2) # e.g as process, in reality is ~ 50 operations
    print(f'output dataframe name is: {output_df_name}') # e.g. processed_testdf
    globals()[output_df_name] = output_df

testdf = pd.DataFrame(np.random.randint(0,100,size=(5, 2)), columns=list('AB'))
foofunc(testdf) # expect to get processed_testdf

processed_testdf
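If the output really does need a name derived from the input, one alternative to writing into globals() is to keep the dataframes in a dictionary keyed by name. This is only a sketch: the frames and processed dictionaries and the process function are assumptions for illustration, not part of the original code.

```python
import pandas as pd

def process(df):
    """Stand-in for the real processing (about 50 operations in the question)."""
    return df.head(2)

# Inputs are registered under explicit names instead of being looked up in globals().
frames = {"testdf": pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})}

# Each output is stored under a key derived from the input's key.
processed = {f"processed_{name}": process(df) for name, df in frames.items()}
```

Other code can then look up processed["processed_testdf"] or iterate over the dictionary, without guessing which global names exist.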

Solution 1 (accepted, 2020-11-26 16:34:42)