How do I add a new column to a (PySpark) DataFrame using logic from a string (or some other kind of metadata)?

I am trying to add a new column to a dataframe dynamically, using logic stored elsewhere.

I want to loop over new column names and the corresponding column logic, held in an array or list, and use these values as the parameters to the withColumn function.

Using an example dataframe from the Titanic dataset, I have been trying to use the exec() function to execute a string that creates a new dataframe with a column whose logic is defined in a string.

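# Create the Spark Titanic dataframe
import pandas as pd
from pyspark.sql import SparkSession

# Reuses the session a notebook already provides; creates one otherwise
spark = SparkSession.builder.getOrCreate()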

data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'Sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}}

df1 = spark.createDataFrame(pd.DataFrame(data1, columns=data1.keys()))
df1.show()

Below is a function that takes the name of the old dataframe, the new column name and the logic used to calculate the new column. The function builds a string of the form: df3=df1.withColumn('diff_PassengerId',df1.PassengerId)

The function then executes the string.

def testfunc(dfname,colname,col_logic):
  print("dataframe:",dfname,"colname:",colname,"col_logic:",col_logic)
  string="df3="+dfname+".withColumn("+"'diff_PassengerId'"+","+col_logic+")"
  print(string)
  return exec(string)

testfunc('df1','diff_PassengerId','df1.PassengerId+1')

df3.show()

I expected a new dataframe df3 to be created with a new column "diff_PassengerId".

Instead, I get this error on execution:

NameError: name 'df3' is not defined
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-3662686508692761> in <module>()
      9 
     10 #df3=df1.withColumn('diff_PassengerId',df1.PassengerId)
---> 11 df3.show()

NameError: name 'df3' is not defined

When I call show() inside the string, i.e. when string="df3="+dfname+".withColumn("+"'diff_PassengerId'"+","+col_logic+").show()", it prints the dataframe. Therefore the string is being executed. However, the df3 dataframe is not being created outside of the exec function.

Any help will be highly appreciated. Many thanks.

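The most likely reason is scope: exec(string) runs inside testfunc, so the name df3 is bound in the function's local namespace rather than at the top level of the notebook, and it is gone once the function returns. Note also that exec() always returns None, and that the function receives only the dataframe's name as a string, not the dataframe itself.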
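The scoping behaviour behind the NameError can be reproduced without Spark. The sketch below is plain Python; the helper names run_in_function and run_with_namespace are made up for illustration. It shows that a name assigned by exec() inside a function does not appear at module level, and that passing an explicit dictionary as the namespace is one way to get the result back.

```python
def run_in_function(code):
    # The assignment inside `code` is stored in a temporary local
    # namespace, not at module level, so it is lost on return.
    exec(code)


def run_with_namespace(code, namespace):
    # With an explicit dict, exec() stores every name it binds in that dict.
    exec(code, namespace)
    return namespace


run_in_function("result = 1 + 1")
print("result" in globals())  # False

ns = run_with_namespace("result = value + 1", {"value": 1})
print(ns["result"])  # 2
```

Assuming df1 and string are defined as in the question, the same pattern in the notebook would be roughly ns = {"df1": df1}; exec(string, ns); df3 = ns["df3"].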

One option is to remove the function and do the following:

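dfname, colname, col_logic = 'df1', 'diff_PassengerId', 'df1.PassengerId+1'
# Build the statement as text, using colname instead of a hard-coded column name
string = "df3=" + dfname + ".withColumn('" + colname + "'," + col_logic + ")"
# Running exec() at the top level binds df3 as a global
exec(string)
df3.show()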

Or move the execution outside the function:

def testfunc(dfname, colname, col_logic):
    print("dataframe:", dfname, "colname:", colname, "col_logic:", col_logic)
    # Only build the statement here; running it is left to the caller
    string = "df3=" + dfname + ".withColumn('" + colname + "'," + col_logic + ")"
    print(string)
    return string

# exec() at the top level binds df3 in the global namespace
exec(testfunc('df1', 'diff_PassengerId', 'df1.PassengerId+1'))
df3.show()

In both cases you get the following output:

+-----------+--------+------+--------+----------------+
|PassengerId|    Name|   Sex|Survived|diff_PassengerId|
+-----------+--------+------+--------+----------------+
|          1|    Owen|  male|       0|               2|
|          2|Florence|female|       1|               3|
|          3|   Laina|female|       1|               4|
|          4|    Lily|female|       1|               5|
|          5| William|  male|       0|               6|
+-----------+--------+------+--------+----------------+

Or pass a dataframe to your function as input.
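A minimal sketch of that last option, which also avoids exec() entirely. It rests on an assumption that differs from the question: the column logic is stored as Spark SQL expression strings (for example "PassengerId + 1") rather than as Python source such as 'df1.PassengerId+1', so that pyspark.sql.functions.expr can parse it. The names add_columns, COLUMN_SPECS, to_column and the is_female column are hypothetical.

```python
def add_columns(df, column_specs, to_column=None):
    """Return df with one new column per (name, sql_expression) pair.

    to_column turns an expression string into a Column. It defaults to
    pyspark.sql.functions.expr and is a parameter only so that the loop
    can be exercised without a Spark session.
    """
    if to_column is None:
        # Imported lazily; this default path needs PySpark installed.
        from pyspark.sql.functions import expr as to_column
    for name, sql_expression in column_specs:
        # withColumn returns a new dataframe, so rebind df on each pass.
        df = df.withColumn(name, to_column(sql_expression))
    return df


# Hypothetical metadata: (new column name, Spark SQL expression)
COLUMN_SPECS = [
    ("diff_PassengerId", "PassengerId + 1"),
    ("is_female", "Sex = 'female'"),
]

# Usage with df1 from the question (requires a running Spark session):
# df3 = add_columns(df1, COLUMN_SPECS)
# df3.show()
```

Because the dataframe goes in as an argument and comes back as the return value, nothing depends on variable names being visible in a particular scope.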
