
pyspark - Updating a column based on a calculated value from another calculated column

  1. The following code loads data from a csv file into the dataframe df. A SQL table myTable corresponding to this df already exists, and data will be imported from this df into myTable.
  2. myTable has several columns. Column5 and Column6 exist in myTable and are calculated columns, but these columns do not exist in the csv file.
  3. The values of Column5 are calculated from the values of Column1, and the values of Column6 are calculated from the calculated values of Column5. These values are computed by testFunction1 and testFunction2 respectively.
  4. The code works fine for Column5, but throws the following error on the last line of the code below, .withColumn("Column6", newFunction2(df.Column5)).

Question: What may I be doing wrong here, and how can we fix the error? Note: If I remove Column6 from myTable and remove the last line of the code below, the code successfully loads the data into myTable, with Column5 filled (as intended) with values calculated from Column1.

Error:

AttributeError: 'DataFrame' object has no attribute 'Column6'

Code:

from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="false")

def testFunction1(Col1Value):
  #do some calculation on column1 value and return it to column5
  return mystr1

def testFunction2(value):
  # do some calculation on column5 value and return it to column6
  return mystr2

newFunction1 = F.udf(testFunction1, StringType())
newFunction2 = F.udf(testFunction2, StringType())

df2 = df.withColumn("Column5", newFunction1(df.Column1)) \
      .withColumn("Column6", newFunction2(df.Column5)) 

The problem is in how you are creating df2. You read the dataframe (df) and create the column "Column5", then reference that column on the second line, but "Column5" does not exist yet in df. If you break the last part into two statements, as in the code below, it should solve the problem:

df2 = df.withColumn("Column5", newFunction1(df.Column1))
df3 = df2.withColumn("Column6", newFunction2(df2.Column5))

