Updating a dataframe column in Spark

Question

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify dataframe columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be df.ix[x,y] = new_value

Edit: Consolidating what was said below: you can't modify an existing dataframe because it is immutable, but you can return a new dataframe with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

from pyspark.sql import functions as F

update_func = (F.when(F.col('update_col') == replace_val, new_value)
                .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the dataframe:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_func(value):
    # placeholder: transform the incoming column value here
    transformed_value = value
    return transformed_value

# if we assume that my_func returns a string
my_udf = F.udf(my_func, T.StringType())

df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you could add the additional step:

df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')

Answer 1

While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply, and then selectively apply that function to the targeted column only. In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well), but all values in column target_column will be new_value.

Answer 2

Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in pyspark without UDFs:

# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F
df = df.withColumn(update_col,
    F.when(df[update_col] == old_value, new_value)
    .otherwise(df[update_col]))

Answer 3

DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you need to create a new DataFrame by transforming the original one, using either the SQL-like DSL or RDD operations like map.

A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.

Answer 4

As maasg says, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two columns:

import org.apache.spark.sql.Row
val newDf = sqlContext.createDataFrame(
  df.rdd.map(row => Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))),
  df.schema)

Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the API of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html

[Update] Or using UDFs in Scala:

import org.apache.spark.sql.functions._

val toLong = udf[Long, String] (_.toLong)

val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")

and if the column name needs to stay the same you can rename it back:

modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")

Answer 5

Import col and when from pyspark.sql.functions, then map the strings in the fifth column (string a, string b, string c) to the integers 0, 1, 2 in a new DataFrame:

from pyspark.sql.functions import col, when 

data_frame_temp = data_frame.withColumn("col_5",when(col("col_5") == "string a", 0).when(col("col_5") == "string b", 1).otherwise(2))

5 answers

Answer 1: 76, accepted, 2015-03-25 13:35:02
Answer 2: 50, 2015-12-21 22:23:26
Answer 3: 13, 2015-03-17 21:51:45
Answer 4: 11, 2015-11-08 21:19:36
Answer 5: 4, 2020-05-26 15:59:15
