
Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction

# UDF that ignores its input and always returns None
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))

All you need here is a literal and a cast:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

from pyspark.sql import Row

row = Row("foo", "bar")  # named factory so the columns are foo and bar
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)
##  |-- new_column: string (nullable = true)

new_df.show()

## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## |  1|  2|      null|
## |  2|  3|      null|
## +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values
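
Since the motivation is unionAll, here is a minimal sketch of the pattern in practice (df_a and df_b are hypothetical frames invented for illustration; unionByName is used so that column order does not matter):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# df_a has an extra string column that df_b lacks
df_a = spark.createDataFrame([(1, "x")], ["id", "extra"])
df_b = spark.createDataFrame([(2,)], ["id"])

# pad df_b with a typed null column so both schemas line up,
# then union by column name
df_b = df_b.withColumn("extra", lit(None).cast(StringType()))
df_a.unionByName(df_b).show()

## +---+-----+
## | id|extra|
## +---+-----+
## |  1|    x|
## |  2| null|
## +---+-----+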

I would cast lit(None) to NullType instead of StringType, so that if we ever have to filter the non-null rows on that column, it can easily be done as follows:

from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType

df = sc.parallelize([Row(1, "2"), Row(2, "3")]).toDF()

new_df = df.withColumn('new_column', lit(None).cast(NullType()))

new_df.printSchema()

new_df.filter(col("new_column").isNull()).show()
new_df.filter(col("new_column").isNotNull()).show()

Also be careful not to use lit("None") (with quotes) when casting to StringType, since searching for records with the filter condition .isNull() on col("new_column") would then fail.
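
A quick sketch of that pitfall, reusing the two-row df defined above:

from pyspark.sql.functions import col, lit

# lit("None") stores the literal string "None", not a SQL NULL,
# so .isNull() matches nothing
bad = df.withColumn('new_column', lit("None"))
bad.filter(col("new_column").isNull()).count()  ## 0

# lit(None) stores a real NULL, so .isNull() matches every row
good = df.withColumn('new_column', lit(None).cast('string'))
good.filter(col("new_column").isNull()).count()  ## 2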

The option without importing StringType:

df = df.withColumn('foo', F.lit(None).cast('string'))

Full example:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1, 3).toDF('c')
df = df.withColumn('foo', F.lit(None).cast('string'))

df.printSchema()
#     root
#      |-- c: long (nullable = false)
#      |-- foo: string (nullable = true)

df.show()
#     +---+----+
#     |  c| foo|
#     +---+----+
#     |  1|null|
#     |  2|null|
#     +---+----+
