Can I change the nullability of a column in a Spark dataframe?

Question

I have a StructField in a dataframe that is not nullable. Simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

which returns:

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted the code therein to this:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

which failed with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?

Answer 1

I know this question is already answered, but I was looking for a more generic solution when I came up with this:

def set_df_columns_nullable(spark, df, column_list, nullable=True):
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod

You can then call it like this:

set_df_columns_nullable(spark,df,['name','age'])

Answer 2

It seems you missed the StructType(newSchema).

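import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
# foo is declared with nullable=True here, since that is the goal of the question
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),True)]
# wrap the list of fields in a StructType before passing it as the schema
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()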

Answer 3

For the general case, one can change the nullability of a column via the nullable property of the StructField of that specific column. Here's an example:

df.schema['col_1']
# StructField(col_1,DoubleType,false)

df.schema['col_1'].nullable = True

df.schema['col_1']
# StructField(col_1,DoubleType,true)

Answer 4

df1 = df.rdd.toDF()
df1.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- foo: boolean (nullable = true)

4 answers

Answer 1
10 2018-08-13 11:35:57

Answer 2
5 accepted 2017-09-06 10:53:28

Answer 3
5 2020-11-25 10:50:50

Answer 4
-1 2017-09-06 10:21:46
