
How to create a new column with a null value using a PySpark DataFrame?

I'm having issues using PySpark DataFrames. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type, and billable_item_sid. I have a function called apply_event_key_transform in which I want to break up the concatenated eventkey and create new columns for each of the elements.

from pyspark.sql import DataFrame


def apply_event_key_transform(data_frame: DataFrame):
    output_df = data_frame.withColumn("account_type", getAccountTypeUDF(data_frame.eventkey)) \
        .withColumn("counter_type", getCounterTypeUDF(data_frame.eventkey)) \
        .withColumn("billable_item_sid", getBiSidUDF(data_frame.eventkey))
    # drop() returns a new DataFrame; reassign so eventkey is actually removed
    output_df = output_df.drop("eventkey")
    return output_df
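As an aside, if the eventkey format uses a fixed delimiter, the same decomposition could be done without any UDFs via the built-in split function. A minimal sketch, assuming a "-" delimiter (the actual key format isn't shown in the question):

from pyspark.sql import DataFrame
from pyspark.sql.functions import split


def apply_event_key_transform_split(data_frame: DataFrame):
    # Assumption: eventkey looks like "<account_type>-<counter_type>-<billable_item_sid>"
    parts = split(data_frame.eventkey, "-")
    return (data_frame
            .withColumn("account_type", parts.getItem(0))
            .withColumn("counter_type", parts.getItem(1))
            .withColumn("billable_item_sid", parts.getItem(2))
            .drop("eventkey"))

A nice side effect: getItem(2) evaluates to null when the key has fewer than three segments, so a missing billable_item_sid becomes a null column value automatically.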

I've created UDF functions to retrieve the account_type, counter_type and billable_item_sid from a given eventkey value. I have a class called EventKey that takes the full eventkey string as a constructor param, and creates an object with data members to access the account_type, counter_type and billable_item_sid.
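The EventKey class itself isn't shown in the question. For context, a minimal sketch of its shape, assuming the same hypothetical "-" delimiter, and a made-up AccountType enum (the get_account_type helper below returns event_key_obj.account_type.name, which suggests account_type is an enum member):

from enum import Enum


class AccountType(Enum):
    # Hypothetical members; the real enum is not shown in the question
    PREPAID = "prepaid"
    POSTPAID = "postpaid"


class EventKey:
    def __init__(self, event_key: str):
        # Assumption: "<account_type>-<counter_type>-<billable_item_sid>"
        parts = event_key.split("-", 2)
        self.account_type = AccountType(parts[0])
        self.counter_type = parts[1]
        # A missing or empty trailing segment is treated as a null sid
        self.billable_item_sid = parts[2] if len(parts) > 2 and parts[2] else None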

from pyspark.sql.functions import udf

getAccountTypeUDF = udf(lambda x: get_account_type(x))
getCounterTypeUDF = udf(lambda x: get_counter_type(x))
getBiSidUDF = udf(lambda x: get_billable_item_sid(x))
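One small point: udf defaults to a StringType return type, so a lambda that returns None already produces a proper null in a string column; the return type can also be spelled out to make that intent explicit:

from pyspark.sql.types import StringType

getBiSidUDF = udf(lambda x: get_billable_item_sid(x), StringType())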


def get_account_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.account_type.name


def get_counter_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.counter_type


def get_billable_item_sid(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.billable_item_sid

The issue I'm running into is that a billable_item_sid can be null, but when I call withColumn with a None value, the column gets dropped from the frame entirely when I attempt to aggregate the data later. Is there a way to create a new column with a null value using withColumn and a UDF?

Things I've tried (for testing purposes):

  1. .withColumn("billable_item_sid", lit(getBiSidUDF(data_frame.eventkey)))
  2. .withColumn("billable_item_sid", lit(None).cast(StringType())) (see the sketch after this list)
  3. Tried a when/otherwise condition on billable_item_sid for null checking
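For the record, approach 2 does create a literal null column without any UDF; a short sketch (column name taken from the question):

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# The nulls exist in the DataFrame itself; as the update below explains,
# they were only being dropped later, at JSON write time.
df = data_frame.withColumn("billable_item_sid", lit(None).cast(StringType()))
df.select("billable_item_sid").show(5)  # displays null rows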

Update: it turned out the problem was caused when writing the DataFrame to JSON. Upgrading PySpark to 3.1.1 solved it; that version has an option called ignoreNullFields, which can be set to False.
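For reference, that option can be passed when writing JSON in PySpark 3.x (the output path here is illustrative):

# Keep null fields in the JSON output instead of omitting them
output_df.write.json("/tmp/output", ignoreNullFields=False)

# Equivalent, via the generic option API:
output_df.write.option("ignoreNullFields", "false").json("/tmp/output")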
