
How to create a new column with a null value using a PySpark DataFrame?

I'm having issues using PySpark DataFrames. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type, and billable_item_sid. I have a function called apply_event_key_transform in which I want to break up the concatenated eventkey and create new columns for each of the elements.

from pyspark.sql import DataFrame


def apply_event_key_transform(data_frame: DataFrame):
    output_df = data_frame.withColumn("account_type", getAccountTypeUDF(data_frame.eventkey)) \
        .withColumn("counter_type", getCounterTypeUDF(data_frame.eventkey)) \
        .withColumn("billable_item_sid", getBiSidUDF(data_frame.eventkey))
    # drop() returns a new DataFrame; reassign so eventkey is actually removed
    output_df = output_df.drop("eventkey")
    return output_df
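As an aside, if the eventkey format uses a fixed delimiter, the same decomposition could be done without any UDFs via the built-in split function. A minimal sketch, assuming a "-" delimiter (the actual key format isn't shown in the question):

from pyspark.sql import DataFrame
from pyspark.sql.functions import split


def apply_event_key_transform_split(data_frame: DataFrame):
    # Assumption: eventkey looks like "<account_type>-<counter_type>-<billable_item_sid>"
    parts = split(data_frame.eventkey, "-")
    return (data_frame
            .withColumn("account_type", parts.getItem(0))
            .withColumn("counter_type", parts.getItem(1))
            .withColumn("billable_item_sid", parts.getItem(2))
            .drop("eventkey"))

A nice side effect: getItem(2) evaluates to null when the key has fewer than three segments, so a missing billable_item_sid becomes a null column value automatically.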

I've created UDF functions to retrieve the account_type, counter_type and billable_item_sid from a given eventkey value. I have a class called EventKey that takes the full eventkey string as a constructor param, and creates an object with data members to access the account_type, counter_type and billable_item_sid.
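The EventKey class itself isn't shown in the question. For context, a minimal sketch of its shape, assuming the same hypothetical "-" delimiter, and a made-up AccountType enum (the get_account_type helper below returns event_key_obj.account_type.name, which suggests account_type is an enum member):

from enum import Enum


class AccountType(Enum):
    # Hypothetical members; the real enum is not shown in the question
    PREPAID = "prepaid"
    POSTPAID = "postpaid"


class EventKey:
    def __init__(self, event_key: str):
        # Assumption: "<account_type>-<counter_type>-<billable_item_sid>"
        parts = event_key.split("-", 2)
        self.account_type = AccountType(parts[0])
        self.counter_type = parts[1]
        # A missing or empty trailing segment is treated as a null sid
        self.billable_item_sid = parts[2] if len(parts) > 2 and parts[2] else None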

from pyspark.sql.functions import udf

getAccountTypeUDF = udf(lambda x: get_account_type(x))
getCounterTypeUDF = udf(lambda x: get_counter_type(x))
getBiSidUDF = udf(lambda x: get_billable_item_sid(x))
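One small point: udf defaults to a StringType return type, so a lambda that returns None already produces a proper null in a string column; the return type can also be spelled out to make that intent explicit:

from pyspark.sql.types import StringType

getBiSidUDF = udf(lambda x: get_billable_item_sid(x), StringType())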


def get_account_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.account_type.name


def get_counter_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.counter_type


def get_billable_item_sid(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.billable_item_sid

The issue I'm running into is that a billable_item_sid can be null, but when I call withColumn with a None value, the column gets dropped from the frame entirely when I attempt to aggregate the data later. Is there a way to create a new column with a null value using withColumn and a UDF?

Things I've tried (for testing purposes):

  1. .withColumn("billable_item_sid", lit(getBiSidUDF(data_frame.eventkey)))
  2. .withColumn("billable_item_sid", lit(None).cast(StringType())) (see the sketch after this list)
  3. Tried a when/otherwise condition on billable_item_sid for null checking
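For the record, approach 2 does create a literal null column without any UDF; a short sketch (column name taken from the question):

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# The nulls exist in the DataFrame itself; as the update below explains,
# they were only being dropped later, at JSON write time.
df = data_frame.withColumn("billable_item_sid", lit(None).cast(StringType()))
df.select("billable_item_sid").show(5)  # displays null rows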

Update: it turned out the problem was caused when writing the DataFrame to JSON. Upgrading PySpark to 3.1.1 solved it; that version has an option called ignoreNullFields, which can be set to False.
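For reference, that option can be passed when writing JSON in PySpark 3.x (the output path here is illustrative):

# Keep null fields in the JSON output instead of omitting them
output_df.write.json("/tmp/output", ignoreNullFields=False)

# Equivalent, via the generic option API:
output_df.write.option("ignoreNullFields", "false").json("/tmp/output")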
