
PySpark Dataframe: Column based on existence and Value of another column

What I am trying to do is set a value that flags existence, based on isNotNull, on a potentially nonexistent column.

What I mean is: I have a dataframe A like

A     B     C
-------------------
1     a     "Test"  
2     b     null
3     c     "Test2"

Where C isn't necessarily defined. I want to define another dataframe B:

D     E       F
---------------
1     a      'J'
2     b      'N'
3     c      'J'

where the column F is either 'N' everywhere in case A.C is not defined, or 'N' if A.C's value is null and 'J' if the value is not null.

How would you proceed at this point?

I thought of using a when statement,

DF.withColumn('F', when(A.C.isNotNull(), 'J').otherwise('N'))

but how would you check for the existence of the column in the same statement?

First you check if the column exists. If not, you create it.

from pyspark.sql import functions as F

if "c" not in df.columns:
    df = df.withColumn("c", F.lit(None))

Then you create the column F:

df = df.withColumn('F', F.when(F.col("C").isNotNull(), 'J').otherwise('N'))
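
Putting the two steps together, here is a minimal end-to-end sketch; the sample data is an assumption mirroring the dataframe from the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample data mirroring the question; drop "C" from both lists to test the missing-column case
df = spark.createDataFrame(
    [(1, "a", "Test"), (2, "b", None), (3, "c", "Test2")],
    ["A", "B", "C"],
)

# step 1: create C as an all-null column if it does not exist
if "C" not in df.columns:
    df = df.withColumn("C", F.lit(None))

# step 2: derive F from C's null-ness
df = df.withColumn("F", F.when(F.col("C").isNotNull(), "J").otherwise("N"))
df.show()  # F is J, N, J for the three rows above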

You can check the column's presence using 'c' in data_sdf.columns. Here's an example using it.

Let's say the input dataframe has 3 columns - ['a', 'b', 'c']
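
For reference, data_sdf below could be created like this; the sample values are an assumption based on the output shown:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

# two rows, one with a null in column c, matching the output below
data_sdf = spark.createDataFrame([(1, 2, 3), (1, 2, None)], ['a', 'b', 'c'])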

data_sdf. \
    withColumn('d', 
               # if c exists, test it for null; if it does not exist,
               # the condition is always true and every row gets 'N'
               func.when(func.col('c').isNull() if 'c' in data_sdf.columns else func.lit(True), func.lit('N')).
               otherwise(func.lit('J'))
               ). \
    show()

# +---+---+----+---+
# |  a|  b|   c|  d|
# +---+---+----+---+
# |  1|  2|   3|  J|
# |  1|  2|null|  N|
# +---+---+----+---+

Now, let's say there are only 2 columns - ['a', 'b']
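
For instance, dropping c from the sample dataframe above and running the same expression again (a sketch) produces:

data_sdf = data_sdf.drop('c')

data_sdf. \
    withColumn('d', 
               func.when(func.col('c').isNull() if 'c' in data_sdf.columns else func.lit(True), func.lit('N')).
               otherwise(func.lit('J'))
               ). \
    show()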

# +---+---+---+
# |  a|  b|  d|
# +---+---+---+
# |  1|  2|  N|
# |  1|  2|  N|
# +---+---+---+
