
Row transpose with value from a second column in pySpark

I have a PySpark DataFrame with four columns (C1, C2, C3 and C4). The third column (C3) holds categorical values such as V1, V2 and V3, and the fourth column (C4) holds the corresponding numeric values. I want to add new columns V1, V2 and V3, where the value of each new column comes from C4 in the rows where C3 matches that category.

I am able to transpose the rows to columns with a UDF and DF.withColumn, but I am unable to bring the values across:

def valTocat(C3):
    if C3 == 'xyz':
        return 1
    else:
        return 0

But the following, which tries to return the value from C4, does not work:

def valTocat((C3, C4)):
    if C3 == 'xyz':
        return C4
    else:
        return 0

I was unable to post the data in tabular format, but I think it is easy to visualize.

Any suggestions would be really appreciated.

You can pivot() your DataFrame on c3:

from pyspark.sql.functions import expr

df.groupBy("c1","c2") \
 .pivot("c3") \
 .agg(expr("coalesce(first(c4))")).show()

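Note that coalesce with a single argument simply returns first(c4), so any (c1, c2) group that has no row for a given c3 value ends up with a null in that column. If you want 0 there instead, as in your UDF, call fillna(0) on the pivoted DataFrame.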
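Since the question has no sample table, here is a rough plain-Python model of what the pivot produces. It is only a sketch of the semantics, not how Spark implements it: the sample rows and the helper name pivot_first are made up for illustration, and the categories are simply sorted to fix the column order. Passing fill=None models the nulls the pivot leaves for missing categories, and fill=0 models the result after replacing them with 0.

```python
from collections import defaultdict

# Hypothetical sample rows in the shape (C1, C2, C3, C4)
rows = [
    ("a", "x", "V1", 10),
    ("a", "x", "V2", 20),
    ("b", "y", "V1", 30),
    ("b", "y", "V3", 40),
]

def pivot_first(rows, fill=None):
    """Plain-Python model of groupBy(C1, C2).pivot(C3).agg(first(C4))."""
    # Every distinct C3 value becomes an output column
    categories = sorted({c3 for _, _, c3, _ in rows})
    groups = defaultdict(dict)
    for c1, c2, c3, c4 in rows:
        # setdefault keeps the first C4 seen per category, like first()
        groups[(c1, c2)].setdefault(c3, c4)
    # One output row per (C1, C2); categories with no row get `fill`
    return [
        {"C1": c1, "C2": c2, **{cat: vals.get(cat, fill) for cat in categories}}
        for (c1, c2), vals in groups.items()
    ]

for row in pivot_first(rows, fill=0):
    print(row)
```

With these made-up rows, group (a, x) has no V3 row and group (b, y) has no V2 row, so those are the cells that come out as null (or 0 once filled).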

