Concatenate a variable list of columns into one new column in a PySpark DataFrame

I am using PySpark and I have a DataFrame df_001 which contains N columns, among them 'rec', 'id' and 'NAME'.

I want to add a new column 'unq_id' that concatenates, for example, 'rec' and 'id'. When I do it like this (with pyspark.sql.functions imported as sf), it works perfectly:

    df_f_final = df_001.withColumn('unq_id', sf.concat(sf.col('rec'), sf.lit('||'), sf.col('id')))

But I need the list of columns to concatenate to be dynamic (a list, for example). How can I do that? For example, I would create a list LL = ['rec', 'id', 'NAME'] or LL = ['rec', 'NAME'] and use it to generate the DataFrame df_f_final, concatenating the columns that are in LL.

I think it is easy, but it is driving me crazy.

Thank you for your help.

Check this out and let me know if it helps.

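    from pyspark.sql import functions as F

    # Input DataFrame df:
    # +------+------+
    # |rec_id|  name|
    # +------+------+
    # |    a1| ricky|
    # |    b1|sachin|
    # +------+------+

    LL = ['rec_id', 'name']

    # Append "||" after every column in LL, then concatenate all the pieces.
    df1 = df.withColumn("unq_id_value", F.concat(*[F.concat(F.col(c), F.lit("||")) for c in LL]))

    # Strip the trailing "||" left after the last column.
    df2 = df1.withColumn("unq_id_value", F.expr("substring(unq_id_value, 1, length(unq_id_value) - 2)"))

    df2.show()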

    # +------+------+------------+
    # |rec_id|  name|unq_id_value|
    # +------+------+------------+
    # |    a1| ricky|   a1||ricky|
    # |    b1|sachin|  b1||sachin|
    # +------+------+------------+

Thank you Loka for your answer. I finally found a solution that is similar to yours. This is what I did, and it works:

    from pyspark.sql import functions as sf
    from pyspark.sql.types import StringType

    cols = ['col1', sf.lit('||'), 'col2', sf.lit('||'), 'col3']
    unq_id = sf.udf(lambda values: "".join(values), StringType())
    df.withColumn('unqid', unq_id(sf.array(cols))).show()
