
Creating multiple columns in a Spark DataFrame dynamically

I have a dictionary with information like this:

dict_segs = {
    'key1': {'a': {'col1': 'value1', 'col2': 'value2', 'col3': 'value3'},
             'b': {'col2': 'value2', 'col3': 'value3'},
             'c': {'col1': 'value1'}},
    'key2': {'d': {'col3': 'value3', 'col2': 'value2'},
             'f': {'col1': 'value1', 'col4': 'value4'}}
}

TO DO:

The keys are basically 'segments', and the underlying dictionaries, i.e. a, b, c for key1, are 'subsegments'. For every subsegment, the filter condition is available in its underlying dictionary, i.e. a, b, c, d, f. The keys of each subsegment dictionary are also column names of the PySpark DataFrame.
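
For illustration, the hard-coded version of what I want for subsegment 'a' of 'key1' would look something like this (a sketch only, assuming pyspark_dataframe is the DataFrame the columns should be added to):

from pyspark.sql import functions as F

# a row belongs to subsegment 'a' when every condition in dict_segs['key1']['a'] holds
pyspark_dataframe = pyspark_dataframe.withColumn(
    'a',
    F.when(
        (F.col('col1') == 'value1')
        & (F.col('col2') == 'value2')
        & (F.col('col3') == 'value3'),
        1
    ).otherwise(0)
)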

I want to create the subsegment columns in the PySpark DataFrame in one go for each segment; the value of each subsegment column should be 1 when the row meets the filter condition and 0 otherwise, something like:

for item in dict_segs:
    # pseudocode: for every subsegment key in dict_segs[item].keys()
    pyspark_dataframe = pyspark_dataframe.withColumn(
        <subsegment key>,
        when(<filter criteria for that subsegment>, 1).otherwise(0)
    )

While researching I found something similar in Scala, but the column filtering condition there is static, whereas for the above logic it is dynamic. Please see the Scala question below:

Spark/Scala repeated calls to withColumn() using the same function on multiple columns

I need help deriving the above logic for each segment as per the pseudocode above.

Thanks.

You are looking for a select statement.

Let's create a sample dataframe:

# two sample rows with columns col1 .. col4
df = spark.createDataFrame(
    sc.parallelize([["value" + str(i) for i in range(1, 5)], ["value" + str(i) for i in range(5, 9)]]),
    ["col" + str(i) for i in range(1, 5)]
)

+------+------+------+------+
|  col1|  col2|  col3|  col4|
+------+------+------+------+
|value1|value2|value3|value4|
|value5|value6|value7|value8|
+------+------+------+------+

Now, for all the keys in the dictionary, for all the subkeys in dict_segs[key], and for all the columns in dict_segs[key][subkey]:

import pyspark.sql.functions as psf
df.select(
    ["*"] +
    [
        # for each subsegment sk, AND together its filter conditions,
        # then cast the boolean to int (1/0) and name the column sk
        eval('&'.join([
            '(df["' + c + '"] == "' + dict_segs[k][sk][c] + '")' for c in dict_segs[k][sk].keys()
        ])).cast("int").alias(sk)
        for k in dict_segs.keys() for sk in dict_segs[k].keys()
    ]
).show()

+------+------+------+------+---+---+---+---+---+
|  col1|  col2|  col3|  col4|  a|  b|  c|  d|  f|
+------+------+------+------+---+---+---+---+---+
|value1|value2|value3|value4|  1|  1|  1|  1|  1|
|value5|value6|value7|value8|  0|  0|  0|  0|  0|
+------+------+------+------+---+---+---+---+---+
  • "*" allows you to keep all the previously existing columns, it can be replaced by df.columns .
  • alias(sk) allows you to give name sk to the new column
  • cast("int") to change type boolean into type int

I don't really understand why you have a depth-3 dictionary, though; it seems that key1 and key2 aren't really useful.
