I have a dictionary with information like:
dict_segs = {'key1' : {'a' : {'col1' : 'value1', 'col2' : 'value2', 'col3' : 'value3'},
                       'b' : {'col2' : 'value2', 'col3' : 'value3'},
                       'c' : {'col1' : 'value1'}},
             'key2' : {'d' : {'col3' : 'value3', 'col2' : 'value2'},
                       'f' : {'col1' : 'value1', 'col4' : 'value4'}}}
TO DO:
The keys are 'segments', and the underlying dictionaries (a, b, c for key1) are 'subsegments'. For every subsegment, the filter conditions are given in its own dictionary (for a, b, c, d, f), and the keys of those filter dictionaries are also column names of the PySpark dataframe.
I want to create one column per subsegment in the PySpark dataframe, in one go for each segment. Each subsegment column should be 1 when the row meets the filter condition and 0 otherwise, something like:
# pseudo code
for item in dict_segs:
    for sk in dict_segs[item]:
        pyspark_dataframe = pyspark_dataframe.withColumn(sk, when(<filter criteria for sk>, 1).otherwise(0))
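To make the intended mapping concrete, the nested dict flattens to one filter dict per subsegment (plain Python; this assumes subsegment names are unique across segments, as in the example):

```python
dict_segs = {'key1': {'a': {'col1': 'value1', 'col2': 'value2', 'col3': 'value3'},
                      'b': {'col2': 'value2', 'col3': 'value3'},
                      'c': {'col1': 'value1'}},
             'key2': {'d': {'col3': 'value3', 'col2': 'value2'},
                      'f': {'col1': 'value1', 'col4': 'value4'}}}

# Flatten segment -> subsegment -> filters into subsegment -> filters:
# each subsegment becomes one new 0/1 column, and the conditions in its
# filter dict are ANDed together.
filters = {sk: filt
           for seg in dict_segs.values()
           for sk, filt in seg.items()}
```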
While researching I found something similar in Scala, but the column filtering condition there is static, whereas the logic above is dynamic. Please see the Scala question below:
Spark/Scala repeated calls to withColumn() using the same function on multiple columns
I need support deriving the above logic for each segment, as in the pseudo code above.
Thanks.
You are looking for a select statement:
Let's create a sample dataframe:
df = spark.createDataFrame(
    sc.parallelize([["value" + str(i) for i in range(1, 5)],
                    ["value" + str(i) for i in range(5, 9)]]),
    ["col" + str(i) for i in range(1, 5)]
)
+------+------+------+------+
| col1| col2| col3| col4|
+------+------+------+------+
|value1|value2|value3|value4|
|value5|value6|value7|value8|
+------+------+------+------+
Now, for all the keys in the dictionary, for all the subkeys in dict_segs[key], and for all the columns in dict_segs[key][subkey]:
from functools import reduce

df.select(
    ["*"] +
    [
        reduce(
            lambda cond1, cond2: cond1 & cond2,
            [df[c] == dict_segs[k][sk][c] for c in dict_segs[k][sk]]
        ).cast("int").alias(sk)
        for k in dict_segs for sk in dict_segs[k]
    ]
).show()
+------+------+------+------+---+---+---+---+---+
| col1| col2| col3| col4| a| b| c| d| f|
+------+------+------+------+---+---+---+---+---+
|value1|value2|value3|value4| 1| 1| 1| 1| 1|
|value5|value6|value7|value8| 0| 0| 0| 0| 0|
+------+------+------+------+---+---+---+---+---+
"*" keeps all the previously existing columns; it can be replaced by df.columns. alias(sk) gives the name sk to the new column, and cast("int") converts the boolean condition into an int. I don't really understand why you have a depth-3 dictionary, though; it seems that key1 and key2 aren't really useful.
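The boolean-to-int logic of the select can be checked without Spark. A small plain-Python sketch (the rows and the subsegment_flag helper are hypothetical illustrations mirroring the sample dataframe):

```python
dict_segs = {'key1': {'a': {'col1': 'value1', 'col2': 'value2', 'col3': 'value3'},
                      'b': {'col2': 'value2', 'col3': 'value3'},
                      'c': {'col1': 'value1'}},
             'key2': {'d': {'col3': 'value3', 'col2': 'value2'},
                      'f': {'col1': 'value1', 'col4': 'value4'}}}

# The two rows of the sample dataframe, as plain dicts
rows = [
    {'col1': 'value1', 'col2': 'value2', 'col3': 'value3', 'col4': 'value4'},
    {'col1': 'value5', 'col2': 'value6', 'col3': 'value7', 'col4': 'value8'},
]

def subsegment_flag(row, filt):
    # all() mirrors the &-chained equality conditions; int() mirrors cast("int")
    return int(all(row[c] == v for c, v in filt.items()))

flags = [{sk: subsegment_flag(row, filt)
          for seg in dict_segs.values()
          for sk, filt in seg.items()}
         for row in rows]
# first row matches every filter, second row matches none
```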