KeyError in dict when applying a lambda function in PySpark

Question

I have a dataframe df with several columns. When I try the following, I get a strange error instead of the result I expect.

My idea is to generate a numeric value for each distinct value in a dataframe column, and add the pair "real_value" : "numeric_value" to a dictionary.

The global dictionary where I save the results is:

dict_res = {}

The following function takes a value and an attribute name and looks up the dictionary for that attribute (atr) in the global dictionary "dict_res". If the value already exists as a key in that dictionary, the function returns its numeric value; if not, it generates a new numeric value defined as float(len(dict_res[atr]) + 1) and stores it.

def indexMethod(value, atr):
    global dict_res
    res = float(len(dict_res[atr]) + 1)
    if value in dict_res[atr]:
        res = dict_res[atr][value]
    else:
        dict_res[atr][value] = res
    return res

The next code fragment iterates over the attributes I want to generate numeric values for. If the global dictionary "dict_res" has no dictionary for the attribute yet, one is created, and then the function above is applied through a lambda.

for column in columns_to_index:
    udf_func = UserDefinedFunction(lambda value: indexMethod(value, column), DoubleType())
    if column not in dict_res:
        dict_res[column] = {}
    col2 = udf_func(df[column])
    df = df.withColumn('newCol', col2)
    ....

So what I expect is to get the dictionary with the equivalences, as well as a new column holding those same equivalences.

After the process, I print the dict as follows:

print(dict_res)

And the result I get is:

{'ATR1': {}, 'ATR2': {}, ...}

So the dictionaries are empty. Worse, when I try to show the dataframe 'df' I get the following error:

KeyError: 'ATR1'

How is that possible if I have a dictionary with that key?

Hope you can help me...

Answer 1

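You can't update an external Python object (global or not) from inside a UDF. A UDF runs row by row in Spark's worker processes, which work on a serialized copy of the function and of the objects it references, so anything written to dict_res there never reaches the dictionary on the driver. That is why print(dict_res) shows empty dictionaries. Depending on how and when the function is serialized, the workers' copy of dict_res may not contain the 'ATR1' key at all, which would explain the KeyError.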
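
The effect can be reproduced without Spark. The sketch below is only a simulation: a pickle round trip stands in for shipping state to a worker, and the names index_method and worker_copy are made up for illustration. It does not show how Spark is implemented internally.

```python
import pickle


def index_method(value, atr, state):
    """Same logic as indexMethod in the question, with the dict passed in explicitly."""
    res = float(len(state[atr]) + 1)
    if value in state[atr]:
        res = state[atr][value]
    else:
        state[atr][value] = res
    return res


# Driver-side dictionary, as in the question.
dict_res = {"ATR1": {}}

# Stand-in for a worker: it only ever sees a deserialized copy.
worker_copy = pickle.loads(pickle.dumps(dict_res))

results = [index_method(v, "ATR1", worker_copy) for v in ["a", "b", "a"]]

print(results)      # values computed on the "worker"
print(worker_copy)  # the copy has been filled in
print(dict_res)     # the driver's dictionary is still empty for ATR1
```

The returned values are correct, but only the copy is modified; the original dict_res never sees the new keys.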

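Another way to solve the problem is to collect the distinct values on the driver with distinct(). With pandas' default to_dict(), each entry has the form {column: {row_position: value}}: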

dict_res = dict()
for column in columns_to_index:
    dict_res[column] = df.select(column).distinct().toPandas().to_dict()
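
If the goal is a value-to-number mapping like the one in the question, the distinct values can be numbered on the driver instead. This is a minimal sketch, not part of the original answer: build_index and index_columns are hypothetical helper names, and it assumes each column has few enough distinct values to collect() to the driver. The numbering follows whatever order Spark returns the rows in, which is not guaranteed to be the same between runs.

```python
def build_index(distinct_values):
    """Map each distinct value to 1.0, 2.0, ... in the order given."""
    return {value: float(i) for i, value in enumerate(distinct_values, start=1)}


def index_columns(df, columns_to_index):
    """Build {column: {value: number}} from a Spark DataFrame.

    Only select(), distinct() and collect() are called on df, so the
    dictionary is filled on the driver rather than inside a UDF.
    """
    dict_res = {}
    for column in columns_to_index:
        rows = df.select(column).distinct().collect()
        dict_res[column] = build_index(row[0] for row in rows)
    return dict_res
```

For example, build_index(['red', 'green', 'blue']) returns {'red': 1.0, 'green': 2.0, 'blue': 3.0}, and index_columns(df, columns_to_index) would return one such dictionary per column.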
