I have a dataframe df with some columns. I am trying to do something, and I get a strange error instead of the result I expect.
My idea is to generate a numeric value for each distinct value of a dataframe column, and add the pair "real_value": "numeric_value" to a dictionary.
The global dictionary where I save the results is:
dict_res = {}
I have the following function. Given a value and an attribute name, it looks up the dictionary for that attribute in the global dictionary "dict_res". If the value already exists as a key in that dictionary, it returns its numeric value; if not, it generates a new numeric value defined as float(len(dict_res[atr]) + 1), stores it, and returns it.
def indexMethod(value, atr):
    global dict_res
    res = float(len(dict_res[atr]) + 1)
    if value in dict_res[atr]:
        res = dict_res[atr][value]
    else:
        dict_res[atr][value] = res
    return res
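For reference, this is the behaviour I expect, sketched in plain Python without Spark. The sample values and the single 'ATR1' entry are made up for illustration, and the function is a condensed version of the one above.

```python
# Plain-Python sketch of the intended indexing (no Spark involved).
dict_res = {"ATR1": {}}

def indexMethod(value, atr):
    mapping = dict_res[atr]
    # A value seen for the first time gets the next numeric value.
    if value not in mapping:
        mapping[value] = float(len(mapping) + 1)
    return mapping[value]

sample_values = ["a", "b", "a", "c", "b"]  # hypothetical column values
indexed = [indexMethod(v, "ATR1") for v in sample_values]

print(indexed)   # [1.0, 2.0, 1.0, 3.0, 2.0]
print(dict_res)  # {'ATR1': {'a': 1.0, 'b': 2.0, 'c': 3.0}}
```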
The next code fragment iterates over the attributes I want to generate a numeric value for. If the global dictionary "dict_res" does not yet contain a dictionary for the attribute, it creates one. It then applies the function above through a lambda wrapped in a UDF.
for column in columns_to_index:
    udf_func = UserDefinedFunction(lambda value: indexMethod(value, column), DoubleType())
    if column not in dict_res:
        dict_res[column] = {}
    col2 = udf_func(df[column])
    df = df.withColumn('newCol', col2)
    ....
So what I expect is to get the dictionary filled with the equivalences, as well as a new column holding those same equivalences.
After the process, I print the dict as follows:
print(dict_res)
And the result I get is the following:
{'ATR1': {}, 'ATR2': {}, ...}
So the dictionaries are empty. But the most significant problem is that when I try to show the dataframe 'df' I get the following error:
KeyError: 'ATR1'
How is that possible if the dictionary has that key?
Hope you can help me...
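I don't think you can update an external Python object (global or not) from a UDF. A UDF only operates on rows, and it runs in worker processes on a serialized copy of whatever it references, so changes made there are not sent back to the driver.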
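A rough analogy, assuming the worker receives a serialized copy of the dictionary rather than the dictionary itself: the sketch below uses pickle to imitate that copy. It is not what Spark literally executes, and the 'ATR1' key and the sample value are only illustrative.

```python
import pickle

dict_res = {"ATR1": {}}  # lives in the driver process

# Imitate shipping the dictionary to a worker: the worker gets a copy.
worker_copy = pickle.loads(pickle.dumps(dict_res))

# The "worker" fills in its own copy, as indexMethod would.
worker_copy["ATR1"]["a"] = 1.0

print(worker_copy)  # {'ATR1': {'a': 1.0}}
print(dict_res)     # {'ATR1': {}}  (the driver's dictionary is unchanged)
```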
Another way to solve the problem is to use distinct():
dict_res = dict()
for column in columns_to_index:
    dict_res[column] = df.select(column).distinct().toPandas().to_dict()
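Note that to_dict() on the pandas frame is keyed by column name and then by row index, not by the value itself. If you want a mapping from each real value to a numeric value, the distinct values still need to be numbered. A minimal sketch, assuming the distinct values have already been collected into a Python list (for example with [row[0] for row in df.select(column).distinct().collect()]) and that they can be sorted; build_index is a hypothetical helper name, not part of Spark or pandas.

```python
def build_index(distinct_values):
    """Map each distinct value to 1.0, 2.0, ... in sorted order."""
    # Sorting makes the numbering reproducible, since distinct() does not
    # guarantee any particular order.
    ordered = sorted(distinct_values)
    return {value: float(i) for i, value in enumerate(ordered, start=1)}

# Hypothetical distinct values for one column.
mapping = build_index(["b", "c", "a"])
print(mapping)  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

Because this dictionary is built on the driver, printing it there shows the filled mapping.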