I have created a DataFrame with the schema below. I'm trying to extract the first 10 values of "contents.monid" from each row, for which I created a UDF, 'udfTop'.
>>> df.printSchema()
|-- userid: long (nullable = true)
|-- contents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- monid: struct (nullable = true)
| | | |-- mon: string (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- count: integer (nullable = true)
>>> def take(n, data):
...     if data is None:
...         return None
...     else:
...         return data.take(n)
>>> udfTop = spark.udf.register("top_n", take)
But when I apply udfTop to the "contents" column's "monid" field (a struct type), it raises TypeError: 'NoneType' object is not callable, even though I handle null values in the UDF definition, and the column actually contains no null values.
>>> new_df = df.withColumn("mon_ids", udfTop(10, "contents.monid"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not callable
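Note that the traceback points at the call site, not at the UDF body: this is the generic Python error for calling a value that is None. A minimal reproduction outside Spark (the variable name mirrors the question; the None assignment is illustrative, standing in for whatever left udfTop unset):

```python
# If udfTop ends up bound to None, calling it fails exactly like this.
udfTop = None  # illustrative: stands in for a registration that returned None

try:
    udfTop(10, "contents.monid")
except TypeError as e:
    print(e)  # 'NoneType' object is not callable
```

So the error occurs before the take function ever runs, which is why the null handling inside it makes no difference.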
I was able to follow a similar approach in the Spark shell using Scala without any errors, but I want this to work in PySpark.
For a row in df whose 'contents' column value is:
[[Art,1111],100],[[Art,1112],110],[[Art,1113],120],[[Art,1114],130].....(100 such values)
After applying the UDF, that row's 'mon_ids' value in new_df should be:
[Art,1111],[Art,1112],[Art,1113],[Art,1114]....(10 values)
The issue turned out to be my spark.udf.register syntax: in the Spark version I was using, spark.udf.register returned None, so calling udfTop raised exactly this TypeError. Switching to the syntax below, and also changing data.take(n) to data[:n] (Python lists have no take method), resolved the issue:
udfTop = udf(take, ArrayType(IntegerType()))
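The corrected function can be sketched and checked on plain Python lists, outside Spark (the names mirror the question): slicing with data[:n] returns the first n elements, and a SQL NULL reaches a Python UDF as None.

```python
def take(n, data):
    # Spark hands SQL NULL to a Python UDF as None
    if data is None:
        return None
    # Python lists have no .take() method; a slice yields the first n items
    return data[:n]

rows = [("Art", 1111), ("Art", 1112), ("Art", 1113)]
print(take(2, rows))   # [('Art', 1111), ('Art', 1112)]
print(take(2, None))   # None
print(take(10, rows))  # shorter input is returned whole
```

Two caveats when wiring this back into Spark (both assumptions based on the schema above, not part of the original answer): the declared return type should match the element type of contents.monid, e.g. ArrayType(StructType([StructField("mon", StringType()), StructField("id", LongType())])) rather than ArrayType(IntegerType()); and a literal argument such as 10 generally needs to be wrapped as lit(10) when the UDF is called inside withColumn.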