
How to use a UDF (user-defined function) on Spark Structured Streaming?

I did a little searching. This answer tells me that I can use a UDF on GroupedData; it works, and I can process the rows and columns in the GroupedData with my own function.

According to the official tutorial, groupBy() and window() operations are used to express windowed aggregations, as below.

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()

My question is whether there is a way to use a UDF on words.groupBy(window(words.timestamp, "10 minutes", "5 minutes")). Maybe something like the code below? I have tried it, but it does not work.

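from pyspark.sql.functions import pandas_udf, PandasUDFType, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType(
    [StructField("key", StringType()), StructField("avg_min", DoubleType())]
)

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def g(df):
    # whatever user-defined code
    ...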

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).apply(g)

In Spark 3 you can use applyInPandas instead, without an explicit @pandas_udf (see the documentation):

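def g(df):
    # whatever user-defined code; df is a pandas DataFrame, and the
    # return value must be a pandas DataFrame matching `schema`
    ...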

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).applyInPandas(g, schema=schema)

In this case, your function receives a pandas DataFrame and must return a pandas DataFrame.
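As a rough illustration of what such a function might look like, here is a minimal sketch that counts the rows in each (window, word) group. The names count_words, windowed_counts, and OUTPUT_SCHEMA are made up for this example, and the output schema is an assumption, not the one from the question. The PySpark import is deferred so the pandas part can be tried on its own. Whether a streaming DataFrame is accepted here may depend on your Spark version, so treat the wiring function as something to verify against your own setup.

```python
import pandas as pd

# Hypothetical output schema for this sketch: one row per (window, word) group.
OUTPUT_SCHEMA = "word string, n long"


def count_words(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receive all rows of one (window, word) group; return one summary row."""
    return pd.DataFrame({"word": [pdf["word"].iloc[0]], "n": [len(pdf)]})


def windowed_counts(words):
    """Wire count_words into the grouped DataFrame from the question.

    `words` is assumed to have the columns { timestamp: Timestamp, word: String }.
    """
    # Imported lazily so that count_words can be tested without PySpark installed.
    from pyspark.sql.functions import window

    return words.groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word,
    ).applyInPandas(count_words, schema=OUTPUT_SCHEMA)
```

Because count_words is plain pandas, it can be checked on a small hand-built DataFrame before it is handed to Spark, which makes schema mismatches easier to spot.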


 