
Snowflake: Is it possible to pass several columns (DataFrame) into a Snowpark UDTF (Python)?

I wrote a UDTF in Snowpark/Python that receives one Column as its argument, and it works fine. Is it possible to pass several columns (i.e. a DataFrame) into a UDTF? I could not find any documentation on this.

My code below doesn't work; it raises "TypeError: 'TABLE FUNCTION' expected Column or str, got: <class 'snowflake.snowpark.dataframe.DataFrame'>".

Can anybody suggest how to do this, other than concatenating the columns into one and passing that single column into the UDTF?

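import uuid
from typing import Iterable, Tuple

from snowflake.snowpark.functions import lit, udtf
from snowflake.snowpark.types import IntegerType, StringType

# `ses` is an existing snowflake.snowpark.Session
@udtf(output_schema=["c1", "c2", "x"],
      input_types=[StringType(), StringType(), IntegerType()],
      name="udft_two_col_test",
      replace=True,
      session=ses)
class udft_two_col_test:
    def process(self, c1: str, c2: str, n: int) -> Iterable[Tuple[str, str, str]]:
        # Emit n output rows for every input row
        for i in range(n):
            yield (c1, c2, f'{n}-{c1}-{c2}')


df = ses.create_dataframe([str(uuid.uuid4()).split('-') for i in range(1, 10, 1)], schema=['c1', 'c2', 'c3', 'c4', 'c5'])
df.sort('c1', 'c2').show()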

------------------------------------------------
|"C1"      |"C2"  |"C3"  |"C4"  |"C5"          |
------------------------------------------------
|125a9845  |f7e2  |48dd  |b51c  |42ba82531fe7  |
|136da5dc  |62cb  |47c0  |98f9  |4182421e6d2b  |
|300380e2  |b365  |4d6a  |8d6b  |1092e4c24ec8  |
|3d9d9882  |0fb2  |4209  |bf11  |4341b0336946  |
|43c4147d  |1603  |4548  |ad8e  |4df50cddd682  |
|9e1024ca  |61d5  |404d  |88f8  |79393083eb30  |
|bf25e899  |5697  |4c36  |8533  |e3009c68ce9b  |
|d6dd677f  |035b  |49e7  |9236  |316741579f3c  |
|f4b83587  |26e1  |48cf  |8563  |0586ccb6602e  |
------------------------------------------------

df.join_table_function("udft_two_col_test", df["c1","c2"], lit(3)).sort('c1','c2').show(100)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
---> 17 df.join_table_function("udft_two_col_test", df["c1","c2"], lit(3)).sort('c1','c2').show(100)
...
TypeError: 'TABLE FUNCTION' expected Column or str, got: <class 'snowflake.snowpark.dataframe.DataFrame'>

Try passing the columns one by one:

# udft_two_col_test is the UDTF object created by the @udtf decorator in the question
df.join_table_function(udft_two_col_test("c1", "c2", lit(3))).show()
# or
df.join_table_function(udft_two_col_test.name, "c1", "c2", lit(3)).show()

The documentation of join_table_function includes an example like this:

df.join_table_function(split_to_table(df["addresses"], lit(" "))).show()

where df["addresses"] is a single column of the DataFrame, and lit(" ") is another (literal) column.
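To make the "one argument per column" idea concrete, here is a minimal local sketch that needs no Snowflake connection. It reuses the process() logic from the question, while `TwoColHandler` and `simulate_join_table_function` are hypothetical names introduced only for illustration and are not part of Snowpark. It only mimics how each named column reaches process() as its own positional argument, row by row.

```python
from typing import Iterable, List, Sequence, Tuple


class TwoColHandler:
    """Same process() logic as the UDTF in the question, runnable locally."""

    def process(self, c1: str, c2: str, n: int) -> Iterable[Tuple[str, str, str]]:
        for _ in range(n):
            yield (c1, c2, f"{n}-{c1}-{c2}")


def simulate_join_table_function(
    rows: Sequence[dict], handler, column_names: Sequence[str], *literals
) -> List[tuple]:
    """Hypothetical helper (not a Snowpark API): for every input row, pass
    each named column as its own positional argument, then the literals."""
    out = []
    for row in rows:
        args = [row[name] for name in column_names]  # one argument per column
        for result in handler.process(*args, *literals):
            out.append(tuple(row.values()) + result)  # input columns + UDTF output
    return out


rows = [{"c1": "125a9845", "c2": "f7e2"}, {"c1": "136da5dc", "c2": "62cb"}]
joined = simulate_join_table_function(rows, TwoColHandler(), ["c1", "c2"], 2)
for r in joined:
    print(r)
```

The handler never sees a DataFrame, only scalar values, which is consistent with the TypeError in the question. In Snowpark itself the equivalent would presumably be a list of columns unpacked with `*` into join_table_function, though that is not verified here.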

Cheers!

It is possible with UDTFs (User Defined Table Functions), which are available as of v0.7.0.

Here is an example:

from collections import Counter
from typing import Iterable, Tuple
from snowflake.snowpark.functions import lit
class MyWordCount:
    def __init__(self):
        self._total_per_partition = 0

    def process(self, s1: str) -> Iterable[Tuple[str, int]]:
        words = s1.split()
        self._total_per_partition = len(words)
        counter = Counter(words)
        yield from counter.items()

    def end_partition(self):
        yield ("partition_total", self._total_per_partition)

udtf_name = "word_count_udtf"
word_count_udtf = session.udtf.register(
    MyWordCount, ["word", "count"], name=udtf_name, is_permanent=False, replace=True)


# Call it by its name
df1 = session.table_function(udtf_name, lit("w1 w2 w2 w3 w3 w3"))
df1.show()
-----------------------------
|"WORD"           |"COUNT"  |
-----------------------------
|w1               |1        |
|w2               |2        |
|w3               |3        |
|partition_total  |6        |
-----------------------------

# Call it by the returned callable instance
df2 = session.table_function(word_count_udtf(lit("w1 w2 w2 w3 w3 w3")))
df2.show()
-----------------------------
|"WORD"           |"COUNT"  |
-----------------------------
|w1               |1        |
|w2               |2        |
|w3               |3        |
|partition_total  |6        |
-----------------------------
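The handler lifecycle in this example (process() once per input row, end_partition() once at the end) can be tried locally without a session. The following is a rough sketch under stated assumptions: `WordCountHandler` and `run_partition` are hypothetical names, not Snowpark APIs, and the total is accumulated with `+=`, a deliberate variation on the example above so that it covers every row when a partition holds more than one.

```python
from collections import Counter
from typing import Iterable, List, Tuple


class WordCountHandler:
    def __init__(self) -> None:
        self._total_per_partition = 0

    def process(self, s1: str) -> Iterable[Tuple[str, int]]:
        words = s1.split()
        # Assumption: accumulate with += so the total spans all rows seen so far
        self._total_per_partition += len(words)
        yield from Counter(words).items()

    def end_partition(self) -> Iterable[Tuple[str, int]]:
        yield ("partition_total", self._total_per_partition)


def run_partition(handler_cls, rows: List[str]) -> List[Tuple[str, int]]:
    """Hypothetical local driver: one handler instance per partition."""
    handler = handler_cls()
    out: List[Tuple[str, int]] = []
    for row in rows:
        out.extend(handler.process(row))   # called once per input row
    out.extend(handler.end_partition())    # called once after the last row
    return out


result = run_partition(WordCountHandler, ["w1 w2", "w2 w3 w3"])
print(result)
```

With a single input row, as in the session.table_function calls above, assignment and accumulation give the same total; the difference only shows up when several rows land in one partition.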

