
User defined function to combine cuDF DataFrame columns

As per the title, I am trying to combine the row values from different cudf.DataFrame columns. The following code works for a standard pandas.DataFrame:

import pandas as pd
data = {'a': [1], 'b': [2], 'c': [3], 'd': [4]}
df = pd.DataFrame.from_dict(data)

def f(row):
    return {'dictfromcolumns': [row['a'], row['b'], row['c'], row['d']]}

df['new'] = df.apply(f, axis=1)

The equivalent code with cudf should look like:

import cudf

dfgpu = cudf.DataFrame(df)
dfgpu['new'] = dfgpu.apply(f, axis=1)

But this will throw the following ValueError exception:

ValueError: user defined function compilation failed.

Is there an alternative way to combine cudf columns? In my case I need to create a dict and store it as the value in a new column.

Thanks!

pandas allows storing arbitrary data structures inside columns (such as a dictionary of lists, in your case). cuDF does not. However, cuDF provides an explicit data type called struct, which is common in big data processing engines and may be what you want in this case.

Your UDF is failing because Numba.cuda doesn't understand the dictionary/list data structures.

The best way to do this is to first collect your data into a single column as a list (cuDF also provides an explicit list data type). You can do this by melting your data from wide to long (and adding a key column to keep track of the original rows) and then doing a groupby collect operation. Then, create the struct column.

import pandas as pd
import cudf
import numpy as np

data = {'a': [1, 10], 'b': [2, 11], 'c': [3, 12], 'd': [4, 13]}
df = pd.DataFrame.from_dict(data)

gdf = cudf.from_pandas(df)
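gdf["key"] = np.arange(len(gdf))  # key column to keep track of the original rows

# melt from wide to long format: one (key, variable, value) row per cell
melted = gdf.melt(id_vars=["key"], value_name="struct_key_name")
# collect each key's values into a list, then wrap that list column in a struct
gdf["new"] = melted.groupby("key").collect()[["struct_key_name"]].to_struct()
gdf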
    a   b   c   d   key     new
0   1   2   3   4   0   {'struct_key_name': [1, 4, 2, 3]}
1   10  11  12  13  1   {'struct_key_name': [10, 13, 11, 12]}
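
To make the data flow of the melt, groupby collect, and struct steps easier to follow, here is a minimal sketch in plain Python. It uses no cuDF calls at all and only mimics the shape of the data at each step; the names melted, collected, and new_column are illustrative choices, not cuDF internals.

```python
from collections import defaultdict

# Same toy data as in the answer: two rows, four columns.
data = {"a": [1, 10], "b": [2, 11], "c": [3, 12], "d": [4, 13]}
n_rows = len(data["a"])

# Step 1: "melt" from wide to long. Each cell becomes one
# (key, column, value) record, where key is the original row number.
melted = [
    (key, column, values[key])
    for column, values in data.items()
    for key in range(n_rows)
]

# Step 2: "groupby collect". Gather the values that share a key into a list.
collected = defaultdict(list)
for key, _column, value in melted:
    collected[key].append(value)

# Step 3: wrap each list under a single named field, like a one-field struct.
new_column = [{"struct_key_name": collected[key]} for key in range(n_rows)]
print(new_column)
```

In this sketch the values come out in column order. The cuDF output above lists them in a different order ([1, 4, 2, 3]), so it is probably safest not to rely on the order of elements within each collected list.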

Note that the struct column in cuDF is not the same as "a dictionary in a column". It's a much more efficient, explicit type meant for storing and manipulating columnar {key: value} data. cuDF provides a "struct accessor" to manipulate structs, which you can access at df[col].struct.XXX. It currently supports selecting individual fields (keys) and the explode operation. You can also carry structs around in other operations (including I/O).
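
The difference between "a dictionary in a column" and a columnar struct can be illustrated with a small plain-Python sketch. This is only an assumption-laden picture of the layout idea (one typed array per field), not cuDF's actual implementation; the field names x and y and the helpers select_field and explode are hypothetical and merely mirror the two operations mentioned above.

```python
from array import array

# Row-oriented: one Python dict object per row, roughly what a pandas
# object-dtype column holds.
rows = [{"x": 1, "y": 2.5}, {"x": 10, "y": 3.5}]

# Column-oriented: one homogeneous, typed array per field (hypothetical layout).
struct_column = {
    "x": array("q", [r["x"] for r in rows]),  # 64-bit signed integers
    "y": array("d", [r["y"] for r in rows]),  # 64-bit floats
}


def select_field(column, name):
    """Selecting a field returns a whole child array, with no per-row dict lookups."""
    return column[name]


def explode(column):
    """Exploding turns every field into its own top-level column."""
    return {name: list(values) for name, values in column.items()}


print(list(select_field(struct_column, "x")))
print(explode(struct_column))
```

Because every field is stored as its own contiguous array, selecting a field or exploding the struct works on whole columns at once instead of visiting one dict per row.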
