
How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The dataset names are dynamic, so I cannot give the dataset names to transform_df() statically. Is there a way to dynamically pass multiple inputs into transform_df and union all of those dataframes?

I tried looping over the datasets like:

from transforms.api import transform_df, Input, Output

li = ['dataset1_path', 'dataset2_path']

union_df = None
for p in li:
  @transform_df(
    Output(p + "_output"),
    my_input=Input(p),
  )
  def my_compute_function(my_input):
    return my_input

  if union_df is None:
    union_df = my_compute_function
  else:
    union_df = union_df.union(my_compute_function)

But this doesn't generate the unioned output.

This should work for you with some changes. It is an example of a dynamic dataset built from json files; your situation may be only a little different. Here is a generalized way of handling dynamic json input datasets that should be adaptable to any dynamic input file type, or to any dataset internal to Foundry that you can specify. This generic example works on a set of json files uploaded to a dataset node in the platform, and it is fully dynamic. Doing a union after this should be a simple matter.

There's some bonus logging going on here as well.

Hope this helps

from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {## enter your dynamic mappings here ##}

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())

            # drop_duplicates returns a new DataFrame; reassign to keep the result
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)
        transforms.append(update_set)
    return transforms


TRANSFORMS = transform_generator()

So this question breaks down into two questions.

How to handle transforms with programmatic input paths

To handle transforms with programmatic inputs, it is important to remember two things:

1st - Transforms will determine your inputs and outputs at CI time. This means you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.

2nd - Your transforms will be created once, during the CI execution. This means you can't use an increment or special logic to generate different paths whenever the dataset builds.

With these two premises, as in your example or @jeremy-david-gamet's (ty for the reply, gave you a +1), you can have Python code that generates your paths at CI time.

dataset_paths = ['dataset1_path', 'dataset2_path']

for path in dataset_paths:
  @transform_df(
    Output(f"{path}_output"),
    my_input=Input(path),
  )
  def my_compute_function(my_input):
    return my_input

However, to union them you'll need a second transform to execute the union. You'll need to pass multiple inputs, so you can use *args or **kwargs for this:

import pyspark.sql
from functools import reduce
from transforms.api import transform_df, Input, Output

dataset_paths = ['dataset1_path', 'dataset2_path']

all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))

@transform_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # there are other arguments like ctx in the args list, so we need to check for type.
        # You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)

    # now that you have your dfs in a list you can union them
    # Note: this code is untested, but it should be something like this
    return reduce(lambda df1, df2: df1.union(df2), input_dfs)

How to union datasets with different schemas.

For this part there are plenty of Q&As out there on how to union different dataframes in Spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004:

from pyspark.sql.functions import lit

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
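To apply customUnion across more than two inputs, one way (a small sketch; df_a, df_b and df_c are hypothetical dataframes standing in for whatever your transform reads) is to fold it over a list with functools.reduce:

from functools import reduce

# hypothetical dataframes read from your inputs
dfs = [df_a, df_b, df_c]

# fold customUnion over the list to union any number of dataframes,
# filling columns that are missing in one dataframe with nulls
unioned = reduce(customUnion, dfs)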

Since inputs and outputs are determined at CI time, we cannot form truly dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the dataset paths share the same root, the following seems to require minimal maintenance:

from transforms.api import transform_df, Input, Output
from functools import reduce


datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}


@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df

Regarding unions of different schemas, since Spark 3.1 one can use this:

df1.unionByName(df2, allowMissingColumns=True)
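Plugged into the reduce-based transform above (reusing the kwargs dict and reduce import from that snippet), this makes the union tolerant of schema differences; it assumes Spark 3.1+ is available in your Foundry environment:

@transform_df(**kwargs)
def my_compute_function(**inputs):
    # columns missing from one dataset are filled with nulls instead of raising an error
    return reduce(
        lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True),
        inputs.values(),
    )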
