
How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?

I tried looping over the datasets like:

li = ['dataset1_path', 'dataset2_path']

union_df = None
for p in li:
  @transforms_df(
    my_input = Input(p), 
    Output(p+"_output")
  )
  def my_compute_function(my_input):
    return my_input

  if union_df is None:
    union_df = my_compute_function
  else:
    union_df = union_df.union(my_compute_function)

But this doesn't generate the unioned output.

This should work for you with some changes. It's an example of dynamic input datasets built from JSON files, so your situation may differ only slightly. Below is a generalized way of handling dynamic JSON input datasets that should be adaptable to any dynamic input file type, or to any internal Foundry dataset you can specify. This generic example works on a set of JSON files uploaded to a dataset node in the platform and is fully dynamic. Doing a union after the generated transforms run should be a simple matter; a sketch follows the code below.

There's some bonus logging going on here as well.

Hope this helps

from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {}  # enter your dynamic mappings here

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())

            df_2 = df_2.drop_duplicates()  # drop_duplicates returns a new dataframe
            out.write_dataframe(df_2)
        transforms.append(update_set)
    return transforms


TRANSFORMS = transform_generator()
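
For the union step mentioned above, a follow-up transform can take the datasets produced by the generated transforms as inputs and union them. Here is a minimal sketch, assuming two hypothetical output paths; the paths and names are placeholders, not the actual ones from the example above:

from functools import reduce

from transforms.api import Input, Output, transform_df

# hypothetical paths of the outputs produced by the generated transforms above
generated_outputs = ['examples/output_a', 'examples/output_b']
union_inputs = {f'df_{i}': Input(p) for i, p in enumerate(generated_outputs)}


@transform_df(Output('examples/unioned_output'), **union_inputs)
def union_generated(**dfs):
    # union all of the generated dataframes by column name
    return reduce(lambda left, right: left.unionByName(right), dfs.values())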

So this question breaks down into two questions.

How to handle transforms with programmatic input paths

To handle transforms with programmatic inputs, it is important to remember two things:

1st - Transforms determine your inputs and outputs at CI time. This means you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.

2nd - Your transforms are created once, during the CI execution, meaning you can't use an increment or special logic to generate different paths whenever the dataset builds.

With these two premises, as in your example or @jeremy-david-gamet's (thanks for the reply, gave you a +1), you can have Python code that generates your paths at CI time.

dataset_paths = ['dataset1_path', 'dataset2_path']

for path in dataset_paths:
  @transform_df(
    Output(f"{path}_output"),
    my_input=Input(path),
  )
  def my_compute_function(my_input):
    return my_input

However, to union them you'll need a second transform to execute the union. You'll need to pass multiple inputs, so you can use **kwargs for this:

from functools import reduce

from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

dataset_paths = ['dataset1_path', 'dataset2_path']

# inputs have to be passed as keyword arguments, so build a dict and unpack it
all_inputs = {f"df_{i}": Input(path) for i, path in enumerate(dataset_paths)}

@transform_df(Output("path/to/unioned_dataset"), **all_inputs)
def my_compute_function(**kwargs):
    # there can be non-dataframe entries (e.g. ctx) in kwargs, so check the type
    input_dfs = [v for v in kwargs.values() if isinstance(v, DataFrame)]

    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    return reduce(lambda df1, df2: df1.union(df2), input_dfs)

How to union datasets with different schemas.

For this part there are plenty of Q&As out there on how to union dataframes with different schemas in Spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004

from pyspark.sql.functions import lit

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
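
As a usage sketch (the dataframes here are made up for illustration, and spark is assumed to be an existing SparkSession, e.g. ctx.spark_session inside a Foundry transform), customUnion can be folded over a list of dataframes with mismatched schemas:

from functools import reduce

# illustrative dataframes with different columns
df_a = spark.createDataFrame([(1, "x")], ["id", "name"])
df_b = spark.createDataFrame([(2, 3.5)], ["id", "score"])
df_c = spark.createDataFrame([(3,)], ["id"])

# columns missing from a given dataframe come back as nulls
combined = reduce(customUnion, [df_a, df_b, df_c])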

Since inputs and outputs are determined at CI time, we cannot have truly dynamic inputs. We will have to point to specific datasets in the code somehow. Assuming the dataset paths share the same root, the following seems to require minimal maintenance:

from transforms.api import transform_df, Input, Output
from functools import reduce


datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
    'output': Output('output/folder/path/unioned_dataset'),
    **inputs,
}


@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df

Regarding unions of different schemas, since Spark 3.1 one can use this:

df1.unionByName(df2, allowMissingColumns=True)
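
For illustration (the dataframes and the spark session here are made up), columns missing on either side are filled with nulls:

df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, 9.5)], ["id", "score"])

# result has columns id, name, score; values absent on either side are null
unioned = df1.unionByName(df2, allowMissingColumns=True)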
