
How do I compile and bring in multiple outputs from the same worker?

I'm developing a Kubeflow pipeline that takes in a dataset, splits it into two datasets based on a filter inside the code, and outputs both. That function looks like the following:

def merge_promo_sales(input_data: Input[Dataset],
                      output_data_hd: OutputPath("Dataset"),
                      output_data_shop: OutputPath("Dataset")):

    import pandas as pd
    pd.set_option('display.max_rows', 100)
    pd.set_option('display.max_columns', 500)
    import numpy as np
    from google.cloud import bigquery
    from utils import google_bucket

    client = bigquery.Client("gcp-sc-demand-plan-analytics")
    print("Client created using default project: {}".format(client.project), "Pulling Data")

    query = """
    SELECT * FROM `gcp-sc-demand-plan-analytics.Modeling_Input.monthly_delivery_type_sales` a
    LEFT JOIN `gcp-sc-demand-plan-analytics.Modeling_Input.monthly_promotion` b
        ON a.ship_base7 = b.item_no
        AND a.oper_cntry_id = b.corp_cd
        AND a.dmand_mo_yr = b.dates
    """

    query_job = client.query(
        query,
        # Location must match that of the dataset(s) referenced in the query.
        location="US",
    )  # API request - starts the query
    df = query_job.to_dataframe()
    df.drop(['corp_cd', 'item_no', 'dates'], axis=1, inplace=True)
    df.loc[:, 'promo_objective_increase_margin':] = df.loc[:, 'promo_objective_increase_margin':].fillna(0)
    items = df['ship_base7'].unique()
    df = df[df['ship_base7'].isin(items)]
    df_hd = df[df['location_type'] == 'home_delivery']
    df_shop = df[df['location_type'] != 'home_delivery']

    df_hd.to_pickle(output_data_hd)
    df_shop.to_pickle(output_data_shop)
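For reference, OutputPath parameters arrive in the component as plain path strings supplied by the pipeline runtime; the component just writes its artifacts to them. A minimal framework-free stand-in of that split-and-write pattern (plain Python, hypothetical sample data, not the actual BigQuery output):

```python
import os
import pickle
import tempfile

def split_and_write(rows, output_hd, output_shop):
    # Mimics the component: partition rows by location_type and
    # pickle each partition to the path the runtime provides.
    hd = [r for r in rows if r["location_type"] == "home_delivery"]
    shop = [r for r in rows if r["location_type"] != "home_delivery"]
    with open(output_hd, "wb") as f:
        pickle.dump(hd, f)
    with open(output_shop, "wb") as f:
        pickle.dump(shop, f)

rows = [{"location_type": "home_delivery", "qty": 1},
        {"location_type": "shop", "qty": 2}]
d = tempfile.mkdtemp()
split_and_write(rows,
                os.path.join(d, "hd.pkl"),
                os.path.join(d, "shop.pkl"))
```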

That part works fine. When I try to feed those two datasets into the next component while compiling the pipeline, I hit errors.

I tried the following:

@kfp.v2.dsl.pipeline(name=PIPELINE_NAME)
def my_pipeline():
    merge_promo_sales_nl = merge_promo_sales(input_data = new_launch.output)
    rule_3_hd = rule_3(input_data = merge_promo_sales_nl.output_data_hd)
    rule_3_shop = rule_3(input_data = merge_promo_sales_nl.output_data_shop)

The error I get is the following: AttributeError: 'ContainerOp' object has no attribute 'output_data_hd'

output_data_hd is the parameter name I wrote that dataset out to, but apparently it's not an attribute Kubeflow exposes on the task object.

I just figured this out.

When a component has multiple outputs, you reference them through the outputs mapping in the pipeline definition:

rule_3_hd = rule_3(input_data = merge_promo_sales_nl.outputs['output_data_hd'])
rule_3_shop = rule_3(input_data = merge_promo_sales_nl.outputs['output_data_shop'])
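To see why the attribute access fails while the dict lookup works, here is a minimal framework-free sketch (plain Python, a hypothetical FakeTask class, not KFP itself): the task object stores named outputs in an outputs mapping rather than as attributes, so dotted access raises AttributeError just like in the pipeline.

```python
# Hypothetical stand-in for a KFP task object: named outputs live
# in a dict attribute called `outputs`, not as top-level attributes.
class FakeTask:
    def __init__(self, **outputs):
        self.outputs = dict(outputs)

task = FakeTask(output_data_hd="gs://bucket/hd.pkl",
                output_data_shop="gs://bucket/shop.pkl")

print(task.outputs["output_data_hd"])  # dict lookup works

try:
    task.output_data_hd  # mirrors the pipeline error
except AttributeError as e:
    print("AttributeError:", e)
```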
