
Spark: Applying map function on Dataset<T> in Java

I have the code below, which works fine with the foreach function on a Dataset. finalJoined is a DataFrame.

    KieServices ks = KieServices.Factory.get();
    KieContainer kContainer = ks.getKieClasspathContainer();
    ClassTag<KieBase> classTagTest = scala.reflect.ClassTag$.MODULE$.apply(KieBase.class);
    Broadcast<KieBase> broadcastRules = context.broadcast(kContainer.getKieBase("rules"), classTagTest);

    Encoder<RuleParams> encoder = Encoders.bean(RuleParams.class);
    Dataset<RuleParams> ds = new Dataset<RuleParams>(sparkSession, finalJoined.logicalPlan(), encoder);
    System.out.println("Printing ruleParams DS");
    ds.show();
    ds.foreach(ruleParam -> droolprocess(broadcastRules.value(), ruleParam));

Here the foreach method returns void.

I need a Dataset<RuleParams> as the return value. Below is my droolprocess method, which calls the rule engine and updates the RuleParams objects.

    public static void droolprocess(KieBase base, RuleParams ruleParams) {
        StatelessKieSession session = base.newStatelessKieSession();
        session.execute(CommandFactory.newInsert(ruleParams));
        System.out.println("After firing  rules");
        System.out.println(ruleParams.getPriceItemParam1());
        System.out.println(ruleParams.getCisDivision());
    }

I have seen some questions on Stack Overflow and elsewhere, but I am not sure how to write a map function instead of foreach so that it returns a Dataset<RuleParams>.

Can anyone help here?

You can use map like this:

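    Dataset<RuleParams> ds = new Dataset<RuleParams>(sparkSession, finalJoined.logicalPlan(), encoder);
    // The cast to MapFunction selects the Java overload of map. The bean encoder
    // is passed (rather than a RowEncoder) because the function returns RuleParams, not Row.
    ds = ds.map((MapFunction<RuleParams, RuleParams>) ruleParams -> {
        RuleParams theRuleParams = ruleParams;
        ... // your processing
        return theRuleParams;
    }, encoder);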

The function passed to map must return the object for each row, including any fields you added, removed, or modified. The encoder given as the second argument to map tells Spark the schema of the Dataset returned by the map operation.
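
Since droolprocess in the question returns void, it cannot be used as a map function as it stands; it has to hand back the updated object. The following is a minimal sketch of that shape using plain java.util.stream as a stand-in for Spark, so it runs without a cluster. The class MapInsteadOfForeach, the simplified RuleParams bean, and the applyRules and processAll methods are all hypothetical names, and the body of applyRules is only a placeholder for the session.execute(...) call, not a real rule.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapInsteadOfForeach {

    // Hypothetical, simplified stand-in for the RuleParams bean from the question.
    public static class RuleParams {
        private String cisDivision;
        private String priceItemParam1;

        public RuleParams(String cisDivision) { this.cisDivision = cisDivision; }

        public String getCisDivision() { return cisDivision; }
        public String getPriceItemParam1() { return priceItemParam1; }
        public void setPriceItemParam1(String value) { this.priceItemParam1 = value; }
    }

    // Counterpart of droolprocess that returns the updated object instead of void.
    // The assignment below is a placeholder for firing the rules on the object.
    public static RuleParams applyRules(RuleParams params) {
        params.setPriceItemParam1("PRICE-" + params.getCisDivision());
        return params;
    }

    // map keeps the results of applyRules, whereas forEach would discard them.
    public static List<RuleParams> processAll(List<RuleParams> input) {
        return input.stream()
                .map(MapInsteadOfForeach::applyRules)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<RuleParams> out = processAll(
                Arrays.asList(new RuleParams("A"), new RuleParams("B")));
        for (RuleParams p : out) {
            System.out.println(p.getCisDivision() + " -> " + p.getPriceItemParam1());
        }
    }
}
```

In Spark, a function of the same shape would be passed to Dataset.map as a MapFunction<RuleParams, RuleParams> together with Encoders.bean(RuleParams.class), with the broadcast KieBase read inside the function.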
