
Scala/Spark: different case class based on input data

In Spark/Scala, how can I do data-driven instantiation of case classes?

Explanation: Let's say we have an input dataset of contracts of some kind (e.g. telecom subscriptions) that need to be evaluated somehow. The input dataset contains values such as the date of creation, start of contract validity, end of validity, various amounts, additional options, a family discount, etc., none of which have to be filled (e.g. some contracts have no additional options).

Does it make sense to model all contract types as case classes? One input row from the dataset could be a contract for a fixed line, a mobile number, or some other service. I'd then deduce which details the input row carries and instantiate the appropriate case class using a match expression. Each of these case classes would have a function that returns the value of the contract based on this data plus some static data coming from elsewhere (a lookup table, maybe a key-value map). That function would then be used in a call to the dataset's map. Is there a better way to do this?
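A minimal sketch of this idea using plain Scala collections (on a Spark Dataset the `map` call has the same shape, with an `Encoder` in scope); all names here (`RawRow`, `FixedLine`, `Mobile`, the lookup keys) are invented for illustration:

```scala
// One flat input row; optionFee may be absent for some contracts.
case class RawRow(contractType: String, baseFee: Double, optionFee: Option[Double])

// A sealed family of contract case classes, each knowing how to value itself
// from its own fields plus static lookup data.
sealed trait Contract {
  def value(lookup: Map[String, Double]): Double
}

case class FixedLine(baseFee: Double) extends Contract {
  def value(lookup: Map[String, Double]): Double =
    baseFee * lookup.getOrElse("fixedFactor", 1.0)
}

case class Mobile(baseFee: Double, optionFee: Double) extends Contract {
  def value(lookup: Map[String, Double]): Double =
    (baseFee + optionFee) * lookup.getOrElse("mobileFactor", 1.0)
}

object Contract {
  // Data-driven factory: deduce the concrete case class from the
  // fields actually present in the row.
  def fromRow(r: RawRow): Contract = r match {
    case RawRow("mobile", base, Some(opt)) => Mobile(base, opt)
    case RawRow(_, base, _)                => FixedLine(base)
  }
}

val lookup = Map("fixedFactor" -> 1.5, "mobileFactor" -> 2.0)
val rows   = Seq(RawRow("mobile", 20.0, Some(5.0)), RawRow("fixed", 30.0, None))
// On Spark: ds.map(r => Contract.fromRow(r).value(lookup))
val values = rows.map(r => Contract.fromRow(r).value(lookup))
// values: Seq(50.0, 45.0)
```

Because the trait is sealed, the compiler can warn about match cases missed in `fromRow` as new contract types are added.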

Assuming the case class idea makes sense, each class could also run simulations on the same input data, e.g. what if the customer downgrades their internet speed — what would the estimated income for this contract be? For one input row I'd then have to return two new columns: the value of the contract and the simulated value of the contract. Doing 'what if' scenarios, it could also be that for one input row I run several scenarios (at once?), which would then return several rows (e.g. 1. what if the customer buys something more; 2. what if the customer downgrades; 3. what if the customer cancels all additional options on the contract).
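One way to sketch that fan-out, assuming a hypothetical `Subscription` case class and invented scenario names: each scenario is a named transformation, and `flatMap` turns one input row into one output row per scenario (the same shape works on a Spark Dataset):

```scala
// Hypothetical contract; value = yearly income from the subscription.
case class Subscription(monthlyFee: Double, optionsFee: Double) {
  def value: Double = 12 * (monthlyFee + optionsFee)
}

// Scenarios are data: a name plus a transformation of the contract.
// Which scenarios run could come from configuration/run options.
val scenarios: Seq[(String, Subscription => Subscription)] = Seq(
  "asIs"        -> ((s: Subscription) => s),
  "downgrade"   -> ((s: Subscription) => s.copy(monthlyFee = s.monthlyFee * 0.5)),
  "dropOptions" -> ((s: Subscription) => s.copy(optionsFee = 0.0))
)

val contracts = Seq(Subscription(50.0, 10.0))
// One input row fans out into one output row per scenario.
// On Spark: ds.flatMap { c => scenarios.map { case (n, f) => (n, f(c).value) } }
val results = contracts.flatMap { c =>
  scenarios.map { case (name, f) => (name, f(c).value) }
}
// results: Seq(("asIs",720.0), ("downgrade",420.0), ("dropOptions",600.0))
```

Keeping the scenario list as data rather than hard-coded branches means a single `flatMap` pass covers however many scenarios the run configuration asks for.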

Is this even the right approach to the problem? How can these evaluations be made 'data driven', given that the input values determine which case class to instantiate, while configuration/run options determine how many times a map over the dataset should be triggered?

Modeling a huge number of different product combinations as a class-hierarchy tree is not pragmatic.

The solution that worked is to use nested classes.

So, from one input row, columns are grouped into objects that belong together, and those objects become data members of a parent class.

I tried this on banking contracts rather than the telecom contracts used in the question: if a loan contract arrives as one row in a DataFrame, the columns of that row can be grouped into maturity information, interest information, etc. Each of these information groups gets its own class and methods, and instances of those classes become data members of the parent Loan class.

This way I could model different interest behavior, maturity behavior, etc., and invoke it inside the .map directly from the Loan object.
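A minimal sketch of that nested layout; `LoanRow`, `MaturityInfo`, `InterestInfo`, and the simple-interest formula are all invented here for illustration:

```scala
// Column groups from one flat row, each with its own behavior.
case class MaturityInfo(startYear: Int, endYear: Int) {
  def years: Int = endYear - startYear
}

case class InterestInfo(principal: Double, rate: Double) {
  // Simple-interest sketch; real behavior would vary per product.
  def yearlyInterest: Double = principal * rate
}

// The parent class composes the groups and delegates to them;
// this is the method called from .map.
case class Loan(id: String, maturity: MaturityInfo, interest: InterestInfo) {
  def estimatedIncome: Double = maturity.years * interest.yearlyInterest
}

// One flat input row as it would come from the DataFrame.
case class LoanRow(id: String, startYear: Int, endYear: Int,
                   principal: Double, rate: Double)

// Group the flat columns into the nested structure.
def fromRow(r: LoanRow): Loan =
  Loan(r.id, MaturityInfo(r.startYear, r.endYear),
       InterestInfo(r.principal, r.rate))

val rows = Seq(LoanRow("L-001", 2020, 2025, 10000.0, 0.0625))
// On Spark: ds.map(fromRow).map(_.estimatedIncome)
val incomes = rows.map(fromRow).map(_.estimatedIncome)
// incomes: Seq(3125.0)
```

Swapping in a different interest or maturity behavior then means replacing one small member class instead of growing a product-type hierarchy.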
