
Scala/Spark: different case classes based on input data

Original 2018-03-04 13:26:02 5 1 scala / apache-spark

In Spark/Scala, how can case classes be instantiated in a data-driven way?

Explanation: Let's say we have an input dataset of some kind of contracts (e.g. telecom subscriptions), and those contracts need to be evaluated somehow. The input dataset contains values like the date of creation, the start of the contract's validity, the end of validity, some amounts, additional options, family discount, etc., not all of which have to be filled in (e.g. some contracts don't have additional options).

Does it make sense to model all types of contracts using case classes? One input row coming from the dataset could be a contract for a fixed line, a mobile number, or some other service. I would then try to deduce as much detail as the input row provides and instantiate the appropriate case class using match. Each of these case classes would have a function that returns the value of the contract based on this data and some static data coming from elsewhere (a lookup table, maybe a key-value map). This function would then be used in a call to the dataset's 'map'. Is there a better way to do this?
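A minimal sketch of the idea described above, written against plain Scala types with no Spark dependency. All names here (`InputRow`, `FixedLine`, `Mobile`, `fromRow`, `evaluate`, the `"fixed"`/`"mobile"` encoding and the pricing formulas) are hypothetical and only illustrate the pattern of matching on row content to pick a case class:

```scala
object ContractSketch {
  // Hypothetical flat input row; optional columns are modelled as Option.
  final case class InputRow(
    service: String,             // assumed encoding: "fixed" or "mobile"
    monthlyFee: Double,
    months: Int,
    extraOption: Option[String]  // not every contract has an additional option
  )

  sealed trait Contract {
    // Value of the contract given static lookup data (option name -> monthly price).
    def value(optionPrices: Map[String, Double]): Double
  }

  final case class FixedLine(monthlyFee: Double, months: Int) extends Contract {
    def value(optionPrices: Map[String, Double]): Double = monthlyFee * months
  }

  final case class Mobile(monthlyFee: Double, months: Int, extraOption: Option[String])
      extends Contract {
    def value(optionPrices: Map[String, Double]): Double = {
      // Options missing from the lookup table are priced at zero (an assumption).
      val optionFee = extraOption.flatMap(optionPrices.get).getOrElse(0.0)
      (monthlyFee + optionFee) * months
    }
  }

  // Data-driven instantiation: the row content decides which case class is built.
  def fromRow(row: InputRow): Option[Contract] = row.service match {
    case "fixed"  => Some(FixedLine(row.monthlyFee, row.months))
    case "mobile" => Some(Mobile(row.monthlyFee, row.months, row.extraOption))
    case _        => None // unknown service type
  }

  // The per-row function that would be handed to a 'map' over the input rows.
  def evaluate(row: InputRow, optionPrices: Map[String, Double]): Option[Double] =
    fromRow(row).map(_.value(optionPrices))
}
```

On an ordinary collection this would be used as `rows.map(r => ContractSketch.evaluate(r, prices))`; whether the same function can be passed unchanged to a Spark Dataset depends on the encoders available for the row and result types.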

Assuming the case class idea makes sense, each class could also run simulations on the same input data. E.g. if a customer downgrades his internet speed, what would the estimated income for this contract then be? So for one input row, I'd have to return 2 new columns: the value of the contract and the simulated value of the contract. With 'what if' scenarios, it could also be that for one input row I run several scenarios (at once?), which would then return several rows (e.g. 1. what if the customer buys something more; 2. what if the customer downgrades; 3. what if the customer cancels all additional options on the contract).
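The 'what if' part could be sketched as follows, again in plain Scala. The `Contract` shape, the `Scenario` wrapper, the two example scenarios and the 50% downgrade factor are all assumptions made for illustration. Each scenario is a named transformation of a contract, and one input contract produces one result row per scenario, each carrying both the actual and the simulated value:

```scala
object ScenarioSketch {
  // Hypothetical, simplified contract: a base monthly fee plus optional extras.
  final case class Contract(id: String, monthlyFee: Double, months: Int, optionFees: List[Double]) {
    def value: Double = (monthlyFee + optionFees.sum) * months
  }

  // A scenario is a named transformation of a contract.
  final case class Scenario(name: String, transform: Contract => Contract)

  // Example scenarios; the 50% downgrade factor is an arbitrary assumption.
  val downgrade: Scenario =
    Scenario("downgrade", c => c.copy(monthlyFee = c.monthlyFee * 0.5))
  val cancelExtras: Scenario =
    Scenario("cancel-options", c => c.copy(optionFees = Nil))

  // One output row: the two new columns plus identifying information.
  final case class Result(contractId: String, scenario: String, value: Double, simulatedValue: Double)

  // One input contract yields one output row per configured scenario.
  def simulate(c: Contract, scenarios: Seq[Scenario]): Seq[Result] =
    scenarios.map(s => Result(c.id, s.name, c.value, s.transform(c).value))
}
```

With this shape, `contracts.flatMap(c => ScenarioSketch.simulate(c, scenarios))` returns several rows per input row in a single pass, and the list of scenarios can be built from configuration or run options rather than by triggering a separate 'map' per scenario.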

Is this even the right approach to the problem? How can these evaluations be made 'data driven', given that input values determine which case class is used, and configuration/run options determine how many times a 'map' on the dataset should be triggered?

1 answer

Modeling a huge number of different product combinations as a class hierarchy tree is not pragmatic.

The solution that worked is to use nested classes.

So, from one input row, columns are grouped into different objects that make sense together, and those objects become data members of the parent class.

I've tried this on banking contracts instead of telecom contracts (as used in the question): if a loan contract is delivered as one row in a dataframe, the columns of that row can be grouped into maturity information, interest information, etc. Each of these information groups has its own class and methods. Instances of these classes become data members of the parent Loan class.

This way I could model different interest behavior, maturity behavior, etc., and just call it in the .map from the Loan object itself.
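A rough sketch of that nested layout, in plain Scala. The column names, the `LoanRow` type, the simple-interest formula and the long-term threshold are assumptions for illustration, not the actual model that was used:

```scala
object LoanSketch {
  // Hypothetical flat row as it might arrive from the dataframe.
  final case class LoanRow(id: String, principal: Double, annualRate: Double, termYears: Int)

  // Each group of columns gets its own class with its own behaviour.
  final case class InterestInfo(annualRate: Double) {
    // Simple (non-compounding) interest, chosen only to keep the sketch short.
    def yearlyInterest(principal: Double): Double = principal * annualRate
  }

  final case class MaturityInfo(termYears: Int) {
    def isLongTerm: Boolean = termYears > 5 // threshold is an arbitrary assumption
  }

  // The parent class holds the groups as data members and combines them.
  final case class Loan(id: String, principal: Double, interest: InterestInfo, maturity: MaturityInfo) {
    def totalInterest: Double = interest.yearlyInterest(principal) * maturity.termYears
  }

  // Group the flat columns of one row into the nested objects.
  def toLoan(r: LoanRow): Loan =
    Loan(r.id, r.principal, InterestInfo(r.annualRate), MaturityInfo(r.termYears))
}
```

The evaluation then reduces to something like `rows.map(LoanSketch.toLoan).map(_.totalInterest)`, and a different interest or maturity behavior only requires changing the corresponding nested class.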