

Applying DataFrame operations to a Single row in mapWithState

I'm on Spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate to solve my problem appears to be mapWithState(), which is defined in PairDStreamFunctions. The DStream on which I am applying mapWithState() is of type DStream[Row]. Before applying mapWithState(), I do this:

dstream.map(row=> (row.get(0), row))

Now my DStream is of type Tuple2[Any, Row]. On this DStream I apply mapWithState(), and here's how my update function looks:

def stateUpdateFunction(): (Any, Option[Row], State[Map[String, Any]]) => Option[Row] = {
  (key, newData, stateData) => {
    val row = newData.get
    // Store the latest "count" and "sum" values against this key.
    val newState = Map[String, Any]("count" -> row.get(1), "sum" -> row.get(2))
    if (stateData.exists()) {
      val oldState = stateData.get()
      stateData.update(newState)
      // Append the previously stored values to the incoming row.
      Some(Row.fromSeq(row.toSeq ++ Seq(oldState("count"), oldState("sum"))))
    } else {
      stateData.update(newState)
      Some(Row.fromSeq(row.toSeq ++ Seq[Any](null, null)))
    }
  }
}
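Stripped of the Spark APIs, the state transition this function implements can be sketched in plain Scala (a minimal stand-in, with a Seq for the Row and an Option[Map] for the State cell; names here are hypothetical):

```scala
// A sketch of the update semantics without Spark: the state holds the previous
// "count" and "sum"; the output is the input row extended with those old values
// (or nulls on the first occurrence of the key).
def update(
    input: Seq[Any],                   // stands in for the incoming Row
    state: Option[Map[String, Any]]    // stands in for State[Map[String, Any]]
): (Seq[Any], Map[String, Any]) = {
  val newState = Map[String, Any]("count" -> input(1), "sum" -> input(2))
  state match {
    case Some(old) => (input ++ Seq(old("count"), old("sum")), newState)
    case None      => (input ++ Seq(null, null), newState)
  }
}

val (enriched, newState) = update(Seq("key1", 1, 10), None)
// enriched == Seq("key1", 1, 10, null, null); newState == Map("count" -> 1, "sum" -> 10)
```

On the second call for the same key, the values stored in newState come back out appended to the row, exactly as the Spark version does.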

Right now, the update function only stores two values (per key) in the Map, appends the old values stored against "count" and "sum" to the input Row, and returns it. The state Map gets updated with the values newly passed in the input Row. My requirement is to be able to do complex operations on the input Row, as we do on a DataFrame, before storing the results in the state Map. In other words, I would like to be able to do something like this:

var transformedRow = originalRow.select(concat(upper($"C0"), lit("dummy")), lower($"C1") ...)

In the update function I don't have access to a SparkContext or SparkSession, so I cannot create a single-row DataFrame. If I could do that, applying DataFrame operations would not be difficult. I have all the column expressions defined for the transformed row.
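For a concrete sense of what that select would do to one row, the same two expressions can be written as plain functions over the column values (a hand-rolled stand-in; the output column names are made up for illustration):

```scala
// Hand-applied version of concat(upper($"C0"), lit("dummy")) and lower($"C1")
// for a single row, represented here as a column-name -> value map.
def transformRow(row: Map[String, String]): Map[String, String] = Map(
  "c0_out" -> (row("C0").toUpperCase + "dummy"),
  "c1_out" -> row("C1").toLowerCase
)

transformRow(Map("C0" -> "abc", "C1" -> "DEF"))
// Map("c0_out" -> "ABCdummy", "c1_out" -> "def")
```

The problem, of course, is that hand-translating every Column expression like this doesn't scale, which is what motivates reusing the SparkPlan below.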

Here's my sequence of operations: read state -> perform complex DataFrame operations using this state on the input row -> perform more complex DataFrame operations to define new values for the state.

Is it possible to fetch the SparkPlan/logicalPlan corresponding to a DataFrame query/operation and apply it to a single spark-sql Row? I would very much appreciate any leads here. Please let me know if the question is unclear or if more details are required.

I've found a not-so-efficient solution to the given problem. Using the known DataFrame operations, we can create an empty DataFrame with the already-known schema. This DataFrame gives us the SparkPlan through

DataFrame.queryExecution.sparkPlan

This object is serializable and can be passed over to stateUpdateFunction. In stateUpdateFunction, we can iterate over the expressions contained in the passed SparkPlan, transforming each one to replace unresolved attributes with the corresponding literals:

sparkPlan.expressions.map { expr =>
  expr.transform {
    // Substitute the column's current value for each attribute reference.
    case attr: AttributeReference => Literal(map.getOrElse(attr.name, null))
    case other => other
  }
}

The map here refers to the Row's column-name/value pairs. On these transformed expressions we call eval, passing each an empty InternalRow; this gives us the result corresponding to every expression. Because this method relies on interpreted evaluation and doesn't employ code generation, it will be inefficient in a real-world use case, but I'll dig further into how code generation can be leveraged here.
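The transform-then-eval trick mirrors how Catalyst expression trees work in general. As a self-contained illustration (a toy expression tree, not Catalyst itself), substituting literals for attribute references and then evaluating with no input row looks like this:

```scala
// A toy analogue of Catalyst's Expression tree: transform rewrites the tree
// bottom-up, and eval succeeds once every Attr has been replaced by a Lit.
sealed trait Expr {
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = this match {
      case Upper(child)   => Upper(child.transform(rule))
      case Concat(l, r)   => Concat(l.transform(rule), r.transform(rule))
      case leaf           => leaf
    }
    rule.applyOrElse(rewritten, identity[Expr])
  }
  def eval(): Any = this match {
    case Lit(v)          => v
    case Upper(child)    => child.eval().toString.toUpperCase
    case Concat(l, r)    => l.eval().toString + r.eval().toString
    case Attr(name)      => sys.error(s"unresolved attribute: $name")
  }
}
case class Attr(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class Upper(child: Expr) extends Expr
case class Concat(left: Expr, right: Expr) extends Expr

// concat(upper(C0), "dummy"), with C0's value taken from the row's map:
val rowValues = Map("C0" -> "abc")
val plan: Expr = Concat(Upper(Attr("C0")), Lit("dummy"))
val resolved = plan.transform { case Attr(n) => Lit(rowValues.getOrElse(n, null)) }
// resolved.eval() now yields "ABCdummy"
```

The real Catalyst Expression API works the same way at this level: transform takes a partial function over the tree, and a fully literal-ized expression can be evaluated without any input row.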
