
Flink datastream keyby using composite key

My question is very similar to "How to support multiple KeyBy in Flink", except that that question is for Java and I need the answer in Scala. I copy-pasted the provided solution into IntelliJ, which auto-converted the snippet to Scala, and I then edited it to fit my code. I still get compilation errors (IntelliJ flags the problem even before compilation). Basically, the argument provided to keyBy (a KeySelector whose getKey function returns the composite key) does not match the parameters expected by any overloaded version of keyBy.

I looked up many Scala examples of a KeySelector that returns a composite key, but did not find any.

import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.api.java.tuple.Tuple2
import org.myorg.aarna.AAPerMinData
val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy(new 
    KeySelector[AAPerMinData, Tuple2[String, String]]() {
    @throws[Exception]
    override def getKey(value: AAPerMinData): Tuple2[String, String] = 
    Tuple2.of(value.field1, value.field2)  
})

I get the following error on compiling the code:

Error:(213, 64) overloaded method value keyBy with alternatives:
[K](fun: org.myorg.aarna.AAPerMinData => K)(implicit evidence $2:org.apache.flink.api.common.typeinfo.TypeInformation[K])org.apache.flink.streaming.api.scala.KeyedStream[org.myorg.aarna.AAPerMinData,K] <and>
(firstField: String,otherFields: String*)org.apache.flink.streaming.api.scala.KeyedStream[org.myorg.aarna.AAPerMinData,org.apache.flink.api.java.tuple.Tuple] <and>
(fields: Int*)org.apache.flink.streaming.api.scala.KeyedStream[org.myorg.aarna.AAPerMinData,org.apache.flink.api.java.tuple.Tuple]
cannot be applied to (org.apache.flink.api.java.functions.KeySelector[org.myorg.aarna.AAPerMinData,org.apache.flink.api.java.tuple.Tuple2[String,String]])
val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy(new KeySelector[AAPerMinData, Tuple2[String, String]]() {

I am not sure what I am missing in the syntax that causes this error. Any help is greatly appreciated. Once this is resolved, the next step is to do a TumblingWindow-based summarization on the composite key.

Update 1 (12/29/2018): Changed the code to use a simple String field as the key using the KeySelector format (I understand that this can be done in a much simpler way; I am doing it this way just to get a basic KeySelector working).

  import org.apache.flink.api.java.functions.KeySelector
  import org.myorg.aarna.AAPerMinData
  val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy(new KeySelector[AAPerMinData, String]() {
    @throws[Exception]
    override def getKey(value: AAPerMinData): String = value.set1.sEntId
  })

Here is a screenshot of the error I get (i.e., what IntelliJ shows on mouseover). [image: enter image description here]

Update 2 (12/29/2018)

This works (for the single key case)

val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy[String](_.set1.sEntId)

This does not work (for the composite key case)

val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy([String, String)](_.set1.sEntId, _.set1.field2)

Update 3 (12/29/2018) Tried the following, could not get it to work. See error screenshot.

val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy[(String, String)]((_.set1.sEntId, _.set1.field2))

[image: enter image description here]

Update 4 (12/30/2018) Resolved now, see accepted answer. For anyone who may be interested, this is the final working code including using the composite key for aggregation:

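import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Composite key: each element is keyed by the (sEntId, field2) pair
val aa_stats_keyed_stream = aa_stats_stream_w_timestamps.keyBy[(String, String)](x => (x.set1.sEntId, x.set1.field2))

// Tumbling window of 60 seconds in event time
val aggr_keyed_stream = aa_stats_keyed_stream.window(TumblingEventTimeWindows.of(Time.seconds(60)))

// All set for window-based aggregation of a "composite keyed" stream
val aggr_stream = aggr_keyed_stream.apply {
  (key: (String, String), window: TimeWindow, events: Iterable[AAPerMinData], out: Collector[AAPerMinDataAggr]) =>
    out.collect(AAPerMinDataAggrWrapper(key._1 + key._2, // composite
      key._1, key._2, // also needed individual pieces
      window,
      events,
      stream_deferred_live_duration_in_seconds * 1000).getAAPerMinDataAggr)
}

// Print the "mapped" stream for debugging purposes
aggr_stream.print()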

First of all, while it's not necessary, go ahead and use Scala tuples. It'll make things easier overall, unless you have to interoperate with Java Tuples for some reason.
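As a rough illustration of why Scala tuples are convenient as composite keys, here is a sketch that uses plain Scala collections rather than Flink. Event and its fields are hypothetical stand-ins for AAPerMinData, and groupBy only mimics logically what keyBy does on a stream.

```scala
// Hypothetical record type standing in for AAPerMinData (not from the original post).
final case class Event(entId: String, field2: String, count: Long)

object TupleKeySketch {
  // A composite key is just a function from the element to a Scala tuple.
  val compositeKey: Event => (String, String) = e => (e.entId, e.field2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("ent1", "a", 1L),
      Event("ent1", "b", 2L),
      Event("ent1", "a", 3L)
    )

    // Tuples compare by value, so equal (entId, field2) pairs land in the same group.
    val grouped: Map[(String, String), Seq[Event]] = events.groupBy(compositeKey)

    grouped.foreach { case (key, group) =>
      println(s"$key -> ${group.map(_.count).sum}")
    }
  }
}
```

The same kind of function, element in and tuple out, is what the Scala keyBy overload accepts, which is why no KeySelector class is needed.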

And then, don't use org.apache.flink.api.java.functions.KeySelector. You want to be using this keyBy from org.apache.flink.streaming.api.scala.DataStream:

/**
 * Groups the elements of a DataStream by the given K key to
 * be used with grouped operators like grouped reduce or grouped aggregations.
 */
def keyBy[K: TypeInformation](fun: T => K): KeyedStream[T, K] = {

  val cleanFun = clean(fun)
  val keyType: TypeInformation[K] = implicitly[TypeInformation[K]]

  val keyExtractor = new KeySelector[T, K] with ResultTypeQueryable[K] {
    def getKey(in: T) = cleanFun(in)
    override def getProducedType: TypeInformation[K] = keyType
  }
  asScalaStream(new JavaKeyedStream(stream, keyExtractor, keyType))
}

In other words, just pass a function that transforms your stream elements into key values (in general, Flink's Scala API tries to be idiomatic). So something like this should do the job:

aa_stats_stream_w_timestamps.keyBy[String](value => value.set1.sEntId)

Update:

For the composite key case, use

aa_stats_stream_w_timestamps.keyBy[(String, String)](x => (x.set1.sEntId, x.set1.field2))
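To make it concrete what a 60-second tumbling window over a composite-keyed stream ends up computing, here is a batch-style sketch in plain Scala with no Flink involved. Sample, its fields, and the sum aggregation are hypothetical; the sketch also assumes non-negative millisecond timestamps and a window with no offset, so each window starts at a multiple of the window size.

```scala
// Hypothetical record type (not from the original post); timestamps are epoch milliseconds.
final case class Sample(entId: String, field2: String, timestampMs: Long, value: Long)

object TumblingCompositeSketch {
  val windowSizeMs: Long = 60000L

  // Start of the tumbling window that contains a non-negative timestamp (no offset assumed).
  def windowStart(timestampMs: Long): Long =
    timestampMs - (timestampMs % windowSizeMs)

  // Sum of values per ((entId, field2), windowStart), i.e. per composite key and window.
  def aggregate(samples: Seq[Sample]): Map[((String, String), Long), Long] =
    samples
      .groupBy(s => ((s.entId, s.field2), windowStart(s.timestampMs)))
      .map { case (keyAndWindow, group) => keyAndWindow -> group.map(_.value).sum }

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      Sample("ent1", "a", 1000L, 1L),
      Sample("ent1", "a", 59000L, 2L),
      Sample("ent1", "a", 61000L, 4L),
      Sample("ent1", "b", 1000L, 8L)
    )
    aggregate(samples).foreach { case (((entId, field2), start), total) =>
      println(s"key=($entId, $field2) windowStart=$start total=$total")
    }
  }
}
```

Unlike a real Flink job, this processes a finite collection all at once and ignores watermarks and late data; it is only meant to show how the tuple key and the window boundary together determine which elements are aggregated.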
