Parallel FP Growth in Spark

I am trying to understand the "add" and "extract" methods of the FPTree class (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala).

  1. What is the purpose of the 'summaries' variable?
  2. Where is the group list? I assume it is the following; am I correct?

  val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
  val partitioner = new HashPartitioner(numParts)

  3. What will 'summaries' contain for the 3 transactions {a,b,c}, {a,b}, {b,c}, where all items are frequent?

  def add(t: Iterable[T], count: Long = 1L): FPTree[T] = {
    require(count > 0)
    var curr = root
    curr.count += count
    t.foreach { item =>
      val summary = summaries.getOrElseUpdate(item, new Summary)
      summary.count += count
      val child = curr.children.getOrElseUpdate(item, {
        val newNode = new Node(curr)
        newNode.item = item
        summary.nodes += newNode
        newNode
      })
      child.count += count
      curr = child
    }
    this
  }

  def extract(
      minCount: Long,
      validateSuffix: T => Boolean = _ => true): Iterator[(List[T], Long)] = {
    summaries.iterator.flatMap { case (item, summary) =>
      if (validateSuffix(item) && summary.count >= minCount) {
        Iterator.single((item :: Nil, summary.count)) ++
          project(item).extract(minCount).map { case (t, c) =>
            (item :: t, c)
          }
      } else {
        Iterator.empty
      }
    }
  }

After a bit of experimenting, it turns out to be pretty straightforward:

1+2) The partition is indeed the group representative. It is also how the conditional transactions are calculated:

  private def genCondTransactions[Item: ClassTag](
      transaction: Array[Item],
      itemToRank: Map[Item, Int],
      partitioner: Partitioner): mutable.Map[Int, Array[Int]] = {
    val output = mutable.Map.empty[Int, Array[Int]]
    // Filter the basket by frequent items pattern and sort their ranks.
    val filtered = transaction.flatMap(itemToRank.get)
    ju.Arrays.sort(filtered)
    val n = filtered.length
    var i = n - 1
    while (i >= 0) {
      val item = filtered(i)
      val part = partitioner.getPartition(item)
      if (!output.contains(part)) {
        output(part) = filtered.slice(0, i + 1)
      }
      i -= 1
    }
    output
  }
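
To see what this loop produces, here is a Spark-free sketch of the same logic. It assumes ranks b=0, a=1, c=2 for the transactions in the question (the tie between a and c is broken arbitrarily) and replaces the partitioner with `rank % numParts`, which is what a HashPartitioner computes for non-negative Int keys. The object name `CondTransactionsSketch` and the `numParts` parameter are made up for illustration.

```scala
import scala.collection.mutable

object CondTransactionsSketch {
  // Hypothetical stand-in for genCondTransactions: same loop, but the
  // partitioner is replaced by "rank modulo number of groups" (assumption).
  def genCondTransactions(
      transaction: Array[String],
      itemToRank: Map[String, Int],
      numParts: Int): mutable.Map[Int, Array[Int]] = {
    val output = mutable.Map.empty[Int, Array[Int]]
    // Keep only frequent items, replace them by their rank, sort ascending.
    val filtered = transaction.flatMap(item => itemToRank.get(item).toList).sorted
    var i = filtered.length - 1
    // Walk from the least frequent item back to the most frequent one.
    while (i >= 0) {
      val part = filtered(i) % numParts
      // The first time a group is seen, it receives the prefix ending at i.
      if (!output.contains(part)) {
        output(part) = filtered.slice(0, i + 1)
      }
      i -= 1
    }
    output
  }
}
```

With two groups, {a,b,c} becomes the ranks [0, 1, 2]. Group 0 owns ranks 0 and 2 and receives [0, 1, 2]; group 1 owns rank 1 and receives the prefix [0, 1]. Each group gets only the prefix ending at the last item it owns, so a transaction is sent at most once per group.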
3) The summaries map is just a helper that stores the count of each item in the transactions. extract/project generate the frequent itemsets (FIS) by recursing up and down through dependent FP-trees (built by project), checking the summaries to decide whether a path needs to be traversed. The summaries of node 'a' will have {b:2, c:1}, and the children of node 'a' are 'b' and 'c'.
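
To check point 3 against the code quoted in the question, the sketch below is a stripped-down stand-in for the FPTree class. `add` follows the question's code, `extract` drops the `validateSuffix` argument, and `project` is an approximation of what the Spark source does and should be treated as an assumption. The names `MiniFPTree` and `Tree`, the use of String items, and inserting items in the order written (rather than sorted by rank) are also assumptions made for illustration.

```scala
import scala.collection.mutable

object MiniFPTree {
  final class Node(val parent: Node, val item: String) {
    var count: Long = 0L
    val children = mutable.Map.empty[String, Node]
    def isRoot: Boolean = parent == null
  }

  // One Summary per item: total count plus every node carrying that item.
  final class Summary {
    var count: Long = 0L
    val nodes = mutable.ListBuffer.empty[Node]
  }

  final class Tree {
    val root = new Node(null, null)
    val summaries = mutable.LinkedHashMap.empty[String, Summary]

    // Insert one transaction, bumping counts along its path.
    def add(t: Iterable[String], count: Long = 1L): Tree = {
      require(count > 0)
      var curr = root
      curr.count += count
      t.foreach { item =>
        val summary = summaries.getOrElseUpdate(item, new Summary)
        summary.count += count
        val child = curr.children.getOrElseUpdate(item, {
          val newNode = new Node(curr, item)
          summary.nodes += newNode
          newNode
        })
        child.count += count
        curr = child
      }
      this
    }

    // Conditional tree for `suffix`: the root-to-parent path of every
    // node that carries `suffix`, weighted by that node's count.
    def project(suffix: String): Tree = {
      val tree = new Tree
      summaries.get(suffix).foreach { summary =>
        summary.nodes.foreach { node =>
          var path = List.empty[String]
          var curr = node.parent
          while (!curr.isRoot) {
            path = curr.item :: path
            curr = curr.parent
          }
          tree.add(path, node.count)
        }
      }
      tree
    }

    // Emit each frequent item, then recurse into its conditional tree.
    def extract(minCount: Long): Iterator[(List[String], Long)] =
      summaries.iterator.flatMap { case (item, summary) =>
        if (summary.count >= minCount) {
          Iterator.single((item :: Nil, summary.count)) ++
            project(item).extract(minCount).map { case (t, c) => (item :: t, c) }
        } else {
          Iterator.empty
        }
      }
  }
}
```

In this sketch, summaries is a single map per tree, keyed by item. After adding {a,b,c}, {a,b}, {b,c}, the summary counts are a: 2, b: 3, c: 2, and with this insertion order node 'a' has the single child 'b', with 'c' one level further down. Calling extract(2) yields {a}: 2, {b}: 3, {c}: 2, {b,a}: 2 and {c,b}: 2.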
