
Spark: RDD missing entries in each iteration

I am trying to implement a self-learning approach for training a classifier. I am using Spark 1.6.0. The problem is that when I map an RDD to another I get wrong counts. The same code works fine on small datasets, but on a larger dataset it goes haywire.

println("INITIAL TRAINING SET SIZE : " + trainingSetInitial.count())
for(counter <- 1 to 10){
  println("-------------------  This is the_" + counter + " run -----------------")
  println("TESTING SET SIZE : "  + testing.count())

  val lowProbabilitiesSet = testing.flatMap { item =>
    if (model.predictProbabilities(item._2)(0) <= 0.75 && model.predictProbabilities(item._2)(1) <= 0.75) {
      List(item._1)
    } else {
      None
    }}.cache()
  val highProbabilitiesSet = testing.flatMap { item =>
    if (model.predictProbabilities(item._2)(0) > 0.75 || model.predictProbabilities(item._2)(1) > 0.75 ) {
      List(item._1 +","+ model.predict(item._2).toDouble )
    } else {
      None
    }}.cache()
  println("LOW PROBAB SET : "  + lowProbabilitiesSet.count())
  println("HIGH PROBAB SET : "  + highProbabilitiesSet.count())

  trainingSetInitial = trainingSetInitial.union(highProbabilitiesSet.map(x => LabeledPoint(List(x)(0).split(",")(8).toString.toDouble, htf.transform(List(x)(0).toString.split(",")(7).split(" ") ))))
  model = NaiveBayes.train(trainingSetInitial, lambda = 1.0)
  println("NEW TRAINING SET : "  + trainingSetInitial.count())

  previousCount = lowProbabilitiesSet.count()
  testing = lowProbabilitiesSet.map { line =>
    val parts = line.split(',')
    val text = parts(7).split(' ')
    (line, htf.transform(text))
  }
  testing.checkpoint()
}

This is the log from the correct output:

INITIAL TRAINING SET SIZE : 238,182

------------------- This is the_1 run -----------------

TESTING SET SIZE : 3,158,722

LOW PROBAB SET : 22,996

HIGH PROBAB SET : 3,135,726

NEW TRAINING SET : 3,373,908

------------------- This is the_2 run -----------------

TESTING SET SIZE : 22,996

LOW PROBAB SET : 566

HIGH PROBAB SET : 22,430

NEW TRAINING SET : 3,396,338

And here is where the problem begins (large dataset input):

INITIAL TRAINING SET SIZE : 31,990,660

------------------- This is the_1 run -----------------

TESTING SET SIZE : 423,173,780

LOW PROBAB SET : 62,615,460

HIGH PROBAB SET : 360,558,320

NEW TRAINING SET : 395,265,857

------------------- This is the_2 run -----------------

TESTING SET SIZE : 52,673,986

LOW PROBAB SET : 51,460,875

HIGH PROBAB SET : 1,213,111

NEW TRAINING SET : 401,950,263

The 'LOW PROBAB SET' of the first iteration should become the 'TESTING SET' of the second iteration, yet 62,615,460 entries shrink to 52,673,986: somewhere, roughly 10 million entries disappear. Likewise, the 'NEW TRAINING SET' after the first iteration should be the union of the 'INITIAL TRAINING SET' and the 'HIGH PROBAB SET' (31,990,660 + 360,558,320 = 392,548,980), but the log shows 395,265,857. Again, the numbers don't match.

I did not get any errors while the code was running. I tried caching each set and unpersisting it at the end of each iteration (HIGH and LOW sets only), but got the same results. I also tried checkpointing the sets; that didn't work either. Why is this happening?
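For context, checkpoint() throws unless a checkpoint directory has been set on the SparkContext first, and the checkpoint data is only written when the next action runs. A minimal, self-contained sketch of the pattern (the HDFS path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
// checkpoint() requires a checkpoint directory; this path is hypothetical
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()
rdd.checkpoint()     // marks the RDD for checkpointing; nothing is written yet
val n = rdd.count()  // the first action materializes the cache and writes the checkpoint
println("count = " + n + ", checkpointed = " + rdd.isCheckpointed)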

EDIT

Just for testing, I stopped creating a new model inside the loop, to see what happens:

for (counter <- 1 to 5) {
  println("-------------------  This is the_" + counter + " run !!! -----------------")
  var updated_trainCnt = temp_train.count()
  var updated_testCnt = test_set.count()
  println("Updated Train SET SIZE: " + updated_trainCnt)
  println("Updated Testing SET SIZE: " + updated_testCnt)

  // Confident predictions: either class probability above 0.75
  val highProbabilitiesSet = test_set.filter { item =>
    val output = model.predictProbabilities(item._2)
    output(0) > 0.75 || output(1) > 0.75
  }.map(item => (item._1 + "," + model.predict(item._2), item._2)).cache()

  // Everything else stays in the testing set for the next iteration
  test_set = test_set.filter { item =>
    val output = model.predictProbabilities(item._2)
    output(0) <= 0.75 && output(1) <= 0.75
  }.map(item => (item._1, item._2)).cache()

  var hiCnt = highProbabilitiesSet.count()
  var lowCnt = test_set.count()
  println("HIGH PROBAB SET : " + hiCnt)
  println("LOW PROBAB SET  : " + lowCnt)

  // Sanity check: the two filters must partition the testing set exactly
  var diff = updated_testCnt - hiCnt - lowCnt
  if (diff != 0) println("ERROR: Test set not correctly split into high/low: " + diff)

  temp_train = temp_train.union(highProbabilitiesSet.map(x =>
    LabeledPoint(x._1.toString.split(",")(8).toDouble, x._2))).cache()
  println("NEW TRAINING SET: " + temp_train.count())
//  model = NaiveBayes.train(temp_train, lambda = 1.0, modelType = "multinomial")
  println("HIGH PROBAB SET : " + highProbabilitiesSet.count())
  println("LOW PROBAB SET  : " + test_set.count())
  println("NEW TRAINING SET: " + temp_train.count())
}

The numbers produced with the original model were fine; even the union of the RDDs went through without an issue. But the big question remains: how does the classification model mess up the training set (lowProbabilitiesSet) when it never modifies it at the end of each loop (or any of the other RDDs)?

The console logs and the Spark logs show no errors and no executor crashes. How does the classification training process corrupt my data?

Even though I still haven't figured out why this is happening, as a hack I flushed the RDDs to HDFS and made a bash script that runs the class iteratively, reading the data from HDFS each time. As far as I can tell, the problem appears when I train the classifier inside the loop.
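Since writing to HDFS and re-reading makes the numbers stable, one possible reading (an assumption, not something the logs confirm) is that cached partitions were being evicted and recomputed from lineage after model had been reassigned, so later counts re-split the data against a newer model. Severing the lineage in-process with saveAsObjectFile/objectFile should have the same effect as the bash-script hack; the helper and paths below are hypothetical:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: write an RDD to stable storage and read it back, so
// later recomputations cannot replay the old lineage against a newer model.
def materialize[T: ClassTag](rdd: RDD[T], path: String): RDD[T] = {
  rdd.saveAsObjectFile(path)            // stable copy on HDFS
  rdd.sparkContext.objectFile[T](path)  // fresh RDD with no upstream lineage
}

// Inside the loop, before retraining:
// testing = materialize(testing, "hdfs:///tmp/selftrain/testing_" + counter)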

I don't see the problem right away. Please minimise the code to the actual problem. The first thing I would suggest is to rewrite the flatMap operations as a filter:

val highProbabilitiesSet = testing.flatMap { item =>
  if (model.predictProbabilities(item._2)(0) > 0.75 || model.predictProbabilities(item._2)(1) > 0.75) {
    List(item._1 + "," + model.predict(item._2).toDouble)
  } else {
    None
  }
}.cache()

To:

val highProbabilitiesSet = testing.filter { item => 
  val output = model.predictProbabilities(item._2)
  output(0) > 0.75 || output(1) > 0.75
}.map(item => (item._1, model.predict(item._2).toDouble)).cache()
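
The lowProbabilitiesSet flatMap can be rewritten the same way; a sketch following the same pattern, which also avoids calling predictProbabilities twice per record:

val lowProbabilitiesSet = testing.filter { item =>
  val output = model.predictProbabilities(item._2)
  output(0) <= 0.75 && output(1) <= 0.75
}.map(item => item._1).cache()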
