
SPARK N-grams & Parallelization not using mapPartitions

Problem at hand: I wrote an attempted improved bi-gram generator that works over lines, taking full stops and the like into account. The results are as wanted. It does not use mapPartitions; it is as per below.

import org.apache.spark.mllib.rdd.RDDFunctions._

val wordsRdd = sc.textFile("/FileStore/tables/natew5kh1478347610918/NGram_File.txt", 10)
val wordsRDDTextSplit = wordsRdd
  .map(line => line.trim.split(" "))
  .flatMap(x => x)
  .map(x => x.toLowerCase())
  .map(x => x.replaceAll(",{1,}", ""))     // drop commas
  .map(x => x.replaceAll("!{1,}", "."))    // runs of ! become a full stop
  .map(x => x.replaceAll("\\?{1,}", "."))  // runs of ? become a full stop
  .map(x => x.replaceAll("\\.{1,}", "."))  // collapse runs of full stops
  .map(x => x.replaceAll("\\W+", "."))     // remaining non-word characters become a full stop
  .filter(_ != ".")
  .filter(_ != "")

val x = wordsRDDTextSplit.collect() // pulls all words to the driver; the sliding window below then runs locally
val y = for (Array(a, b, _*) <- x.sliding(2).toArray) yield (a, b)
val z = y.filter(x => !(x._1 contains "."))
         .map(x => (x._1.replaceAll("\\.{1,}", ""), x._2.replaceAll("\\.{1,}", "")))

I have some questions:

  1. The results are as expected; no data is missed. But can I convert such an approach to a mapPartitions approach without losing data? Many say data would be lost, because each partition being processed holds only a subset of all the words, so the relationship at a split boundary, i.e. between the last word of one partition and the first word of the next, is missed. With a large file split I can see, from the map point of view, that this could occur as well. Correct? (See the sketch after this list.)

  2. However, if you look at the code above (no mapPartitions attempt), it always works regardless of how much I parallelize it: with 10 or 100 partitions specified, consecutive words end up on different partitions and nothing is lost. I checked this with mapPartitionsWithIndex. This I am not clear on. OK, a reduce on (x, y) => x + y is well understood.
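
As an illustration of both points, here is a minimal sketch using wordsRDDTextSplit from the code above. The mapPartitionsWithIndex call is roughly how a partition-assignment listing like the one shown below can be produced; the naive mapPartitions sliding window shows where boundary bi-grams would be lost.

// Inspect which partition each word lands in (produces output in the
// style of the "Partition Assignment" listing below).
val mapped = wordsRDDTextSplit.mapPartitionsWithIndex { (index, iter) =>
  iter.map(word => s"$word -> $index")
}
mapped.collect()

// A naive mapPartitions sliding window: each partition's iterator is
// windowed independently, so a bi-gram whose two words sit on opposite
// sides of a partition boundary is silently dropped.
val naiveBigrams = wordsRDDTextSplit.mapPartitions { iter =>
  iter.sliding(2).collect { case Seq(a, b) => (a, b) }
}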

Thanks in advance. I must be missing some elementary point in all this.

Output & Results

z: Array[(String, String)] = Array((hello,how), (how,are), (are,you), (you,today), (i,am), (am,fine), (fine,but), (but,would), (would,like), (like,to), (to,talk), (talk,to), (to,you), (you,about), (about,the), (the,cat), (he,is), (is,not), (not,doing), (doing,so), (so,well), (what,should), (should,we), (we,do), (please,help), (help,me), (hi,there), (there,ged))

mapped: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[669] at mapPartitionsWithIndex at <console>:123

Partition Assignment

res13: Array[String] = Array(hello -> 0, how -> 0, are -> 0, you -> 0, today. -> 0, i -> 0, am -> 32, fine -> 32, but -> 32, would -> 32, like -> 32, to -> 32, talk -> 60, to -> 60, you -> 60, about -> 60, the -> 60, cat. -> 60, he -> 60, is -> 60, not -> 96, doing -> 96, so -> 96, well. -> 96, what -> 96, should -> 122, we -> 122, do. -> 122, please -> 122, help -> 122, me. -> 122, hi -> 155, there -> 155, ged. -> 155)

Maybe Spark is just really smart, smarter than I thought initially. Or maybe not? I saw some material on partition preservation, some of it contradictory, imho.

Does map vs mapValues mean the former destroys partitioning, and hence rules out single-partition processing?
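
For what it's worth, the partitioner question can be checked directly; a minimal sketch (the pair RDD here is made up for illustration):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))

pairs.mapValues(_ + 1).partitioner                   // Some(HashPartitioner): keys untouched, partitioner kept
pairs.map { case (k, v) => (k, v + 1) }.partitioner  // None: map may change keys, so the partitioner is dropped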

You can use mapPartitions in place of any of the maps used to create wordsRDDTextSplit, but I don't really see any reason to. mapPartitions is most useful when you have a high initialization cost that you don't want to pay for every record in the RDD.
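
For example, something along these lines, where ExpensiveParser is a hypothetical stand-in for a resource that is costly to construct:

// Hypothetical stand-in for a costly-to-construct resource.
class ExpensiveParser {
  def parse(line: String): String = line.toLowerCase
}

val parsed = wordsRdd.mapPartitions { iter =>
  val parser = new ExpensiveParser  // built once per partition...
  iter.map(parser.parse)            // ...then reused for every record in it
}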

Whether you use map or mapPartitions to create wordsRDDTextSplit, your sliding window doesn't operate on anything distributed: it only runs once you create the local data structure x.
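
Incidentally, the import at the top of the question, org.apache.spark.mllib.rdd.RDDFunctions._, already provides a sliding that works on the RDD itself and is designed to stitch windows across partition boundaries, so the collect() step can be avoided entirely. A sketch under that assumption:

import org.apache.spark.mllib.rdd.RDDFunctions._

// sliding(2) here runs distributed and handles windows that cross
// partition boundaries, so no boundary bi-gram is lost.
val bigrams = wordsRDDTextSplit.sliding(2)
  .map { case Array(a, b) => (a, b) }
  .filter { case (a, _) => !(a contains ".") }
  .map { case (a, b) => (a.replaceAll("\\.{1,}", ""), b.replaceAll("\\.{1,}", "")) }

bigrams.collect()  // only now is anything materialized on the driver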
