Compare RDD Objects - Apache Spark
I'm fairly new to Apache Spark and I'm having some problems analyzing data that I'm pulling from my files.
I have a big list of gene information, and I'm loading that information into an RDD. So far so good.
JavaRDD<Gene> inputfile = sc.textFile(logFile).map(
    new Function<String, Gene>() {
        @Override
        public Gene call(String line) throws Exception {
            String[] values = line.split("\t");
            Gene gen = null;
            // We are only interested in genes.
            // values[8] is read below, so require at least 9 columns.
            if (values.length > 8 && values[2].equalsIgnoreCase("gene") && !line.contains("#")) {
                String[] infoGene = values[8].split(";");
                String geneId = StringUtils.substringBetween(infoGene[0], "\"");
                String geneType = StringUtils.substringBetween(infoGene[2], "\"");
                String geneName = StringUtils.substringBetween(infoGene[4], "\"");
                gen = new Gene(geneName, values[3], values[4]);
                return gen;
            }
            return gen;
        }
    }
).filter(new Function<Gene, Boolean>() {
    @Override
    public Boolean call(Gene gene) throws Exception {
        // Drop the nulls produced for lines that are not genes
        return gene != null;
    }
});
The Gene class:
public class Gene implements Serializable {
    String firstBp;
    String lastBp;
    String name;

    public Gene(String name, String firstBp, String lastBp) {
        this.name = name;
        this.firstBp = firstBp;
        this.lastBp = lastBp;
    }

    public String getFirstBp() {
        return firstBp;
    }

    public String getLastBp() {
        return lastBp;
    }

    public String getName() {
        return name;
    }

    public String toString() {
        return name + " " + firstBp + " " + lastBp;
    }
}
The problem starts here: I need to check whether two genes overlap, and for that I wrote this simple utility function:
public static Boolean isOverlay(Gene gene1, Gene gene2) {
    int gene1First = Integer.parseInt(gene1.getFirstBp());
    int gene1Last = Integer.parseInt(gene1.getLastBp());
    int gene2First = Integer.parseInt(gene2.getFirstBp());
    int gene2Last = Integer.parseInt(gene2.getLastBp());
    if (gene2First >= gene1First && gene2First <= gene1Last) // FirstBp - Gene2 inside
        return true;
    else if (gene2Last >= gene1First && gene2Last <= gene1Last) // LastBp - Gene2 inside
        return true;
    else if (gene1First >= gene2First && gene1First <= gene2Last) // FirstBp - Gene1 inside
        return true;
    else if (gene1Last >= gene2First && gene1Last <= gene2Last) // LastBp - Gene1 inside
        return true;
    else
        return false;
}
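The four range checks above can probably be collapsed into a single comparison. The following is only a sketch, assuming both intervals are inclusive and that each gene's first base pair is never greater than its last. `OverlapCheck` and `overlaps` are hypothetical names, and the method takes coordinates that have already been parsed to `int`:

```java
public class OverlapCheck {
    // Two inclusive intervals [first1, last1] and [first2, last2] overlap
    // exactly when each one starts no later than the other one ends.
    // Assumption: first1 <= last1 and first2 <= last2.
    public static boolean overlaps(int first1, int last1, int first2, int last2) {
        return first1 <= last2 && first2 <= last1;
    }

    public static void main(String[] args) {
        System.out.println(overlaps(100, 200, 150, 250)); // partial overlap
        System.out.println(overlaps(100, 200, 120, 130)); // second inside first
        System.out.println(overlaps(100, 200, 201, 300)); // disjoint
    }
}
```

If the input can contain intervals whose start is greater than their end, this shortcut and the original four checks may disagree, so the original form is the safer one in that case.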
What I'm doing now, and I think it is wrong, is turning the RDD into a list:
List<Gene> genesList = inputfile.collect();
I then iterate over that list to check for overlaps and save the results to a file, which takes ages because this part doesn't use Spark.
List<OverlayPair> overlayPairList = new ArrayList<OverlayPair>();
List<String> visitedGenes = new ArrayList<String>();
for (Gene gene1 : genesList) {
    for (Gene gene2 : genesList) {
        if (gene1.getName().equalsIgnoreCase(gene2.getName()) || visitedGenes.contains(gene2.getName())) {
            continue;
        }
        if (isOverlay(gene1, gene2))
            overlayPairList.add(new OverlayPair(gene1.getName(), gene2.getName()));
    }
    visitedGenes.add(gene1.getName());
}
JavaRDD<OverlayPair> overlayFile = sc.parallelize(overlayPairList);
// Export the results to the file
String outputDirectory = "/Users/joaoalmeida/Desktop/Dissertacao/sol/data/mitocondrias/feup-pp/project/data/output/overlays";
overlayFile.coalesce(1).saveAsTextFile(outputDirectory);
The OverlayPair is basically an object holding the names of the two genes.
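The post does not show `OverlayPair` itself. A minimal sketch of what such a class might look like is given here, assuming it only has to hold the two names, be `Serializable` so Spark can move it between nodes, and override `toString()`, since `saveAsTextFile` writes the string representation of each element. The field names and the tab-separated output format are assumptions, not the author's code:

```java
import java.io.Serializable;

public class OverlayPair implements Serializable {
    // Hypothetical field names; the original class is not shown in the post.
    private final String firstGeneName;
    private final String secondGeneName;

    public OverlayPair(String firstGeneName, String secondGeneName) {
        this.firstGeneName = firstGeneName;
        this.secondGeneName = secondGeneName;
    }

    public String getFirstGeneName() {
        return firstGeneName;
    }

    public String getSecondGeneName() {
        return secondGeneName;
    }

    // One output line per pair; the tab separator is an assumed format.
    @Override
    public String toString() {
        return firstGeneName + "\t" + secondGeneName;
    }
}
```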
Is there any way to do this second part while taking advantage of Spark? The time complexity of those two for loops is too high for the amount of data I currently have.
Yes, you have to use the RDD.cartesian function to get all pairs, and then you can basically apply the function you wrote.
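As a rough illustration of that suggestion, the per-pair predicate could look like the sketch below. It is written with plain Java collections so that it runs without a cluster; the comment shows where `cartesian` and `filter` would presumably take over in Spark. `PairFilterSketch`, `keepPair`, and `overlappingPairs` are hypothetical names, the nested `Gene` is a simplified stand-in with coordinates already parsed to `int`, and the name comparison assumes gene names are unique:

```java
import java.util.ArrayList;
import java.util.List;

public class PairFilterSketch {
    // Simplified stand-in for the Gene class from the question.
    public static class Gene {
        final String name;
        final int firstBp;
        final int lastBp;

        public Gene(String name, int firstBp, int lastBp) {
            this.name = name;
            this.firstBp = firstBp;
            this.lastBp = lastBp;
        }
    }

    // Predicate applied to every (g1, g2) pair of the cartesian product.
    // The name ordering drops self-pairs and keeps only one of (A, B) / (B, A),
    // which replaces the visitedGenes bookkeeping (assumes unique names).
    public static boolean keepPair(Gene g1, Gene g2) {
        return g1.name.compareToIgnoreCase(g2.name) < 0
                && g1.firstBp <= g2.lastBp
                && g2.firstBp <= g1.lastBp;
    }

    // Local equivalent of cartesian + filter, for illustration only.
    // With Spark and Java 8 lambdas this would roughly be:
    //   inputfile.cartesian(inputfile)
    //            .filter(pair -> keepPair(pair._1(), pair._2()))
    public static List<String> overlappingPairs(List<Gene> genes) {
        List<String> result = new ArrayList<>();
        for (Gene g1 : genes) {
            for (Gene g2 : genes) {
                if (keepPair(g1, g2)) {
                    result.add(g1.name + " " + g2.name);
                }
            }
        }
        return result;
    }
}
```

Note that the cartesian product still compares every gene with every other gene, so the number of comparisons stays quadratic; the gain is that Spark can spread those comparisons across partitions instead of running them in a single driver loop.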