How to evaluate MinHashLSH in Spark with Scala?

I have a dataset of academic papers with 27770 papers (nodes), and another file (the graph file) containing the original edges, 352807 entries long. I want to compute MinHashLSH to find similar documents and predict links between two nodes. Below you can see my attempt at implementing this in Spark with Scala. The problem I am facing is that I don't know how to evaluate the results!

def main(args: Array[String]): Unit = {
println("MinHash LSH")
Logger.getLogger("org").setLevel(Level.ERROR) // show only errors


val ss = SparkSession.builder().master("local[*]").appName("neighbors").getOrCreate()
val sc = ss.sparkContext

val inputFile = "resources/data/node_information.csv"

println("reading from input file: " + inputFile)
println

val schemaStruct = StructType(
  StructField("id", IntegerType) ::
    StructField("pubYear", StringType) ::
    StructField("title", StringType) ::
    StructField("authors", StringType) ::
    StructField("journal", StringType) ::
    StructField("abstract", StringType) :: Nil
)

// Read the contents of the csv file into a dataframe.
var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile)

import ss.implicits._
// Read the original graph edges (ground truth)
val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
  val fields = line.split("\t")
  (fields(0), fields(1))
}).toDF("nodeA_id", "nodeB_id")

println("Original graph edges count: " + originalGraphDF.count())
originalGraphDF.printSchema()
originalGraphDF.show(5)

val t1 = System.nanoTime // Start point of the app

val nullAuthor = "NO AUTHORS"
val nullJournal = "NO JOURNAL"
val nullAbstract = "NO ABSTRACT"

papers = papers.na.fill(nullAuthor, Seq("authors"))
papers = papers.na.fill(nullJournal, Seq("journal"))
papers = papers.na.fill(nullAbstract, Seq("abstract"))

papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
papers.show()

papers = papers.na.drop()
val fraction = 0.1

papers = papers.sample(fraction, 12345L)
//    println(papers.count())

//TOKENIZE

val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")

//REMOVE STOPWORDS

val rTitle = new StopWordsRemover().setInputCol("title_words").setOutputCol("title_words_f")
val rAuthors = new StopWordsRemover().setInputCol("authors_words").setOutputCol("authors_words_f")
val rJournal = new StopWordsRemover().setInputCol("journal_words").setOutputCol("journal_words_f")
val rAbstract = new StopWordsRemover().setInputCol("abstract_words").setOutputCol("abstract_words_f")

println("Setting pipeline stages...")
val stages = Array(
  tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract,
  rTitle, rAuthors, rJournal, rAbstract
)

val pipeline = new Pipeline()
pipeline.setStages(stages)

println("Transforming dataframe")
val model = pipeline.fit(papers)
papers = model.transform(papers)

papers.show(5)

//newDf = node df
val newDf = papers.select("id", "pubYear_words", "title_words_f", "authors_words_f", "journal_words_f", "abstract_words_f")
newDf.show(5)
newDf.describe().show()

// `join` is a helper defined elsewhere (not shown); it is expected to merge the five token lists into a single Seq[String]
val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))

val joinedDf = newDf.withColumn(
  "paper_data",
  udf_join_cols(
    newDf("pubYear_words"),
    newDf("title_words_f"),
    newDf("authors_words_f"),
    newDf("journal_words_f"),
    newDf("abstract_words_f")
  )
).select("id", "paper_data")

joinedDf.show(5)
joinedDf.printSchema()
println(joinedDf.count())

// Convert the token lists of each paper into count vectors
val vocabSize = 1000000
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize)
  .setMinDF(10).fit(joinedDf)

val vectorizedDf = cvModel.transform(joinedDf).select(col("id"), col("features"))
vectorizedDf.show()
println("Total entries: "+vectorizedDf.count())

val mh = new MinHashLSH().setNumHashTables(3)
  .setInputCol("features").setOutputCol("hashValues")
val mhModel = mh.fit(vectorizedDf)

mhModel.transform(vectorizedDf).show()

// Approximate self-join on the MinHash signatures
val threshold = 0.95

val predictionsDF = mhModel.approxSimilarityJoin(vectorizedDf, vectorizedDf, 1, "JaccardDistance")
  .select(col("datasetA.id").as("nodeA_id"), col("datasetB.id").as("nodeB_id"), col("JaccardDistance"))
  .filter("JaccardDistance >= " + threshold)

predictionsDF.show()
predictionsDF.printSchema()
println("Total edges found: " + predictionsDF.count())
}

The original graph is a file of the form nodeAId, nodeBId. My results are of the form nodeAId, nodeBId, JaccardDistance. Both of them are dataframes. How can I evaluate my results and get an accuracy or F1 score?

I have read how to compute accuracy and F1 score, so I tried to write a function that calculates them. My approach is the code below.

def getStats(spark: SparkSession, nodeDF: DataFrame, pairsDF: DataFrame, predictionsDF: DataFrame, graphDF: DataFrame): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)

import spark.implicits._
val truePositives = graphDF.as("g").join(predictionsDF.as("p"),
  ($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
).count()

val df = pairsDF.as("p").join(graphDF.as("g"),
  ($"p.nodeA_id" === $"g.nodeA_id" && $"p.nodeB_id" === $"g.nodeB_id") || ($"p.nodeA_id" === $"g.nodeB_id" && $"p.nodeB_id" === $"g.nodeA_id")
).count()
println("True Positives: "+truePositives)

val falsePositives = predictionsDF.count() - truePositives
println("False Positives: "+falsePositives)

val trueNegatives = (pairsDF.count() - graphDF.count()) - falsePositives
println("True Negatives: "+trueNegatives)

val falseNegatives = graphDF.count()-truePositives
println("False Negatives: "+falseNegatives)

val truePN = (truePositives+trueNegatives).toFloat

val sum = (truePN + falseNegatives+ falsePositives).toFloat

val accuracy = (truePN/sum).toFloat
println("Accuracy: "+accuracy)

val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat

val f1Score = 2 * (recall * precision) / (recall + precision)
println("F1 score: " + f1Score)
}

But when I try to run it, it never finishes! I don't know how to improve or fix this in order to get the accuracy and F1 score. Is there any easier way to do this?

Thanks to all of you!

There are a few ways you can try to improve the execution performance:

  1. Caching: If it fits your setup, you can cache the nodeDF, pairsDF and predictionsDF dataframes before calling the getStats method. In the second part of your code, the same action is performed on the same dataframe several times (graphDF.count()). Since Spark follows lazy evaluation, each call triggers the computation again, so keep this value in a variable and reuse it; see the sketch after this list.

  2. Find the culprit: This is basically how I approach performance problems. When the Spark job is submitted, the Spark UI shows the whole execution plan built by Spark, including which tasks take the most time and other resources. You may need more resources, or some tuning so that less shuffling occurs between executors.

  3. Submit with optimal arguments: Before submitting the Spark job, make sure the job makes optimal use of the resources available in your setup. For more: optimal resource allocation
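Below is a minimal sketch of the caching idea, assuming the dataframe and column names from the question (nodeA_id, nodeB_id); getStatsCached is just an illustrative name, not an existing API, and you can adjust the storage level to your cluster:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: cache the inputs and trigger each count exactly once,
// then reuse the stored values to build the confusion-matrix entries.
def getStatsCached(spark: SparkSession, pairsDF: DataFrame,
                   predictionsDF: DataFrame, graphDF: DataFrame): Unit = {
  import spark.implicits._

  // Keep the dataframes that are hit by several actions in memory.
  pairsDF.cache()
  predictionsDF.cache()
  graphDF.cache()

  // One count per dataframe, stored in a variable instead of being recomputed.
  val pairsCount       = pairsDF.count()
  val predictionsCount = predictionsDF.count()
  val graphCount       = graphDF.count()

  // Predicted edges that also appear in the ground-truth graph (in either direction).
  val truePositives = graphDF.as("g").join(predictionsDF.as("p"),
    ($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") ||
    ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
  ).count()

  val falsePositives = predictionsCount - truePositives
  val trueNegatives  = (pairsCount - graphCount) - falsePositives
  val falseNegatives = graphCount - truePositives

  val accuracy  = (truePositives + trueNegatives).toFloat /
    (truePositives + trueNegatives + falsePositives + falseNegatives).toFloat
  val precision = truePositives.toFloat / (truePositives + falsePositives).toFloat
  val recall    = truePositives.toFloat / (truePositives + falseNegatives).toFloat
  val f1Score   = 2 * precision * recall / (precision + recall)

  println("Accuracy: " + accuracy + ", Precision: " + precision +
    ", Recall: " + recall + ", F1 score: " + f1Score)
}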
