
How to do Nearest neighbor Search using Spark

I am working with a dataset obtained from the link.

The idea is to hash and vectorize the column strings, and then search the dataset for a vector or its nearest neighbor vectors. I came up with the code below:


import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, HashingTF, IDF, MinHashLSH, Tokenizer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col



object UberLsh extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession
    .builder()
    .appName("TFIDFExample")
    .config("spark.sql.warehouse.dir", "file:///C:/temp")
    .master("local[*]")
    .getOrCreate()


  val df = spark.read.csv("FL_insurance_sample.csv")
  val dfUsed = df.select(col("_c0").as("title"), col("_c1").as("content"))
  dfUsed.show()

  // Tokenize the content column
  val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("words")
  val wordsDf = tokenizer.transform(dfUsed)
  println("Printing wordsDf")
  wordsDf.show(false)

  // Build a term-frequency vector for each content row
  val hashingTF = new HashingTF()
    .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(10)
  val featurizedData = hashingTF.transform(wordsDf)
  println("Hashing DF show")
  featurizedData.show(false)

  // alternatively, CountVectorizer can also be used to get term frequency vectors
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)

  rescaledData.show(false)

  //Using lsh
  println("Using lsh")
  val mh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("hashValues")
  val model = mh.fit(rescaledData)
  println("Model Hashvalues Show")
  model.transform(rescaledData).show(false)


  val vocabSize = 100
  val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(wordsDf)

  val key = Vectors.sparse(vocabSize, Seq((cvModel.vocabulary.indexOf("Wood"), 1.0), (cvModel.vocabulary.indexOf("Wood"), 1.0)))
  val k = 40
  println("approxNearestNeighbors Show")
  model.approxNearestNeighbors(rescaledData, key, k).show(false)

}

Running the above program vectorizes the dataset, but it then fails with the following error:


Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Found negative index: -1.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.linalg.SparseVector.<init>(Vectors.scala:576)
    at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:226)
    at com.sundogsoftware.spark.UberLsh$.delayedEndpoint$com$sundogsoftware$spark$UberLsh$1(UberLsh.scala:75)
    at com.sundogsoftware.spark.UberLsh$delayedInit$body.apply(UberLsh.scala:18)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.sundogsoftware.spark.UberLsh$.main(UberLsh.scala:18)
    at com.sundogsoftware.spark.UberLsh.main(UberLsh.scala)

Process finished with exit code 1

I have tried to find a fix for this error but without much luck. The sample data looks like the following; my CSV has only 2 columns:

Residential,Wood
Residential,Wood
Residential,Wood
Residential,Wood
Residential,Wood
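
Since the stack trace points at the Vectors.sparse call that builds the query key, one way to narrow things down is to inspect the vocabulary lookup it relies on. A minimal sanity-check sketch (reusing the cvModel and vocabSize defined in the code above) could look like this:


  // Print the learned vocabulary and the index that would be used to build the query key.
  println("Vocabulary: " + cvModel.vocabulary.mkString(", "))

  val woodIndex = cvModel.vocabulary.indexOf("Wood")
  println(s"indexOf(Wood) = $woodIndex")

  // Vectors.sparse rejects negative indices, so only build the key if the term was found.
  if (woodIndex >= 0) {
    val key = Vectors.sparse(vocabSize, Seq((woodIndex, 1.0)))
    println(key)
  }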

Is there something I am missing above? Any help is appreciated, thanks in advance.

A resource I found useful is: https://databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html
