How to do nearest-neighbor search using Spark
I am using a dataset taken from the link.
The idea is to hash and vectorize the strings in a column, then search the dataset for that vector or its nearest-neighbor vectors. I came up with the code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, HashingTF, IDF, MinHashLSH, Tokenizer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object UberLsh extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession
    .builder()
    .appName("TFIDFExample")
    .config("spark.sql.warehouse.dir", "file:///C:/temp")
    .master("local[*]")
    .getOrCreate()

  val df = spark.read.csv("FL_insurance_sample.csv")
  val dfUsed = df.select(col("_c0").as("title"), col("_c1").as("content"))
  dfUsed.show()

  // Tokenize the content column
  val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("words")
  val wordsDf = tokenizer.transform(dfUsed)
  println("Printing wordsDf")
  wordsDf.show(false)

  // Term-frequency vector for each row
  val hashingTF = new HashingTF()
    .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(10)
  val featurizedData = hashingTF.transform(wordsDf)
  println("Hashing DF show")
  featurizedData.show(false)

  // alternatively, CountVectorizer can also be used to get term frequency vectors
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)
  rescaledData.show(false)

  // Using LSH
  println("Using lsh")
  val mh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("hashValues")
  val model = mh.fit(rescaledData)
  println("Model Hashvalues Show")
  model.transform(rescaledData).show(false)

  val vocabSize = 100
  val cvModel: CountVectorizerModel = new CountVectorizer()
    .setInputCol("words").setOutputCol("features")
    .setVocabSize(vocabSize).setMinDF(10)
    .fit(wordsDf)

  val key = Vectors.sparse(vocabSize,
    Seq((cvModel.vocabulary.indexOf("Wood"), 1.0), (cvModel.vocabulary.indexOf("Wood"), 1.0)))

  val k = 40
  println("approxNearestNeighbors Show")
  model.approxNearestNeighbors(rescaledData, key, k).show(false)
}
Running the program above vectorizes the dataset, but it fails with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Found negative index: -1.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.SparseVector.<init>(Vectors.scala:576)
at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:226)
at com.sundogsoftware.spark.UberLsh$.delayedEndpoint$com$sundogsoftware$spark$UberLsh$1(UberLsh.scala:75)
at com.sundogsoftware.spark.UberLsh$delayedInit$body.apply(UberLsh.scala:18)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.sundogsoftware.spark.UberLsh$.main(UberLsh.scala:18)
at com.sundogsoftware.spark.UberLsh.main(UberLsh.scala)
Process finished with exit code 1
I have tried to find a resolution for this error, but without much luck. The sample data looks like the following; my CSV has only 2 columns:
Residential,Wood
Residential,Wood
Residential,Wood
Residential,Wood
Residential,Wood
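I suspect the negative index comes from `cvModel.vocabulary.indexOf("Wood")` returning -1, since Spark's `Tokenizer` lower-cases its input, so the `CountVectorizer` vocabulary would contain "wood" rather than "Wood". A minimal standalone check of that hypothesis, using a hypothetical hand-written vocabulary in place of the real `cvModel`:

```scala
// Hypothetical vocabulary, imitating what CountVectorizer would build
// after Tokenizer has lower-cased rows like "Residential,Wood".
object IndexOfCheck extends App {
  val vocabulary = Array("residential", "wood")

  // Exact-case lookup misses: Array.indexOf returns -1 for an absent element,
  // and Vectors.sparse then rejects the negative index.
  println(vocabulary.indexOf("Wood"))              // -1

  // Lower-casing the query term first finds the entry.
  println(vocabulary.indexOf("Wood".toLowerCase))  // 1
}
```

(`setMinDF(10)` could also drop a term from the vocabulary entirely if it appears in fewer than 10 rows, which would make `indexOf` fail the same way.)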
Is there anything I am missing above? Any help is appreciated, and thanks in advance.
A resource I found useful is: https://databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html