Calculate cosine similarity in Scala
I have a file (tags.csv) that contains UserId, MovieId, and tags. I want to use a domain-based method to calculate the cosine similarity between tags. I want to show only the tags relevant to comedy and measure the similarity of each of those tags to the comedy tag.
My code is:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

val rows = sc.textFile("/usr/local/comedy")
val vecData = rows.map(line => Vectors.dense(line.split(", ").map(_.toDouble)))
val mat = new RowMatrix(vecData)
val exact = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.07)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
  case (u, Some(v)) =>
    math.abs(u - v)
  case (u, None) =>
    math.abs(u)
}.mean()
But this error appears:
java.lang.NumberFormatException: For input string: "[1,898,"black comedy"]"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
What's wrong?
The error message is full of pertinent info.
NumberFormatException: For input string: "[1,898,"black comedy"]"
It looks like the input String isn't being split into separate column data. So .split(", ") isn't doing its job, and it's easy to see why: there are no comma-space sequences to split on.
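To see the failure outside Spark, here is a minimal sketch in plain Scala. The sample line is copied from the exception message; the object name SplitDemo is only illustrative and is not part of the original code.

```scala
import scala.util.Try

object SplitDemo {
  // Sample line copied from the exception message in the question.
  val line: String = "[1,898,\"black comedy\"]"

  // The line contains no ", " sequence, so split returns the whole
  // line as a single-element array.
  val parts: Array[String] = line.split(", ")

  // Converting that single element to a Double throws a
  // NumberFormatException, which Try captures as a Failure.
  val parsed: Try[Array[Double]] = Try(parts.map(_.toDouble))

  def main(args: Array[String]): Unit = {
    println(s"number of parts: ${parts.length}")
    println(s"parse failed: ${parsed.isFailure}")
  }
}
```

The whole line reaches toDouble unchanged, which is why the exception message quotes the full line as the input string.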
We could take out the space and split on just the comma, but that would still leave a non-digit [ in the 1st column data, and the 3rd column data has no digit characters at all.
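A small sketch of what splitting on a bare comma would produce for the same sample line. This is plain Scala with no Spark, and the object name CommaSplitDemo is hypothetical.

```scala
import scala.util.Try

object CommaSplitDemo {
  // Sample line copied from the exception message in the question.
  val line: String = "[1,898,\"black comedy\"]"

  // Splitting on a bare comma does yield three fields.
  val fields: Array[String] = line.split(",")

  // Attempt to parse each field; None marks a field that is not a number.
  val parsed: Array[Option[Double]] = fields.map(f => Try(f.toDouble).toOption)

  def main(args: Array[String]): Unit = {
    fields.zip(parsed).foreach { case (f, p) => println(s"$f -> $p") }
  }
}
```

Only the middle field parses: the first still carries the leading bracket, and the third is the quoted tag text.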
There are a few different ways to attack this. I'd be tempted to use a regex parser.
val twoNums = "(\\d+),(\\d+),".r.unanchored
val vecData = rows.collect { case twoNums(a, b) =>
  Vectors.dense(Array(a.toDouble, b.toDouble))
}
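The same pattern can be tried without a Spark cluster. The sketch below uses an ordinary Seq[String] as a stand-in for the RDD and plain arrays instead of Vectors.dense. Only the first sample line comes from the question; the other two are made up for illustration, and the object name RegexDemo is hypothetical.

```scala
object RegexDemo {
  // Same pattern as above: two comma-terminated runs of digits,
  // matched anywhere in the line because the regex is unanchored.
  val twoNums = "(\\d+),(\\d+),".r.unanchored

  val lines: Seq[String] = Seq(
    "[1,898,\"black comedy\"]",  // from the exception message
    "[2,1234,\"dark humor\"]",   // made-up sample line
    "UserId,MovieId,tags"        // made-up header line with no digits
  )

  // collect keeps only the lines the pattern matches, so the header
  // line is dropped without raising an error.
  val pairs: Seq[Array[Double]] = lines.collect {
    case twoNums(a, b) => Array(a.toDouble, b.toDouble)
  }

  def main(args: Array[String]): Unit = {
    pairs.foreach(p => println(p.mkString(", ")))
  }
}
```

Note that this approach keeps only the two numeric columns; the tag text is discarded, so it fixes the parse error but does not by itself give a similarity between tags.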