
Calculate cosine similarity in Scala

I have a file (tags.csv) that contains UserId, MovieId, and tags. I want to use a domain-based method to calculate the cosine similarity between tags. I want to show only the tags relevant to comedy and measure the similarity of each of those tags to the comedy tag.

[Figure: dataset, an example of the data in the file]

My code is:

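import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

val rows = sc.textFile("/usr/local/comedy")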
val vecData = rows.map(line => Vectors.dense(line.split(", ").map(_.toDouble)))
val mat = new RowMatrix(vecData)
val exact = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.07)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
  case (u, Some(v)) =>
    math.abs(u - v)
  case (u, None) =>
    math.abs(u)
}.mean()

but this error appears:

java.lang.NumberFormatException: For input string: "[1,898,"black comedy"]"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)

What's wrong?

The error message is full of pertinent info.

NumberFormatException: For input string: "[1,898,"black comedy"]"

It looks like the input String isn't being split into separate column data. So .split(", ") isn't doing its job, and it's easy to see why: there are no comma-space sequences to split on.

We could take out the space and split on just the comma, but that would still leave a non-digit [ in the 1st column data, and the 3rd column data has no digit characters at all.
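The two splitting attempts described above can be checked without Spark. This is a minimal sketch in plain Scala; the object name SplitDemo is only for illustration, and the sample line is the one quoted in the exception message.

```scala
import scala.util.Try

object SplitDemo {
  // The line reported in the NumberFormatException message.
  val line = "[1,898,\"black comedy\"]"

  // No comma-space sequence exists, so the whole line comes back as one element.
  val bySpace: Array[String] = line.split(", ")

  // Splitting on the bare comma yields three fields: [1  898  "black comedy"]
  val byComma: Array[String] = line.split(",")

  // Only the middle field is a clean number; the others fail to parse.
  val parsed: Array[Option[Double]] = byComma.map(s => Try(s.toDouble).toOption)

  def main(args: Array[String]): Unit = {
    println(s"split on \", \": ${bySpace.length} field(s)")
    println(s"split on \",\":  ${byComma.mkString(" | ")}")
    println(s"parsed:         ${parsed.mkString(", ")}")
  }
}
```

Calling toDouble directly on bySpace.head reproduces the exception from the question, because that single element is still the whole bracketed line.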

There are a few different ways to attack this. I'd be tempted to use a regex parser.

val twoNums = "(\\d+),(\\d+),".r.unanchored
val vecData = rows.collect{ case twoNums(a, b) =>
                Vectors.dense(Array(a.toDouble, b.toDouble))
              }
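To see what the pattern extracts, here is a small sketch that applies the same regex to an ordinary List instead of an RDD (List.collect accepts a partial function in the same way). The object name RegexDemo and the second and third sample lines are made up for illustration; only the first line comes from the error message.

```scala
object RegexDemo {
  // Same pattern as in the answer; unanchored lets it match anywhere in the line.
  val twoNums = "(\\d+),(\\d+),".r.unanchored

  val lines: List[String] = List(
    "[1,898,\"black comedy\"]",   // from the error message
    "[2,1204,\"dark humor\"]",    // hypothetical extra row
    "UserId,MovieId,tags"         // hypothetical header line with no digits
  )

  // collect keeps only the lines the pattern matches and converts the
  // two captured groups to Double; non-matching lines are silently dropped.
  val pairs: List[(Double, Double)] =
    lines.collect { case twoNums(a, b) => (a.toDouble, b.toDouble) }

  def main(args: Array[String]): Unit =
    pairs.foreach(println)
}
```

Note that the pattern keeps only the two numeric fields, so the tag text itself does not end up in the resulting vectors.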
