How to generate n-grams in Scala?
I am trying to code the Dissociated Press algorithm, based on n-grams, in Scala. How do I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees".
Can you please give me some hints on how to do it? Sorry for the inconvenience.
Your question could be a little more specific, but here is my try.
val words = "the bee is the bee of the bees"
// join each pair with a space so the words stay readable ("the bee", not "thebee")
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))
You can try this with a parameter n, which yields every n-gram from length 1 up to n:
val words = "the bee is the bee of the bees"
val w = words.split(" ")
val n = 4
val ngrams = (for( i <- 1 to n) yield w.sliding(i).map(p => p.toList)).flatMap(x => x)
ngrams foreach println
OUTPUT:
List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
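Since the question mentions Dissociated Press, here is a minimal sketch of how fixed-length n-grams could feed such a generator: each (n-1)-word prefix is mapped to the words observed right after it. The object name NGramTable and the helper successors are hypothetical names chosen for illustration, and this is only one possible way to organise the data, not a complete implementation of the algorithm.

```scala
object NGramTable {
  // Hypothetical helper: map each (n-1)-word prefix to the words that follow it.
  def successors(words: Seq[String], n: Int): Map[List[String], List[String]] =
    words.toList
      .sliding(n)
      .filter(_.length == n) // sliding yields one short window if there are fewer than n words
      .toList
      .groupBy(_.init)       // key: the first n-1 words of each n-gram
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }
}

object NGramTableDemo extends App {
  val words = "the bee is the bee of the bees".split(" ").toSeq
  val table = NGramTable.successors(words, 2)
  println(table(List("the"))) // List(bee, bee, bees)
}
```

A generator could then repeatedly look up the current prefix and pick one of its successors at random. Duplicates are kept in the lists on purpose, so more frequent successors would be picked more often.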
Here is a stream-based approach. It does not require too much memory while computing the n-grams.
object ngramstream extends App {
  // Apply f to every element of the stream, one element at a time.
  // Note: Stream is deprecated since Scala 2.13, where LazyList replaces it.
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))
  process(ngrams2) { x =>
    println(x.toList)
  }
}
OUTPUT:
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
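For a genuinely large file, the word array itself may be the main memory cost, because words.split(" ") holds every word at once. A possible alternative, sketched below under that assumption, is to work on an Iterator of lines so that only the current window of n words is held. The object name LazyNGrams and its ngrams helper are hypothetical names chosen for illustration.

```scala
object LazyNGrams {
  // Hypothetical helper: lazily turn an iterator of lines into n-grams of words.
  def ngrams(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines
      .flatMap(_.split("\\s+").iterator) // split one line at a time into words
      .filter(_.nonEmpty)                // skip empty tokens from blank lines or leading whitespace
      .sliding(n)
      .withPartial(false)                // drop a window shorter than n
}

object LazyNGramsDemo extends App {
  // In-memory lines stand in for a file here.
  val lines = Iterator("the bee is", "the bee of the bees")
  LazyNGrams.ngrams(lines, 3).foreach(g => println(g.toList))
}
```

With a real file, scala.io.Source.fromFile(path).getLines() should give a suitable iterator of lines; the source needs to be closed once the n-grams have been consumed. Note that n-grams here span line boundaries, which may or may not be what you want.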