How to generate n-grams in Scala?

I am trying to code the dissociated press algorithm based on n-grams in Scala. How do I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees":

  1. First it has to pick a random n-gram. For example, the bee.
  2. Then it has to look for n-grams starting with the last (n-1) words of the previous n-gram. For example, bee of.
  3. It prints the last word of this n-gram, then repeats.

Can you please give me some hints on how to do it? Sorry for the inconvenience.

Your question could be a little more specific, but here is my try.

val words = "the bee is the bee of the bees"
// Print every bigram (window of 2 consecutive words), one per line
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))

You can generalize this with a parameter n:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
// Collect all 1-grams, then all 2-grams, ... up to all n-grams
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams foreach println

List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
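
The three steps in the question can be built on top of `sliding`. The following is only a minimal sketch, not a definitive implementation: the names `DissociatedPress`, `followers`, `dissociate`, and the `maxWords` cut-off are made up for illustration, and the whole text is assumed to fit in memory as a `Vector[String]`. It groups the n-grams by their first (n-1) words, so that looking up the next word is a map lookup.

```scala
import scala.annotation.tailrec
import scala.util.Random

object DissociatedPress {
  // Map each (n-1)-word prefix to the words that follow it somewhere in the text.
  def followers(words: Vector[String], n: Int): Map[Vector[String], Vector[String]] =
    words.sliding(n).filter(_.length == n).toVector
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Step 1: pick a random n-gram. Steps 2 and 3: repeatedly pick a random word
  // that follows the last (n-1) words, until maxWords is reached or no n-gram
  // starts with the current prefix.
  def dissociate(words: Vector[String], n: Int, maxWords: Int, rng: Random): Vector[String] = {
    require(n >= 1, "n must be at least 1")
    val grams = words.sliding(n).filter(_.length == n).toVector
    if (grams.isEmpty) Vector.empty
    else {
      val table = followers(words, n)

      @tailrec
      def loop(out: Vector[String], prefix: Vector[String]): Vector[String] =
        if (out.length >= maxWords) out
        else table.get(prefix) match {
          case Some(nexts) =>
            val next = nexts(rng.nextInt(nexts.length))
            loop(out :+ next, (prefix :+ next).drop(1))
          case None => out // dead end: nothing follows this prefix
        }

      val start = grams(rng.nextInt(grams.length))
      loop(start, start.drop(1))
    }
  }

  def main(args: Array[String]): Unit = {
    val text = "the bee is the bee of the bees".split(" ").toVector
    println(dissociate(text, 2, 10, new Random()).mkString(" "))
  }
}
```

Because a prefix that occurs several times keeps every follower (duplicates included), words that follow a prefix more often are picked more often. Passing a seeded `Random` makes a run repeatable.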

Here is a stream-based approach. It does not require much memory while computing the n-grams.

object ngramstream extends App {

  // Walk the stream, applying f to each n-gram in turn
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  // Lazily concatenate the 2-grams, 3-grams, ... n-grams into one stream
  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

Output:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
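
One caveat: a `Stream` memoizes the elements it has already produced, so as long as a reference to its head is kept (such as the `ngrams2` value), everything computed so far stays in memory. For a large file, an `Iterator` may be a better fit. The following is a minimal sketch under that assumption; `NgramIterator`, `tokenize`, and `ngrams` are illustrative names, and unlike the code above it yields the n-grams for a single `n` only.

```scala
object NgramIterator {
  // Split lines into words lazily; empty tokens from leading whitespace are dropped.
  def tokenize(lines: Iterator[String]): Iterator[String] =
    lines.flatMap(_.split("\\s+").iterator).filter(_.nonEmpty)

  // Produce the n-grams one at a time; only the current window is held.
  // withPartial(false) drops a trailing window shorter than n.
  def ngrams(n: Int, words: Iterator[String]): Iterator[List[String]] =
    words.sliding(n).withPartial(false).map(_.toList)

  def main(args: Array[String]): Unit = {
    // In-memory lines stand in for a file here.
    val lines = Iterator("the bee is the bee", "of the bees")
    ngrams(3, tokenize(lines)).foreach(println)
  }
}
```

To read from a real file, the lines could come from `scala.io.Source.fromFile(path).getLines()`, which also returns an `Iterator[String]`; remember to close the `Source` afterwards.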
