
CoreNLP API for N-grams with position

Does CoreNLP have an API for getting ngrams with position etc.?

For example, I have the string "I have the best car". If I use mingrams=1 and maxgrams=2, I should get output like the following. I know StringUtils has an n-gram function, but how do I get the position of each n-gram?

(I,0)
(I have,0)
(have,1)
(have the,1)
(the,2)
(the best,2)

and so on, based on the string I pass in.

Any help is really appreciated.

Thanks

I don't see anything in the utils. Here is some sample code to help:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramPositionExample {

    // Returns every n-gram of minSize to maxSize tokens. Each n-gram is a list of
    // its tokens, with the index of its first token appended as the last element.
    public static List<List<String>> getNGramsPositions(List<String> items, int minSize, int maxSize) {
        List<List<String>> ngrams = new ArrayList<List<String>>();
        int listSize = items.size();
        for (int i = 0; i < listSize; ++i) {
            for (int ngramSize = minSize; ngramSize <= maxSize; ++ngramSize) {
                // Skip n-grams that would run past the end of the token list.
                if (i + ngramSize <= listSize) {
                    List<String> ngram = new ArrayList<String>();
                    for (int j = i; j < i + ngramSize; ++j) {
                        ngram.add(items.get(j));
                    }
                    // Append the start position as the final element.
                    ngram.add(Integer.toString(i));
                    ngrams.add(ngram);
                }
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        String testString = "I have the best car";
        List<String> tokens = Arrays.asList(testString.split(" "));
        List<List<String>> ngramsAndPositions = getNGramsPositions(tokens, 1, 2);
        for (List<String> np : ngramsAndPositions) {
            System.out.println(Arrays.toString(np.toArray()));
        }
    }
}

You can just cut and paste that utility method.

This might be a useful functionality to add, so I will put this on our list of things to work on.
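
If you would rather not parse the position back out of the last list element, here is a minimal sketch of a typed variant. The names `NGramPositions` and `PositionedNGram` are hypothetical and not part of CoreNLP. It runs the same loop as the method above, assumes 1 <= minSize <= maxSize, and treats the position as the index of the n-gram's first token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not a CoreNLP class.
public class NGramPositions {

    // An n-gram (tokens joined by single spaces) plus the index of its first token.
    public static final class PositionedNGram {
        public final String text;
        public final int position;

        public PositionedNGram(String text, int position) {
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return "(" + text + "," + position + ")";
        }
    }

    // Assumes 1 <= minSize <= maxSize.
    public static List<PositionedNGram> ngrams(List<String> tokens, int minSize, int maxSize) {
        List<PositionedNGram> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); ++start) {
            for (int size = minSize; size <= maxSize; ++size) {
                // Only emit n-grams that fit inside the token list.
                if (start + size <= tokens.size()) {
                    String text = String.join(" ", tokens.subList(start, start + size));
                    result.add(new PositionedNGram(text, start));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I have the best car".split(" "));
        for (PositionedNGram ngram : ngrams(tokens, 1, 2)) {
            System.out.println(ngram);
        }
    }
}
```

Running `main` prints one pair per line, starting with (I,0) and (I have,0) and ending with (car,4), which is the format asked for in the question.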

I spent some time rewriting it in Scala. It is just the code above, ported to Scala. With the same input, the output will look like this:

NgramInfo(I,0)
NgramInfo(I have,0)
NgramInfo(have,1)
NgramInfo(have the,1)
NgramInfo(the,2)
NgramInfo(the best,2)
NgramInfo(best,3)
NgramInfo(best car,3)
NgramInfo(car,4)

Below is the method, followed by its case class:

import scala.collection.mutable.ListBuffer

def getNgramPositions(items: List[String], minSize: Int, maxSize: Int): List[NgramInfo] = {
  val ngramList = new ListBuffer[NgramInfo]
  for (i <- items.indices) {
    // `to` is inclusive, so n-grams of exactly maxSize tokens are included
    for (ngramSize <- minSize to maxSize) {
      if (i + ngramSize <= items.size) {
        // slice(i, i + ngramSize) takes the tokens at indices i until i + ngramSize
        val term = items.slice(i, i + ngramSize).mkString(" ")
        ngramList += NgramInfo(term, i)
      }
    }
  }
  ngramList.toList
}

case class NgramInfo(term: String, termPosition: Int) extends Serializable

Thanks
