
Stanford Dependency Parser - how to get spans?

I'm doing dependency parsing with the Stanford library in Java. Is there any way to recover, for each dependency, the character offsets of its tokens within my original string? I have tried calling the getSpan() method, but it returns null for every token:

LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
        "-maxLength", "80", "-retainTmpSubcategories");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
Tree parse = lp.apply(text);
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsedTree();
for (TypedDependency td : tdl)
{
    td.gov().getSpan();  // it's null!
    td.dep().getSpan();  // it's null!
}

Any idea?

I finally ended up writing my own helper function to get the spans out of my original string:

// Collect the leaf tokens of the parse tree in order, then align them
// with the original string to recover character spans.
public HashMap<Integer, TokenSpan> getTokenSpans(String text, Tree parse)
{
    List<String> tokens = new ArrayList<String>();
    traverse(tokens, parse.getChildrenAsList());
    return extractTokenSpans(text, tokens);
}

// Depth-first walk that appends each leaf's surface form to 'tokens'.
private void traverse(List<String> tokens, List<Tree> children)
{
    if(children == null)
        return;
    for(Tree child : children)
    {
        if(child.isLeaf())
        {
            tokens.add(child.value());
        }
        traverse(tokens, child.getChildrenAsList());
    }
}

private HashMap<Integer, TokenSpan> extractTokenSpans(String text, List<String> tokens)
{
    HashMap<Integer, TokenSpan> result = new HashMap<Integer, TokenSpan>();
    int spanStart, spanEnd;

    int actCharIndex = 0;
    int actTokenIndex = 0;
    char actChar;
    while(actCharIndex < text.length())
    {
        actChar = text.charAt(actCharIndex);
        if(Character.isWhitespace(actChar))  // skip any separator, not just ' '
        {
            actCharIndex++;
        }
        else
        {
            spanStart = actCharIndex;
            String actToken = tokens.get(actTokenIndex);
            int tokenCharIndex = 0;
            while(tokenCharIndex < actToken.length() && text.charAt(actCharIndex) == actToken.charAt(tokenCharIndex))
            {
                tokenCharIndex++;
                actCharIndex++;
            }

            if(tokenCharIndex != actToken.length())
            {
                throw new IllegalArgumentException(
                        "Token \"" + actToken + "\" does not match text at offset " + spanStart);
            }
            actTokenIndex++;
            spanEnd = actCharIndex;
            // keys end up 1-based (actTokenIndex is incremented first),
            // which lines up with Stanford's 1-based dependency indices
            result.put(actTokenIndex, new TokenSpan(spanStart, spanEnd));
        }
    }
    return result;
}

Then I call

 getTokenSpans(originalString, parse)

and get back a map from each (1-based) token index to that token's character span in the original string. It's not an elegant solution, but at least it works.
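To illustrate the alignment walk on its own, here is a self-contained sketch of the same technique, with a minimal `TokenSpan` stand-in (the original class isn't shown, so its shape here is assumed):

```java
import java.util.*;

public class TokenSpanDemo {
    // Minimal stand-in for the TokenSpan class from the answer (assumed shape).
    static class TokenSpan {
        final int start, end;
        TokenSpan(int start, int end) { this.start = start; this.end = end; }
    }

    // Same idea as extractTokenSpans above: advance through the original
    // string, skipping whitespace, and record where each token begins/ends.
    // Keys are 1-based, matching Stanford's 1-based dependency indices.
    static Map<Integer, TokenSpan> extractTokenSpans(String text, List<String> tokens) {
        Map<Integer, TokenSpan> result = new HashMap<>();
        int charIndex = 0, tokenIndex = 0;
        while (charIndex < text.length() && tokenIndex < tokens.size()) {
            if (Character.isWhitespace(text.charAt(charIndex))) {
                charIndex++;
                continue;
            }
            int spanStart = charIndex;
            String token = tokens.get(tokenIndex);
            int t = 0;
            while (t < token.length() && text.charAt(charIndex) == token.charAt(t)) {
                t++;
                charIndex++;
            }
            if (t != token.length())
                throw new IllegalArgumentException(
                        "Token \"" + token + "\" does not match text at offset " + spanStart);
            tokenIndex++;
            result.put(tokenIndex, new TokenSpan(spanStart, charIndex));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The dog barks.";
        // Tokens as a parser would yield them (punctuation split off).
        List<String> tokens = Arrays.asList("The", "dog", "barks", ".");
        Map<Integer, TokenSpan> spans = extractTokenSpans(text, tokens);
        System.out.println(spans.get(2).start + "-" + spans.get(2).end); // prints "4-7"
    }
}
```

Note this simple character-by-character match breaks down when the tokenizer normalizes the surface form (e.g. PTB's `-LRB-` for `(`), so tokens taken straight from the tree may need un-normalizing first.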

Even though you've answered your own question already and this is an old thread: I stumbled upon the same problem today, but with the (Stanford) LexicalizedParser rather than the dependency parser. I haven't tested it for the dependency case, but the following solved my problem in the LexicalizedParser scenario:

// yieldWords() returns the tree's leaves in order; the Words carry
// character offsets when the tree was built from tokens that had them.
List<Word> wl = tree.yieldWords();
int begin = wl.get(0).beginPosition();
int end = wl.get(wl.size() - 1).endPosition();
Span sp = new Span(begin, end);

The Span then holds the start and end character indices of the (sub)tree's yield. (And if you go all the way down to the terminals, the same should work at token level.)
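The first-leaf/last-leaf idea also combines with the manual map from the accepted answer: the span of any phrase is simply the start of its first token and the end of its last token. A self-contained sketch (again with a minimal `TokenSpan` stand-in, assumed because the original class isn't shown):

```java
import java.util.*;

public class SubtreeSpanDemo {
    // Minimal stand-in for the TokenSpan class from the answer (assumed shape).
    static class TokenSpan {
        final int start, end;
        TokenSpan(int start, int end) { this.start = start; this.end = end; }
    }

    // Character span of a phrase covering tokens firstIdx..lastIdx
    // (1-based, inclusive, as produced by getTokenSpans above):
    // start of the first token, end of the last token.
    static int[] subtreeSpan(Map<Integer, TokenSpan> spans, int firstIdx, int lastIdx) {
        return new int[] { spans.get(firstIdx).start, spans.get(lastIdx).end };
    }

    public static void main(String[] args) {
        // Token spans for "The dog barks", as the helper would produce them.
        Map<Integer, TokenSpan> spans = new HashMap<>();
        spans.put(1, new TokenSpan(0, 3));   // "The"
        spans.put(2, new TokenSpan(4, 7));   // "dog"
        spans.put(3, new TokenSpan(8, 13));  // "barks"

        int[] np = subtreeSpan(spans, 1, 2); // span of the NP "The dog"
        System.out.println(np[0] + "-" + np[1]); // prints "0-7"
    }
}
```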

Hope this helps someone else running into the same problem!
