
Stanford Dependency Parser - how to get spans?

I'm doing dependency parsing with the Stanford library in Java. Is there any way to recover, for each dependency, the character offsets of its tokens within my original string? I have tried calling the getSpan() method, but it returns null for every token:

LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
        "-maxLength", "80", "-retainTmpSubcategories");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
Tree parse = lp.apply(text);
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsedTree();
for (TypedDependency td : tdl)
{
    td.gov().getSpan();  // it's null!
    td.dep().getSpan();  // it's null!
}

Any idea?

I finally ended up writing my own helper function to get the spans out of my original string:

// Collect the leaf tokens of the parse tree in order, then align them
// with the original string to recover character spans.
public HashMap<Integer, TokenSpan> getTokenSpans(String text, Tree parse)
{
    List<String> tokens = new ArrayList<String>();
    traverse(tokens, parse.getChildrenAsList());
    return extractTokenSpans(text, tokens);
}

// Depth-first walk that appends each leaf's surface form to 'tokens'.
private void traverse(List<String> tokens, List<Tree> children)
{
    if(children == null)
        return;
    for(Tree child : children)
    {
        if(child.isLeaf())
        {
            tokens.add(child.value());
        }
        traverse(tokens, child.getChildrenAsList());
    }
}

private HashMap<Integer, TokenSpan> extractTokenSpans(String text, List<String> tokens)
{
    HashMap<Integer, TokenSpan> result = new HashMap<Integer, TokenSpan>();
    int spanStart, spanEnd;

    int actCharIndex = 0;
    int actTokenIndex = 0;
    char actChar;
    while(actCharIndex < text.length())
    {
        actChar = text.charAt(actCharIndex);
        if(Character.isWhitespace(actChar))  // skip any separator, not just ' '
        {
            actCharIndex++;
        }
        else
        {
            spanStart = actCharIndex;
            String actToken = tokens.get(actTokenIndex);
            int tokenCharIndex = 0;
            while(tokenCharIndex < actToken.length() && text.charAt(actCharIndex) == actToken.charAt(tokenCharIndex))
            {
                tokenCharIndex++;
                actCharIndex++;
            }

            if(tokenCharIndex != actToken.length())
            {
                throw new IllegalArgumentException(
                        "Token \"" + actToken + "\" does not match text at offset " + spanStart);
            }
            actTokenIndex++;
            spanEnd = actCharIndex;
            // keys end up 1-based (actTokenIndex is incremented first),
            // which lines up with Stanford's 1-based dependency indices
            result.put(actTokenIndex, new TokenSpan(spanStart, spanEnd));
        }
    }
    return result;
}

Then I call

 getTokenSpans(originalString, parse)

and get back a map from each (1-based) token index to that token's character span in the original string. It's not an elegant solution, but at least it works.
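To illustrate the alignment walk on its own, here is a self-contained sketch of the same technique, with a minimal `TokenSpan` stand-in (the original class isn't shown, so its shape here is assumed):

```java
import java.util.*;

public class TokenSpanDemo {
    // Minimal stand-in for the TokenSpan class from the answer (assumed shape).
    static class TokenSpan {
        final int start, end;
        TokenSpan(int start, int end) { this.start = start; this.end = end; }
    }

    // Same idea as extractTokenSpans above: advance through the original
    // string, skipping whitespace, and record where each token begins/ends.
    // Keys are 1-based, matching Stanford's 1-based dependency indices.
    static Map<Integer, TokenSpan> extractTokenSpans(String text, List<String> tokens) {
        Map<Integer, TokenSpan> result = new HashMap<>();
        int charIndex = 0, tokenIndex = 0;
        while (charIndex < text.length() && tokenIndex < tokens.size()) {
            if (Character.isWhitespace(text.charAt(charIndex))) {
                charIndex++;
                continue;
            }
            int spanStart = charIndex;
            String token = tokens.get(tokenIndex);
            int t = 0;
            while (t < token.length() && text.charAt(charIndex) == token.charAt(t)) {
                t++;
                charIndex++;
            }
            if (t != token.length())
                throw new IllegalArgumentException(
                        "Token \"" + token + "\" does not match text at offset " + spanStart);
            tokenIndex++;
            result.put(tokenIndex, new TokenSpan(spanStart, charIndex));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The dog barks.";
        // Tokens as a parser would yield them (punctuation split off).
        List<String> tokens = Arrays.asList("The", "dog", "barks", ".");
        Map<Integer, TokenSpan> spans = extractTokenSpans(text, tokens);
        System.out.println(spans.get(2).start + "-" + spans.get(2).end); // prints "4-7"
    }
}
```

Note this simple character-by-character match breaks down when the tokenizer normalizes the surface form (e.g. PTB's `-LRB-` for `(`), so tokens taken straight from the tree may need un-normalizing first.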

Even though you've answered your own question already and this is an old thread: I stumbled upon the same problem today, but with the (Stanford) LexicalizedParser rather than the dependency parser. I haven't tested it for the dependency case, but the following solved my problem in the LexicalizedParser scenario:

// yieldWords() returns the tree's leaves in order; the Words carry
// character offsets when the tree was built from tokens that had them.
List<Word> wl = tree.yieldWords();
int begin = wl.get(0).beginPosition();
int end = wl.get(wl.size() - 1).endPosition();
Span sp = new Span(begin, end);

The Span then holds the start and end character indices of the (sub)tree's yield. (And if you go all the way down to the terminals, the same should work at token level.)
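The first-leaf/last-leaf idea also combines with the manual map from the accepted answer: the span of any phrase is simply the start of its first token and the end of its last token. A self-contained sketch (again with a minimal `TokenSpan` stand-in, assumed because the original class isn't shown):

```java
import java.util.*;

public class SubtreeSpanDemo {
    // Minimal stand-in for the TokenSpan class from the answer (assumed shape).
    static class TokenSpan {
        final int start, end;
        TokenSpan(int start, int end) { this.start = start; this.end = end; }
    }

    // Character span of a phrase covering tokens firstIdx..lastIdx
    // (1-based, inclusive, as produced by getTokenSpans above):
    // start of the first token, end of the last token.
    static int[] subtreeSpan(Map<Integer, TokenSpan> spans, int firstIdx, int lastIdx) {
        return new int[] { spans.get(firstIdx).start, spans.get(lastIdx).end };
    }

    public static void main(String[] args) {
        // Token spans for "The dog barks", as the helper would produce them.
        Map<Integer, TokenSpan> spans = new HashMap<>();
        spans.put(1, new TokenSpan(0, 3));   // "The"
        spans.put(2, new TokenSpan(4, 7));   // "dog"
        spans.put(3, new TokenSpan(8, 13));  // "barks"

        int[] np = subtreeSpan(spans, 1, 2); // span of the NP "The dog"
        System.out.println(np[0] + "-" + np[1]); // prints "0-7"
    }
}
```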

Hope this helps someone else running into the same problem!
