Forcing the Stanford parser to accept POS tags not licensed by the parser's lexicon

I have a file of pretokenized sentences, some of which are in the imperative (implicit subject, verb first, etc.). Without any partial tagging, the Stanford parser wrongly tags the first word (a verb) as a noun in the subject of most, but not all, of these imperative sentences. With partial tagging on the first words of these sentences (using _VB), it behaves no differently than if the tags were not there. I am fairly certain I am doing the partial tagging correctly: I've edited and recompiled LexicalizedParser to make sure the relevant command-line options are recognized and end up in the right place within LexicalizedParser.java.

According to the lexparser package summary (look about 60% of the way down the page for "There are some restrictions on the interpretation...") this is because putting the POS tag VB on some of these words is just too weird for the parser to believe.

How do I get the parser to read and follow all the tags (preferably from the command-line)? Update the lexicon?

Using EnglishFactored.ser.gz rather than EnglishPCFG.ser.gz lessens this problem, but it does not go away.

Someone posted a similar question to the Stanford [parser-user] mailing list a couple of years ago, but I can't find an answer to that post.

EDIT: Using another version of the parser (from August 20th, 2010), this problem does not seem to occur at all.

There is at present no way to make the parser tag things in a way that it regards as "too weird". If it regards a tag for a word as impossible, you can't make it possible, but you can specify which tag it should use within the range of what it regards as possible. Normally this is enough, and it should be enough here. Here's an example. As you note, it often gets imperatives wrong unaided (partly because they're not well-evidenced in the training data). It doesn't always get them wrong, but it commonly does, and I chose 3 that it does get wrong:

$ cat imper.txt
Use care when opening.
Brush your hair!
Shut the door.
$ java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser englishPCFG.ser.gz imper.txt 2> /dev/null
(ROOT
  (S
    (NP (NNP Use))
    (VP (VBP care)
      (SBAR
        (WHADVP (WRB when))
        (S
          (VP (VBG opening)))))
    (. .)))

(ROOT
  (NP
    (NP (NNP Brush))
    (NP (PRP$ your) (NN hair))
    (. !)))

(ROOT
  (NP
    (NP (NNP Shut))
    (NP (DT the) (NN door))
    (. .)))

But with tokenized and partly tagged text like this:

$ cat imper.tok
Use_VB care when opening .
Brush_VB your hair !
Shut_VB the door .

all is fixed:

$ java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -tagSeparator _ englishPCFG.ser.gz imper.tok 2> /dev/null
(ROOT
  (S
    (VP (VB Use)
      (NP (NN care))
      (SBAR
        (WHADVP (WRB when))
        (S
          (VP (VBG opening)))))
    (. .)))

(ROOT
  (S
    (VP (VB Brush)
      (NP (PRP$ your) (NN hair)))
    (. !)))

(ROOT
  (S
    (VP (VB Shut)
      (NP (DT the) (NN door)))
    (. .)))
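
For a larger file, the partly tagged input can be generated rather than edited by hand. The following is only a minimal sketch in Python, under the assumptions that each line holds one pretokenized sentence, that tokens are separated by whitespace, and that the first token of every line really is a bare imperative verb. The function name `tag_first_token` is hypothetical and is not part of the parser.

```python
def tag_first_token(line, tag="VB", sep="_"):
    """Append sep + tag to the first whitespace-separated token of a line.

    Assumes one pretokenized sentence per line. A first token that already
    contains the separator is left alone, so existing tags are not doubled.
    """
    tokens = line.split()
    if not tokens:
        return ""
    if sep not in tokens[0]:
        tokens[0] = tokens[0] + sep + tag
    return " ".join(tokens)


# The three sentences from imper.txt, already tokenized.
sentences = [
    "Use care when opening .",
    "Brush your hair !",
    "Shut the door .",
]
tagged = [tag_first_token(s) for s in sentences]
print("\n".join(tagged))
```

Writing the printed lines to a file gives the same content as imper.tok above. Applying this blindly to sentences that are not imperatives would hand the parser tags it cannot use, so the lines to tag need to be selected first.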

But you do have to use the right tags. It won't tag "Using" as a VB. That counts as too weird. "Using" as a verb should be a VBG. It's the present participle form, not the bare verb used in imperatives.
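
Since, as the question reports, a tag the parser will not accept seems to be ignored rather than rejected with an error, it can be worth checking the output to see whether the requested tag was actually used. Here is a rough sketch in Python, assuming the bracketed tree format shown above. The names `tagged_words` and `first_tag_honored` are made up for illustration, and the regular expression is a simplification, not a full tree reader.

```python
import re

# Matches a preterminal such as "(VB Use)": a tag, one space, a word,
# with no nested brackets inside.
PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tagged_words(tree):
    """Return (tag, word) pairs in left-to-right order from a bracketed tree."""
    return PRETERMINAL.findall(tree)


def first_tag_honored(tree, expected_tag):
    """True if the first word of the tree carries the expected POS tag."""
    pairs = tagged_words(tree)
    return bool(pairs) and pairs[0][0] == expected_tag


# Trees copied from the outputs above, flattened onto one line.
good = "(ROOT (S (VP (VB Shut) (NP (DT the) (NN door))) (. .)))"
bad = "(ROOT (NP (NP (NNP Shut)) (NP (DT the) (NN door)) (. .)))"

print(first_tag_honored(good, "VB"))
print(first_tag_honored(bad, "VB"))
```

A sentence for which the check fails is a candidate for a different tag, for example VBG rather than VB for a word like "Using".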
