Why does Stanford CoreNLP server split named entities into single tokens?

Question

I'm using this command to post the data (mostly copied from the Stanford site):

wget --post-data 'Barack Obama was President of the United States of America in 2016' 'localhost:9000/?properties={"annotators": "ner", "outputFormat": "json"}' -O out.json

The response looks like this:

{
    "sentences": [{
        "index": 0,
        "tokens": [{
            "index": 1,
            "word": "Barack",
            "originalText": "Barack",
            "lemma": "Barack",
            "characterOffsetBegin": 0,
            "characterOffsetEnd": 6,
            "pos": "NNP",
            "ner": "PERSON",
            "before": "",
            "after": " "
        }, {
            "index": 2,
            "word": "Obama",
            "originalText": "Obama",
            "lemma": "Obama",
            "characterOffsetBegin": 7,
            "characterOffsetEnd": 12,
            "pos": "NNP",
            "ner": "PERSON",
            "before": " ",
            "after": " "
        }, {
            "index": 3,
            "word": "was",
            "originalText": "was",
            "lemma": "be",
            "characterOffsetBegin": 13,
            "characterOffsetEnd": 16,
            "pos": "VBD",
            "ner": "O",
            "before": " ",
            "after": " "
        }, {
            "index": 4,
            "word": "President",
            "originalText": "President",
            "lemma": "President",
            "characterOffsetBegin": 17,
            "characterOffsetEnd": 26,
            "pos": "NNP",
            "ner": "O",
            "before": " ",
            "after": " "
        }, {
            "index": 5,
            "word": "of",
            "originalText": "of",
            "lemma": "of",
            "characterOffsetBegin": 27,
            "characterOffsetEnd": 29,
            "pos": "IN",
            "ner": "O",
            "before": " ",
            "after": " "
        }, {
            "index": 6,
            "word": "the",
            "originalText": "the",
            "lemma": "the",
            "characterOffsetBegin": 30,
            "characterOffsetEnd": 33,
            "pos": "DT",
            "ner": "O",
            "before": " ",
            "after": " "
        }, {
            "index": 7,
            "word": "United",
            "originalText": "United",
            "lemma": "United",
            "characterOffsetBegin": 34,
            "characterOffsetEnd": 40,
            "pos": "NNP",
            "ner": "LOCATION",
            "before": " ",
            "after": " "
        }, {
            "index": 8,
            "word": "States",
            "originalText": "States",
            "lemma": "States",
            "characterOffsetBegin": 41,
            "characterOffsetEnd": 47,
            "pos": "NNPS",
            "ner": "LOCATION",
            "before": " ",
            "after": " "
        }, {
            "index": 9,
            "word": "of",
            "originalText": "of",
            "lemma": "of",
            "characterOffsetBegin": 48,
            "characterOffsetEnd": 50,
            "pos": "IN",
            "ner": "LOCATION",
            "before": " ",
            "after": " "
        }, {
            "index": 10,
            "word": "America",
            "originalText": "America",
            "lemma": "America",
            "characterOffsetBegin": 51,
            "characterOffsetEnd": 58,
            "pos": "NNP",
            "ner": "LOCATION",
            "before": " ",
            "after": " "
        }, {
            "index": 11,
            "word": "in",
            "originalText": "in",
            "lemma": "in",
            "characterOffsetBegin": 59,
            "characterOffsetEnd": 61,
            "pos": "IN",
            "ner": "O",
            "before": " ",
            "after": " "
        }, {
            "index": 12,
            "word": "2016",
            "originalText": "2016",
            "lemma": "2016",
            "characterOffsetBegin": 62,
            "characterOffsetEnd": 66,
            "pos": "CD",
            "ner": "DATE",
            "normalizedNER": "2016",
            "before": " ",
            "after": "",
            "timex": {
                "tid": "t1",
                "type": "DATE",
                "value": "2016"
            }
        }]
    }]
}

Am I doing something wrong? I have Java client code that at least recognizes "Barack Obama" and "United States of America" as complete named entities, but the server seems to treat each token separately. Any ideas why?

Answer 1

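You should add the entitymentions annotator to the list of annotators. The ner annotator only labels individual tokens; entitymentions groups consecutive tokens that carry the same label into full entity mentions.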
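
As a rough illustration of that suggestion, here is a minimal Python sketch that sends the same kind of request with entitymentions included and then reads the grouped mentions back. The helper names (build_url, annotate, mention_texts) are made up for this example, the server address is taken from the question, and the assumption is that each sentence in the JSON output carries an "entitymentions" list whose items have "text" and "ner" fields. The annotator chain is spelled out in full here to avoid relying on the server adding prerequisites on its own.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a CoreNLP server is listening here, as in the question.
SERVER = "http://localhost:9000"


def build_url(server=SERVER):
    """Return the request URL with entitymentions in the annotator list."""
    props = {
        "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions",
        "outputFormat": "json",
    }
    # The properties JSON travels in the query string, so percent-encode it.
    return server + "/?properties=" + urllib.parse.quote(json.dumps(props))


def annotate(text, server=SERVER):
    """POST raw text to the server and return the parsed JSON response."""
    request = urllib.request.Request(build_url(server), data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def mention_texts(doc):
    """Collect (text, ner) pairs from every sentence's entitymentions list.

    Assumption: the key is named "entitymentions" and each item has
    "text" and "ner" fields. Sentences without the key contribute nothing.
    """
    return [
        (mention["text"], mention["ner"])
        for sentence in doc.get("sentences", [])
        for mention in sentence.get("entitymentions", [])
    ]
```

With a server running, mention_texts(annotate("Barack Obama was President of the United States of America in 2016")) should then yield whole mentions rather than one entry per token. The per-token "ner" labels under "tokens" stay in the response as before.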

solution1
2 2017-05-15 20:53:47
