Stanford CoreNLP: -nthreads flag causes all ner values to be O

I'm trying to figure out why adding the -nthreads {int} argument to Stanford CoreNLP (version stanford-corenlp-full-2015-12-09) causes all tokens to have an output NER value of O (= not a named entity).

As a simple example, create a file called sample-file.txt whose contents are Samuel Huntington. Then run:

java -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file sample-file.txt -outputFormat json

This will generate the expected output, with "Samuel" recognized as a person:

{
  "sentences": [
    {
      "index": 0,
      "parse": "SENTENCE_SKIPPED_OR_UNPARSABLE",
      "tokens": [
        {
          "index": 1,
          "word": "Samuel",
          "originalText": "Samuel",
          "lemma": "Samuel",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 6,
          "pos": "NNP",
          "ner": "PERSON",
          "before": "",
          "after": " "
        }, ...

If you add -nthreads 8 to the command above, however, the output no longer indicates that Samuel is a person. Full command:

java -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file sample-file.txt -outputFormat json -nthreads 8 

Which generates:

{
  "sentences": [
    {
      "index": 0,
      "parse": "SENTENCE_SKIPPED_OR_UNPARSABLE",
      "tokens": [
        {
          "index": 1,
          "word": "Samuel",
          "originalText": "Samuel",
          "lemma": "Samuel",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 6,
          "pos": "NNP",
          "ner": "O",
          "before": "",
          "after": " "
        },
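
To compare the two runs without reading the JSON by eye, a minimal Python sketch such as the one below could list each token's ner value. It assumes the JSON has the shape shown above (sentences, then tokens, each with word and ner fields). The helper names ner_tags and all_outside are hypothetical and not part of CoreNLP.

```python
import json


def ner_tags(doc):
    """Return (word, ner) pairs for every token in a parsed CoreNLP JSON document."""
    return [
        (token["word"], token["ner"])
        for sentence in doc.get("sentences", [])
        for token in sentence.get("tokens", [])
    ]


def all_outside(doc):
    """True when the document has tokens and every one of them is tagged "O"."""
    tags = ner_tags(doc)
    return bool(tags) and all(ner == "O" for _, ner in tags)


# Trimmed sample that mirrors the -nthreads 8 output above.
sample = json.loads(
    '{"sentences": [{"index": 0, "tokens": '
    '[{"index": 1, "word": "Samuel", "ner": "O"}]}]}'
)

print(ner_tags(sample))
print(all_outside(sample))
```

For a real run, load the JSON file that CoreNLP wrote with json.load and pass the result to the same helpers. A True result from all_outside on text that contains names is the symptom described here.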

For what it's worth, -nthread {int} (that is, thread with no s) resolves the problem, so I can just use that flag instead. I'll leave this question here in case others try using the -nthreads flag, however.

PS Here's my CLASSPATH (obtained from echo $CLASSPATH), which contains only the Stanford CoreNLP distribution I downloaded last week: /Users/dduhaime/Desktop/everett/wiki_facts/stanford-corenlp-full-2015-12-09/*:

I ran this command:

java -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file sample-file.txt -outputFormat json

with sample-file.txt containing just the text "Samuel Huntington", and it found both tokens to be PERSON.

My CLASSPATH contains all the jars in the distribution.

Two things you should start with: re-download the 2015-12-09 distribution (we sometimes make small changes to it before it is finalized) to make sure you have the latest version, and check in your Python code what the CLASSPATH is. Perhaps add an "echo $CLASSPATH ; " before running the java command. Looking at your java command, you aren't setting the CLASSPATH, so it is not entirely clear which CLASSPATH is being used.
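
If the java command is launched from Python, the CLASSPATH check suggested above could look roughly like the sketch below. It only reports the CLASSPATH and prints the command it would run. The function names corenlp_command and classpath_report are hypothetical, and the thread flag is passed in as a plain string taken from the question rather than checked against CoreNLP.

```python
import os
import shlex


def corenlp_command(input_file, thread_flag=None, threads=None):
    """Build the java command from the question as an argument list."""
    cmd = [
        "java", "-Xmx6g", "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner",
        "-file", input_file,
        "-outputFormat", "json",
    ]
    # The flag name is supplied by the caller (for example "-nthreads").
    if thread_flag is not None and threads is not None:
        cmd += [thread_flag, str(threads)]
    return cmd


def classpath_report(env=None):
    """Describe the CLASSPATH that a child java process would inherit."""
    env = os.environ if env is None else env
    value = env.get("CLASSPATH")
    if not value:
        return "CLASSPATH is not set"
    entries = [entry for entry in value.split(os.pathsep) if entry]
    return "CLASSPATH entries:\n" + "\n".join("  " + entry for entry in entries)


if __name__ == "__main__":
    print(classpath_report())
    cmd = corenlp_command("sample-file.txt", "-nthreads", 8)
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Printing the report immediately before launching java shows whether the process sees only the 2015-12-09 distribution or some other set of jars.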

Using -nthread {int} instead of -nthreads {int} produces the expected NER output.
