CoreNLP API equivalent to command line?

For one of our projects, we currently run the syntactic analysis component from the command line. We want to move from this approach to the CoreNLP server (for better performance).

Our command line options are as follows:

java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -escaper edu.stanford.nlp.process.PTBEscapingProcessor  -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory  -outputFormat "wordsAndTags,typedDependenciesCollapsed"

I've tried a few things, but I haven't managed to find the right options when using the CoreNLP API (with Python).

For instance, how do I specify that the text is already tokenised?

I would really appreciate any help.

In general, the server calls into CoreNLP rather than the individual NLP components, so the CoreNLP documentation may be useful. The text to annotate is sent to the server as the POST body; the properties are passed in as URL parameters. For example, for your case, I believe the following curl command should do the trick (and should be easy to adapt to the language of your choice):

curl -X POST -d "it's split on whitespace" \
  'http://localhost:9000/?annotators=tokenize,ssplit,pos,parse&tokenize.whitespace=true&ssplit.eolonly=true'

Note that we're just passing the following properties to the server:

  • annotators = tokenize,ssplit,pos,parse specifies that we want the parser and all its prerequisites.
  • tokenize.whitespace = true calls the whitespace tokenizer, so already-tokenized text is split only on whitespace.
  • ssplit.eolonly = true splits sentences on newlines, and only on newlines.

Other potentially useful options are documented on the parser annotator page.
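Since the question mentions Python, here is a minimal sketch of the same request using only the standard library. It simply mirrors the curl command above: the server address `http://localhost:9000/` is taken from that command, while the names `build_url` and `annotate` are hypothetical helpers made up for illustration and are not part of any CoreNLP client library.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a CoreNLP server is listening here, as in the curl example.
SERVER_URL = "http://localhost:9000/"

# The same three properties that the curl command passes as URL parameters.
PROPERTIES = {
    "annotators": "tokenize,ssplit,pos,parse",
    "tokenize.whitespace": "true",  # input is already tokenized
    "ssplit.eolonly": "true",       # one sentence per line
}


def build_url(base_url=SERVER_URL, properties=PROPERTIES):
    """Append the properties to the server URL as query parameters."""
    # safe="," leaves the commas in the annotator list unescaped,
    # so the result matches the URL used with curl.
    return base_url + "?" + urlencode(properties, safe=",")


def annotate(text, url=None, timeout=60):
    """POST the text as the request body and return the raw response text."""
    request = Request(url or build_url(), data=text.encode("utf-8"), method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `annotate("it's split on whitespace")` should send the same request as the curl command; separate multiple sentences with newlines, since sentence splitting is restricted to line ends.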
