How to use OpenNLP to get POS tags in R?

Here is the R code:

library(NLP)
library(openNLP)

tagPOS <- function(x, ...) {
    s <- as.String(x)
    word_token_annotator <- Maxent_Word_Token_Annotator()
    # Treat the whole input as a single sentence, then tokenize it into words
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    # Add POS tags to the word annotations
    a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}

str <- "this is a the first sentence."
tagged_str <- tagPOS(str)

The output is:

tagged_str$POStagged
[1] "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

Now I want to extract only the NN word (i.e. "sentence") from the sentence above and store it in a variable. Can anyone help me with this?

Here is a more general solution, where you describe the Treebank tag you want to extract using a regular expression. In your case, for instance, "NN" returns all noun types (e.g. NN, NNS, NNP, NNPS), while "NN$" returns just NN.

It operates on a character type, so if you have several texts (for example in a list), you will need to lapply() it over them, as in the examples below.

library(NLP)
library(openNLP)

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

extractPOS <- function(x, thisPOSregex) {
    x <- as.String(x)
    wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
    POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
    POSwords <- subset(POSAnnotation, type == "word")
    tags <- sapply(POSwords$features, '[[', "POS")
    thisPOSindex <- grep(thisPOSregex, tags)
    tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
    untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
    untokenizedAndTagged
}

lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
## 
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
## 
## [[2]]
## [1] ""
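
The difference between the two patterns can be checked on a plain vector of tags, without loading OpenNLP. This is only a sketch: the tags vector below is a hand-written sample of Penn Treebank tags, not output from the tagger, and the anchored pattern "^NNP?S$" is an extra illustration that does not appear in the answer above.

```r
# Hand-written sample of Penn Treebank tags (an assumption, not tagger output)
tags <- c("DT", "VBZ", "NN", "NNS", "NNP", "NNPS", "JJ")

# Unanchored pattern: matches every tag that contains "NN"
all_nouns <- grep("NN", tags, value = TRUE)

# Anchored at the end: matches only tags that end in "NN"
singular_common <- grep("NN$", tags, value = TRUE)

# Illustrative extra pattern: plural nouns only (NNS and NNPS)
plural_nouns <- grep("^NNP?S$", tags, value = TRUE)

all_nouns
singular_common
plural_nouns
```

Because grep() matches anywhere in the string by default, anchoring with "^" and "$" is the safest way to select exactly one tag.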

Here is another answer that uses the spaCy parser and tagger from Python, with the spacyr package to call it from R.

This library is orders of magnitude faster and almost as good as the Stanford NLP models. It is still incomplete for some languages, but for English it is a pretty good and promising option.

You first need to have Python installed, along with spaCy and a language module. Instructions are available from the spaCy page and here.

Then:

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: en)

spacy_parse(txt, pos = TRUE, tag = TRUE)
##    doc_id sentence_id token_id   token   lemma   pos tag   entity
## 1   text1           1        1    This    this   DET  DT         
## 2   text1           1        2      is      be  VERB VBZ         
## 3   text1           1        3       a       a   DET  DT         
## 4   text1           1        4   short   short   ADJ  JJ         
## 5   text1           1        5 tagging tagging  NOUN  NN         
## 6   text1           1        6 example example  NOUN  NN         
## 7   text1           1        7       ,       , PUNCT   ,         
## 8   text1           1        8      by      by   ADP  IN         
## 9   text1           1        9    John    john PROPN NNP PERSON_B
## 10  text1           1       10     Doe     doe PROPN NNP PERSON_I
## 11  text1           1       11       .       . PUNCT   .         
## 12  text2           1        1     Too     too   ADV  RB         
## 13  text2           1        2     bad     bad   ADJ  JJ         
## 14  text2           1        3 OpenNLP opennlp PROPN NNP         
## 15  text2           1        4      is      be  VERB VBZ         
## 16  text2           1        5      so      so   ADV  RB         
## 17  text2           1        6    slow    slow   ADJ  JJ         
## 18  text2           1        7      on      on   ADP  IN         
## 19  text2           1        8   large   large   ADJ  JJ         
## 20  text2           1        9   texts    text  NOUN NNS         
## 21  text2           1       10       .       . PUNCT   . 
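
Since the parse result is a data frame, pulling out the NN tokens should come down to ordinary subsetting on the tag column. The sketch below assumes that structure and uses a small hand-built data frame (parsed, a hypothetical stand-in containing a few rows copied from the output above) so that it runs without spaCy installed.

```r
# Hypothetical stand-in for a few rows of the spacy_parse() output above
parsed <- data.frame(
    doc_id = c("text1", "text1", "text1", "text1", "text2", "text2"),
    token  = c("short", "tagging", "example", "John", "OpenNLP", "texts"),
    tag    = c("JJ", "NN", "NN", "NNP", "NNP", "NNS"),
    stringsAsFactors = FALSE
)

# Tokens tagged exactly NN, stored in a variable
nn_tokens <- parsed$token[parsed$tag == "NN"]

# All noun tokens (any tag starting with NN), grouped by document
is_noun <- grepl("^NN", parsed$tag)
nouns_by_doc <- split(parsed$token[is_noun], parsed$doc_id[is_noun])

nn_tokens
nouns_by_doc
```

With the real result, the same two subsetting steps would apply to the object returned by spacy_parse(txt, pos = TRUE, tag = TRUE).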

There might be more elegant ways to obtain the result, but this one should work:

q <- strsplit(unlist(tagged_str[1]),'/NN')
q <- tail(strsplit(unlist(q[1])," ")[[1]],1)
#> q
#[1] "sentence"
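
The approach above returns a single word. If the sentence may contain several NN words, one alternative is to split the tagged string into word and tag vectors first and then filter. This is a base R sketch, not part of the original answer; the tagged string is copied from the question's output, and the variable names are illustrative.

```r
# Tagged string in the "word/TAG" format shown in the question
tagged <- "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

# One element per "word/TAG" token
tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]

# Split each token at its last slash into word and tag
words <- sub("/[^/]*$", "", tokens)
tags  <- sub("^.*/", "", tokens)

# Keep every word whose tag is exactly NN
nn_words <- words[tags == "NN"]
nn_words
```

Comparing tags with == avoids accidentally picking up NNS, NNP, or NNPS words, which a plain split on "/NN" would not distinguish.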

Hope this helps.
