Creating POS tags for single words/tokens in R

I am looking for a way to create POS tags for single words/tokens from a list I have in R. I know that the accuracy will decrease if I do it for single tokens instead of sentences but the data I have are "delete edits" from Wikipedia and people mostly delete single, unconnected words instead of whole sentences. I have seen this question a few times for Python but I haven't found a solution for it in R yet.

My data will look somewhat like this:

Tokens <- list(c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?"))

And ideally, I would like to have something like this returned:

1976                   CD
green                  JJ
Normandy               NN
coast                  NN
[                      x
[                      x
template               NN
]                      x
]                      x
Fish                   NN
visiting               VBG
England                NN
?                      x

I found some websites that do this online, but I doubt they are running anything in R. They also specifically state NOT to use them on single words/tokens.

My question is thus: Is it possible to do this in R with reasonable accuracy? What would the code look like if it should not take sentence structure into account? Would it be easier to just compare the lists against a huge tagged dictionary?
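To make the dictionary idea concrete, here is a minimal sketch of what such a lookup could look like in base R. The `lexicon` vector is a hypothetical toy example with hand-picked entries, not a real tagged corpus, and the two fallback rules (digits become CD, symbols become x) are assumptions chosen to match the desired output above.

```r
# Toy lexicon: a named character vector mapping lower-cased words to tags.
# The entries are illustrative only, not taken from a real tagged corpus.
lexicon <- c("green" = "JJ", "coast" = "NN", "template" = "NN", "fish" = "NN")

tokens <- c("1976", "green", "coast", "[", "Fish", "visting")

# Look up each token by name; tokens missing from the lexicon yield NA
tags <- unname(lexicon[tolower(tokens)])

# Simple fallbacks: all-digit tokens become CD,
# tokens without any letter or digit become "x"
tags[is.na(tags) & grepl("^[0-9]+$", tokens)] <- "CD"
tags[is.na(tags) & !grepl("[[:alnum:]]", tokens)] <- "x"

tagged <- data.frame(token = tokens, tag = tags, stringsAsFactors = FALSE)
print(tagged)
```

Words that are not in the lexicon, such as the misspelled "visting", stay NA here, which is the main weakness of a pure lookup.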

In general, there is no decent POS tagger in native R, and all possible solutions rely on outside libraries. As one such solution, you can try our package spacyr, which uses spaCy as the backend. It is not on CRAN yet, but will be soon.

https://github.com/kbenoit/spacyr

The sample code is like this:

library(spacyr)
spacy_initialize()

Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]",
            "Fish","visting","England","?")
spacy_parse(Tokens, tag = TRUE)

and the output is like this:

   doc_id sentence_id token_id    token    lemma   pos   tag entity
1   text1           1        1     1976     1976   NUM    CD DATE_B
2   text2           1        1    green    green   ADJ    JJ       
3   text3           1        1 Normandy normandy PROPN   NNP  ORG_B
4   text4           1        1    coast    coast  NOUN    NN       
5   text5           1        1        [        [ PUNCT -LRB-       
6   text6           1        1        [        [ PUNCT -LRB-       
7   text7           1        1 template template  NOUN    NN       
8   text8           1        1        ]        ] PUNCT -RRB-       
9   text9           1        1        ]        ] PUNCT -RRB-       
10 text10           1        1     Fish     fish  NOUN    NN       
11 text11           1        1  visting     vist  VERB   VBG       
12 text12           1        1  England  england PROPN   NNP  GPE_B
13 text13           1        1        ?        ? PUNCT     .   

Although the package can do more, you can find what you need in the tag field.
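As a rough sketch of how the result could be reduced to the token/tag pairs asked for in the question: the `parsed` data frame below is a hand-built stand-in containing a few rows copied from the output above, so the snippet runs without spaCy installed. Replacing punctuation tags with "x" is an assumption based on the desired output in the question, not something spacyr does itself.

```r
# Hand-built stand-in for part of the spacy_parse() result shown above,
# so that this snippet runs without spaCy installed.
parsed <- data.frame(
  token = c("1976", "green", "Normandy", "[", "?"),
  pos   = c("NUM", "ADJ", "PROPN", "PUNCT", "PUNCT"),
  tag   = c("CD", "JJ", "NNP", "-LRB-", "."),
  stringsAsFactors = FALSE
)

# Keep only token and tag, replacing the tags of punctuation tokens with "x"
result <- data.frame(
  token = parsed$token,
  tag   = ifelse(parsed$pos == "PUNCT", "x", parsed$tag),
  stringsAsFactors = FALSE
)
print(result)
```

With a real parse, the same two columns can be taken from the data frame returned by spacy_parse(Tokens, tag = TRUE) in place of the stand-in.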

NOTE: (2017-05-20)

The spacyr package is now on CRAN, but that version has some issues with non-ASCII characters. We noticed the issue after the CRAN submission and fixed it in the version on GitHub. If you are planning to use it for German texts, please install the latest master from GitHub:

devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)

This fix will be incorporated into the CRAN package in the next update.

NOTE2:

There are detailed instructions for installing spaCy and spacyr on Windows and Mac.

Windows: https://github.com/kbenoit/spacyr/blob/master/inst/doc/WINDOWS.md

Mac: https://github.com/kbenoit/spacyr/blob/master/inst/doc/MAC.md

Here are the steps I took to make amatsuo_net's suggestion work for me:

  1. Installing spaCy and the English language model in Anaconda:

    Open Anaconda prompt as Admin

    execute:

    activate py36

    conda config --add channels conda-forge

    conda install spacy

    python -m spacy link en_core_web_sm en

  2. Using the wrapper in RStudio:

    install.packages("fastmatch")

    install.packages("RcppParallel")

    library(fastmatch)

    library(RcppParallel)

    devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)

    library(spacyr)

    spacy_initialize(condaenv = "py36")

    Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?");Tokens

    spacy_parse(Tokens, tag = TRUE)
