
Creating POS tags for single words/tokens in R

I am looking for a way to create POS tags for single words/tokens from a list I have in R. I know that the accuracy will decrease if I do it for single tokens instead of sentences, but the data I have are "delete edits" from Wikipedia, and people mostly delete single, unconnected words instead of whole sentences. I have seen this question a few times for Python, but I haven't found a solution for it in R yet.

My data will look somewhat like this:

Tokens <- list(c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?"))

And ideally, I would like to have something like this returned:

1976                   CD
green                  JJ
Normandy               NN
coast                  NN
[                      x
[                      x
template               NN
]                      x
]                      x
Fish                   NN
visiting               VBG
England                NN
?                      x

I found some websites doing this online, but I doubt that they are running anything in R. They also specifically state NOT to use them on single words/tokens.

My question thus: Is it possible to do this in R with reasonable accuracy? What would the code look like if it did not incorporate sentence structure? Would it be easier to just compare the lists to a huge tagged dictionary?
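To illustrate the dictionary-lookup idea from the question, here is a minimal sketch in base R. The lexicon and the `tag_tokens` helper are hypothetical, and a real version would need a large tagged wordlist; this only shows the mechanics of per-token lookup with simple fallbacks for numbers, punctuation, and unknown words.

```r
# Tiny illustrative lexicon (a real one would come from a tagged corpus).
lexicon <- c(green = "JJ", coast = "NN", template = "NN",
             England = "NNP", visiting = "VBG")

# Hypothetical helper: tag each token by dictionary lookup, with
# regex-based fallbacks for numbers and punctuation.
tag_tokens <- function(tokens, lexicon) {
  tags <- lexicon[tokens]                       # direct lexicon lookup
  tags[grepl("^[0-9]+$", tokens)] <- "CD"       # pure digits -> cardinal number
  tags[grepl("^[[:punct:]]+$", tokens)] <- "x"  # punctuation -> "x"
  tags[is.na(tags)] <- "NN"                     # unknown words -> default noun
  data.frame(token = tokens, tag = unname(tags), stringsAsFactors = FALSE)
}

tag_tokens(c("1976", "green", "[", "England"), lexicon)
```

This approach is fast but context-free, so it cannot disambiguate words like "Fish" (noun vs. verb), which is exactly where a model-based tagger such as spaCy helps.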

In general, there is no decent POS tagger in native R, and all viable solutions rely on outside libraries. As one such solution, you can try our package spacyr, which uses spaCy as its backend. It's not on CRAN yet, but it will be soon.

https://github.com/kbenoit/spacyr

The sample code is like this:

library(spacyr)
spacy_initialize()

Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]",
            "Fish","visting","England","?")
spacy_parse(Tokens, tag = TRUE)

and the output is like this:

   doc_id sentence_id token_id    token    lemma   pos   tag entity
1   text1           1        1     1976     1976   NUM    CD DATE_B
2   text2           1        1    green    green   ADJ    JJ       
3   text3           1        1 Normandy normandy PROPN   NNP  ORG_B
4   text4           1        1    coast    coast  NOUN    NN       
5   text5           1        1        [        [ PUNCT -LRB-       
6   text6           1        1        [        [ PUNCT -LRB-       
7   text7           1        1 template template  NOUN    NN       
8   text8           1        1        ]        ] PUNCT -RRB-       
9   text9           1        1        ]        ] PUNCT -RRB-       
10 text10           1        1     Fish     fish  NOUN    NN       
11 text11           1        1  visting     vist  VERB   VBG       
12 text12           1        1  England  england PROPN   NNP  GPE_B
13 text13           1        1        ?        ? PUNCT     .   

Although the package can do more, you can find what you need in the `tag` field.
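Since `spacy_parse()` returns a regular data.frame, pulling out just the token/tag pairs is plain subsetting. The frame below is a toy stand-in shaped like the output above, used only so the example runs without a spaCy backend installed:

```r
# Toy stand-in for the spacy_parse() result shown above (illustrative only).
parsed <- data.frame(
  doc_id = c("text1", "text2", "text5"),
  token  = c("1976", "green", "["),
  pos    = c("NUM", "ADJ", "PUNCT"),
  tag    = c("CD", "JJ", "-LRB-"),
  stringsAsFactors = FALSE
)

# Keep just the token/tag pairs the question asks for:
parsed[, c("token", "tag")]
```

The same subsetting works on the real `spacy_parse()` output, which simply has more rows and columns.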

NOTE (2017-05-20):

The spacyr package is now on CRAN, but that version has some issues with non-ASCII characters. We recognized the issue after the CRAN submission and resolved it in the version on GitHub. If you are planning to use it for German texts, please install the latest master from GitHub:

devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)

This fix will be incorporated into the CRAN package in the next update.

NOTE 2:

There are detailed instructions for installing spaCy and spacyr on Windows and Mac:

Windows: https://github.com/kbenoit/spacyr/blob/master/inst/doc/WINDOWS.md

Mac: https://github.com/kbenoit/spacyr/blob/master/inst/doc/MAC.md

Here are the steps I took to make amatsuo_net's suggestion work for me:

  1. Installing spaCy and the English language library for Anaconda:

    Open the Anaconda prompt as Admin

    execute:

    activate py36

    conda config --add channels conda-forge

    conda install spacy

    python -m spacy link en_core_web_sm en

  2. Using the wrapper in RStudio:

    install.packages("fastmatch")
    install.packages("RcppParallel")

    library(fastmatch)
    library(RcppParallel)

    devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)

    library(spacyr)

    spacy_initialize(condaenv = "py36")

    Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?")

    spacy_parse(Tokens, tag = TRUE)
