
How to keep hashtags and their words as a single token

How can I change the default tokenisation so that the hashtag symbol and its word stay together (i.e. #company, not # and company)?

library(udpipe)
library(data.table)

# Load the model once, directly from its file path
ud_model <- udpipe_load_model("D:/Users/asongara/Documents/english-ewt-ud-2.3-181115.udpipe")
anno_op3 <- udpipe_annotate(ud_model, 
                            "This is a better #company than i thought @mr_jones!", 
                            tokenizer = "tokenizer", 
                            tagger = "default", 
                            trace = TRUE)

anno_op3 <- as.data.table(as.data.frame(anno_op3))

View(anno_op3)

What I get is # and company as two separate tokens, but I want #company as a single token. By contrast, @mr_jones already comes out as a single token.

You can combine other tokenisation tools with the udpipe R package, as shown at https://bnosac.github.io/udpipe/docs/doc2.html. For example, the code below uses a tokeniser specific to Twitter messages, and udpipe then does the parts-of-speech tagging, morphological feature annotation and dependency parsing on those tokens.

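library(tokenizers)
library(udpipe)
# Tokenise with a Twitter-aware tokeniser so #hashtags and @mentions stay intact
x <- tokenize_tweets(c("#rstats is a programming_language", "you can combine the #tokenizers package with @udpipe parsing"), 
                     lowercase = FALSE, strip_punct = FALSE)
# Put one token per line: the format expected by tokenizer = "vertical"
x <- sapply(x, FUN = function(tokens) paste(tokens, collapse = "\n"))
# Skip udpipe's own tokeniser; tag, lemmatise and parse the supplied tokens
x <- udpipe(x, "english-ewt", tokenizer = "vertical", trace = TRUE)
x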
 doc_id paragraph_id sentence_id sentence start end term_id token_id                token                lemma upos xpos                                                  feats head_token_id  dep_rel deps misc
   doc1            1           1     <NA>     1   7       1        1              #rstats               #rstat PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs             4    nsubj <NA> <NA>
   doc1            1           1     <NA>     9  10       2        2                   is                   be  AUX  VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4      cop <NA> <NA>
   doc1            1           1     <NA>    12  12       3        3                    a                    a  DET   DT                              Definite=Ind|PronType=Art             4      det <NA> <NA>
   doc1            1           1     <NA>    14  33       4        4 programming_language programming_language NOUN   NN                                            Number=Sing             0     root <NA> <NA>
   doc2            1           1     <NA>     1   3       1        1                  you                  you PRON  PRP                         Case=Nom|Person=2|PronType=Prs             3    nsubj <NA> <NA>
   doc2            1           1     <NA>     5   7       2        2                  can                  can  AUX   MD                                           VerbForm=Fin             3      aux <NA> <NA>
   doc2            1           1     <NA>     9  15       3        3              combine              combine VERB   VB                                           VerbForm=Inf             0     root <NA> <NA>
   doc2            1           1     <NA>    17  19       4        4                  the                  the  DET   DT                              Definite=Def|PronType=Art             6      det <NA> <NA>
   doc2            1           1     <NA>    21  31       5        5          #tokenizers           #tokenizer NOUN  NNS                                            Number=Plur             6 compound <NA> <NA>
   doc2            1           1     <NA>    33  39       6        6              package              package NOUN   NN                                            Number=Sing             3      obj <NA> <NA>
   doc2            1           1     <NA>    41  44       7        7                 with                 with  ADP   IN                                                   <NA>             9     case <NA> <NA>
   doc2            1           1     <NA>    46  52       8        8              @udpipe              @udpipe NOUN   NN                                            Number=Sing             9 compound <NA> <NA>
   doc2            1           1     <NA>    54  60       9        9              parsing              parsing NOUN   NN                                            Number=Sing             6     nmod <NA> <NA>
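
If tokenize_tweets is not available in your installed version of the tokenizers package, the same idea can be sketched in base R. The helper name tokenize_keep_tags and its regular expression are assumptions made up for this illustration, not part of udpipe or tokenizers, and the pattern only covers plain ASCII words, so treat it as a starting point rather than a complete tweet tokeniser.

```r
# Hypothetical helper (not from udpipe or tokenizers): split text into tokens
# while keeping #hashtags and @mentions attached to their words.
tokenize_keep_tags <- function(text) {
  # Alternatives are tried in order at each position:
  #   1. '#' or '@' followed by word characters
  #   2. a plain run of word characters
  #   3. any single non-space, non-word character (punctuation)
  pattern <- "[#@][A-Za-z0-9_]+|[A-Za-z0-9_]+|[^\\sA-Za-z0-9_]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

txt <- "This is a better #company than i thought @mr_jones!"
tokens <- tokenize_keep_tags(txt)

# One token per line, the layout that tokenizer = "vertical" reads
vertical <- vapply(tokens, paste, character(1), collapse = "\n")

# With udpipe installed and the english-ewt model available, the
# pre-tokenised text can then be annotated as in the answer above:
# library(udpipe)
# anno <- udpipe(vertical, "english-ewt", tokenizer = "vertical")
```

Here tokens[[1]] holds #company and @mr_jones as single elements, so udpipe receives them as single tokens when the vertical tokeniser is used.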
