繁体   English   中英

如何标记R中不在词典中的单词?

[英]How to tokenize words which are not in the dictionary in R?

我正在处理一组数据,我需要对这些数据进行标记化以进行培训。 在进行标记化之前,我已经创建了一个词典,以便需要这样检索词典中存在的那些单词。

我的文本文件如下:

t <- "In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments.  These incisions may be placed in different parts of the abdominal wall.  Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length.  There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities.  Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."

我的词典中包含以下单词:

dict <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

现在,我对文档中的单词应用了双字标记。 为此,我使用了以下代码:

#Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,PlainTextDocument)

#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- TermDocumentMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))

但是我得到的输出是这样的:

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                            Docs
Terms                            character(0)
hand-held surgical instruments            0
intensive care unit                       0
traditional techniques                    1

但是我需要使用双字母组标记那些不在词典中的单词。 谁能帮我吗?

您需要检查字典的作用。 它仅返回字典中的单词。

字典:要制表的字符向量。 结果中不会列出其他术语 默认为NULL,这意味着doc中的所有术语都被列出。

您可以使用以下代码。 请注意,removePunctuation还会删除“手持式”之间的连字符。 也没有必要。 令牌生成器还是会删除大部分标点。

编辑:基于评论

#Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,PlainTextDocument)

#Tokenizers
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# dictionary bigrams removed.
tdm_bigram_no_dict <- TermDocumentMatrix(corpus,control=list(stopwords = BigramTokenizer(dict), tokenize = BigramTokenizer))
# dictionary bigrams from corpus
tdm_bigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = BigramTokenizer, dictionary = dict))
inspect(tdm_bigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1

# dictionary trigrams from corpus
tdm_trigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = TrigramTokenizer, dictionary = dict))
inspect(tdm_trigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            1
  intensive care unit                       0
  traditional techniques                    0

# combine term document matrices into one. you can use rbind since tdm's are sparse matrices. If you want extra speed, look into the slam package.
tdm_total <- rbind(tdm_bigram_no_dict, tdm_bigram_dict, tdm_trigram_dict)

由于在哪里使用rowbind,根据字典结果,那里会有双记录。 但是进一步处理数据,您可以像这样将它们转换为数据框,并使用dplyr将它们分组为一行:

library(dplyr)    
df <- data.frame(terms = rownames(as.matrix(tdm_total)),   freq = rowSums(as.matrix(tdm_total)), row.names = NULL, stringsAsFactors = FALSE)
df <- df %>% group_by(terms) %>% summarise(sum(freq))

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM