
How to tokenize words which are not in the dictionary in R?

I am working on a data set that I need to tokenize before training. Before tokenizing, I created a dictionary, because the multi-word terms in that dictionary must be retrieved as they are, as single tokens.

My text is given below:

t <- "In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments.  These incisions may be placed in different parts of the abdominal wall.  Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length.  There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities.  Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."

My dictionary contains these terms:

dict <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

I then applied bigram tokenization to the words in the document, using the following code:

library(tm)     # Corpus, tm_map, TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

# Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)

# Bigram tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, dictionary = dict))

But this is the output I get:

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                            Docs
Terms                            character(0)
hand-held surgical instruments            0
intensive care unit                       0
traditional techniques                    1

But I also need the words that are not in the dictionary, tokenized as bigrams. Can anyone help me, please?

You need to check what the dictionary option does: it returns only the terms that are in the dictionary.

dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL which means that all terms in doc are listed.
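To make that behaviour concrete, here is a minimal base-R sketch (no tm or RWeka needed) that imitates it on a shortened sentence. The variable names and the short sample text are made up for illustration, and the hand-rolled bigram builder is only a stand-in for NGramTokenizer, not a reimplementation of it.

```r
# Illustrative sample, not the full text from the question.
sample_text <- "traditional techniques are widely available"
sample_dict <- c("traditional techniques", "intensive care unit")

# Split into lowercase words and pair each word with its successor.
words   <- strsplit(tolower(sample_text), "[[:space:]]+")[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))

# Imitates `dictionary = dict`: only dictionary terms are listed,
# including those that never occur (count 0).
in_dict <- table(factor(bigrams, levels = sample_dict))

# The complement the question asks for: bigrams that are not dictionary terms.
not_in_dict <- table(bigrams[!bigrams %in% sample_dict])

print(in_dict)
print(not_in_dict)
```

The first table has one row per dictionary term and nothing else, which matches the output in the question; the second table holds everything the dictionary filter discards.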

What you could use is the code below. Beware that removePunctuation also removes the hyphen in "hand-held", so the dictionary term "hand-held surgical instruments" can no longer match. There is also no need for that step, because the tokenizer removes most of the punctuation anyway.
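The hyphen problem is easy to check in base R. The sketch below assumes that removePunctuation, with its default settings, strips every character in the POSIX class [[:punct:]]; the gsub call stands in for it and is not tm's own code.

```r
term <- "hand-held surgical instruments"

# Strip all punctuation characters, hyphens included.
stripped <- gsub("[[:punct:]]+", "", term)

print(stripped)                  # "handheld surgical instruments"
print(identical(stripped, term)) # FALSE: the dictionary term no longer matches
```

If punctuation removal is still wanted, the tm documentation for removePunctuation describes a preserve_intra_word_dashes argument; check that your installed version supports it before relying on it.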

EDIT: based on comment

#Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,PlainTextDocument)

#Tokenizers
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# dictionary bigrams removed.
tdm_bigram_no_dict <- TermDocumentMatrix(corpus,control=list(stopwords = BigramTokenizer(dict), tokenize = BigramTokenizer))
# dictionary bigrams from corpus
tdm_bigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = BigramTokenizer, dictionary = dict))
inspect(tdm_bigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1

# dictionary trigrams from corpus
tdm_trigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = TrigramTokenizer, dictionary = dict))
inspect(tdm_trigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            1
  intensive care unit                       0
  traditional techniques                    0

# combine term document matrices into one. you can use rbind since tdm's are sparse matrices. If you want extra speed, look into the slam package.
tdm_total <- rbind(tdm_bigram_no_dict, tdm_bigram_dict, tdm_trigram_dict)

Since we are using rbind, the dictionary terms will appear more than once in the combined matrix, one row per source matrix. When working further with the data, you can convert it to a data frame and use dplyr to collapse those duplicates into a single row per term:

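library(dplyr)
# One row per term with its total count across the single document
df <- data.frame(terms = rownames(as.matrix(tdm_total)), freq = rowSums(as.matrix(tdm_total)), row.names = NULL, stringsAsFactors = FALSE)
# Collapse duplicated terms and keep a readable column name for the sum
df <- df %>% group_by(terms) %>% summarise(freq = sum(freq))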
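If you would rather avoid the dplyr dependency, the same collapse can be sketched in base R with aggregate. The data frame below is a made-up stand-in for df before grouping, with the duplicated rows that rbind would produce; the counts are illustrative, not taken from the real matrices.

```r
# Hypothetical stand-in for the combined data frame.
df_demo <- data.frame(
  terms = c("abdominal wall",
            "hand-held surgical instruments", "traditional techniques",
            "hand-held surgical instruments", "traditional techniques"),
  freq  = c(1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Sum the frequencies of rows that share the same term.
totals <- aggregate(freq ~ terms, data = df_demo, FUN = sum)
print(totals)
```

Each term ends up on a single row, with the counts from the separate matrices added together.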
