How to tokenize words which are not in the dictionary in R?

I am working on a data set that I need to tokenize for training. Before tokenizing, I created a dictionary so that the phrases present in the dictionary are retrieved as they are.

My text file is given below:

t <- "In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments.  These incisions may be placed in different parts of the abdominal wall.  Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length.  There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities.  Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."

My dictionary includes words:

dict <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

I have now applied bigram tokenization to the words in the document, using the following code:

library(tm)
library(RWeka)

#Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,PlainTextDocument)

#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- TermDocumentMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))

But I am getting the output as this:

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1

But I also need to tokenize, as bigrams, the words that are not present in the dictionary. Can anyone help me, please?

You need to check what the dictionary argument does: it only returns the terms that are in the dictionary.

dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL which means that all terms in doc are listed.
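
To see the difference, a minimal sketch (reusing the corpus and BigramTokenizer defined in the question): leaving out the dictionary argument tabulates every bigram found in the document.

# Without the dictionary argument, all bigrams in the corpus are listed
tdm_all_bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm_all_bigrams)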

What you could use is the following code. Beware that removePunctuation also removes the hyphen in "hand-held". There is no need for it anyway: the tokenizer removes most of the punctuation itself.
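
For example, the default removePunctuation turns the phrase into one that no longer matches the dictionary entry:

removePunctuation("hand-held surgical instruments")
# [1] "handheld surgical instruments"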

EDIT: based on comment

#Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,PlainTextDocument)

#Tokenizers
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# dictionary bigrams removed.
tdm_bigram_no_dict <- TermDocumentMatrix(corpus,control=list(stopwords = BigramTokenizer(dict), tokenize = BigramTokenizer))
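# The stopwords trick works because BigramTokenizer(dict) expands each
# dictionary phrase into its bigrams (assuming RWeka's default delimiters,
# which keep "hand-held" as one token), e.g.:
#   BigramTokenizer("hand-held surgical instruments")
#   [1] "hand-held surgical"   "surgical instruments"
# and those bigrams are then dropped from the term-document matrix.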
# dictionary bigrams from corpus
tdm_bigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = BigramTokenizer, dictionary = dict))
inspect(tdm_bigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1

# dictionary trigrams from corpus
tdm_trigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = TrigramTokenizer, dictionary = dict))
inspect(tdm_trigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            1
  intensive care unit                       0
  traditional techniques                    0

# Combine the term-document matrices into one. You can use rbind since TDMs are sparse matrices. If you want extra speed, look into the slam package.
tdm_total <- rbind(tdm_bigram_no_dict, tdm_bigram_dict, tdm_trigram_dict)

Since we are using rbind, there will be duplicate terms in the result because of the dictionary matrices. When working further with the data, you can transform the combined matrix into a data frame like so and use dplyr to collapse the duplicates into a single row per term:

library(dplyr)
df <- data.frame(terms = rownames(as.matrix(tdm_total)),
                 freq = rowSums(as.matrix(tdm_total)),
                 row.names = NULL, stringsAsFactors = FALSE)
df <- df %>% group_by(terms) %>% summarise(freq = sum(freq))
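
If the matrices get large, the slam package mentioned above can compute the row sums on the sparse matrix directly, avoiding the as.matrix() conversion. A minimal sketch, assuming the rbind result is still a sparse simple_triplet_matrix:

library(slam)
# row_sums() works on the sparse matrix, so no dense conversion is needed
df_sparse <- data.frame(terms = rownames(tdm_total),
                        freq = row_sums(tdm_total),
                        row.names = NULL, stringsAsFactors = FALSE)
df_sparse <- df_sparse %>% group_by(terms) %>% summarise(freq = sum(freq))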
