Suppose I am parsing an English corpus with the tm package, and I do the usual cleaning steps.
library(tm)
data("crude")
corpus <- crude  # data("crude") already loads a ready-made VCorpus
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
# text matrices
tdm <- TermDocumentMatrix(corpus)
dtm <- DocumentTermMatrix(corpus)
How do I identify the words written in a language different from that of the corpus? A similar problem is addressed with Python here, but my research did not produce any interesting results.
This is not a complete solution, but it might help. I recently had to do something similar, where I needed to remove words containing Chinese characters from a corpus. I ended up using a custom transformation with a regex that replaces anything containing a character outside a-z, A-Z, and 0-9.
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
For example, if there is a Chinese word in there, it gets removed.
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = 'English 象形字 Chinese',
replacement = "",
ignore.case = TRUE,
perl = TRUE)
Output: "English  Chinese" (note the leftover double space, which stripWhitespace will collapse)
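If the goal is to identify the foreign words rather than delete them, a rough sketch is below, under the assumption that "foreign" simply means "contains a non-ASCII character" (which works for scripts like Chinese, though it would also flag accented Latin words):

```r
# Flag tokens containing any character outside the ASCII range.
# Assumption: non-English ~ non-ASCII; fine for Chinese, too broad for Spanish.
words <- c("English", "象形字", "Chinese")
is_foreign <- grepl("[^\\x01-\\x7F]", words, perl = TRUE)
words[is_foreign]  # "象形字"
```

You could apply this to `Terms(tdm)` to list every suspect word in the corpus at once.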
It's trickier if you are trying to remove words from a language like Spanish because some letters have an accent while others don't. For example, this doesn't work completely, but maybe it's a start.
gsub(pattern = '[a-zA-Z0-9]+[^a-zA-Z0-9\\s]+[a-zA-Z0-9]+',
x = 'El jalapeño es caliente',
replacement = "",
ignore.case = TRUE,
perl = TRUE)
Output: "El  es caliente" (again with a double space where the word was removed)
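One possible workaround for accented languages (an assumption on my part, not something I have tested against a full Spanish corpus) is to allow any Latin-script letter via the Unicode property `\p{Latin}`, which covers ñ, é, and friends, so that only non-Latin scripts get stripped:

```r
# Sketch: [^\p{Latin}0-9\s] matches runs of characters that are neither
# Latin-script letters, digits, nor whitespace; \p{Latin} needs perl = TRUE.
gsub(pattern = '[^\\p{Latin}0-9\\s]+',
     x = 'El jalapeño es caliente 象形字',
     replacement = " ",
     perl = TRUE)
```

This keeps "jalapeño" intact while removing the Chinese characters; a follow-up stripWhitespace pass cleans up the spacing.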
Hope this helps!