How to detect foreign words in a corpus?

Suppose I am parsing an English corpus with the tm package, and I do the usual cleaning steps.

library(tm)
data("crude")
corpus <- crude  # crude is already a corpus object; Corpus() expects a Source

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)

# text matrices
tdm <- TermDocumentMatrix(corpus)
dtm <- DocumentTermMatrix(corpus)

How do I identify the words written in a language other than that of the corpus? A similar problem has been discussed for Python, but my research did not produce any useful results.

This is not a complete solution, but I feel like it might help. I recently had to do something similar, where I had to remove words written in Chinese characters from a corpus. I ended up using a custom transformation with a regex that replaces every run of characters other than a-z, A-Z, 0-9, or whitespace with a space.

corpus <- tm_map(corpus, content_transformer(function(s){
  gsub(pattern = '[^a-zA-Z0-9\\s]+',
       x = s,
       replacement = " ",
       ignore.case = TRUE,
       perl = TRUE)
}))

For example, if there is a Chinese word in there, it gets removed.

gsub(pattern = '[^a-zA-Z0-9\\s]+',
     x = 'English 象形字 Chinese',
     replacement = "",
     ignore.case = TRUE,
     perl = TRUE)

Output: "English  Chinese" (two spaces remain, because only the Chinese characters themselves are removed)
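
Since the question asks how to identify such words rather than only delete them, the same idea can be turned around to list the offending tokens. The following is a minimal sketch in base R, assuming that "foreign" can be approximated as "contains at least one non-ASCII character" and that tokens are separated by whitespace. The helper name find_non_ascii_tokens is hypothetical and not part of tm.

```r
# Hypothetical helper (not part of tm): return the whitespace-separated
# tokens of a single string that contain at least one non-ASCII character.
# Assumption: a non-ASCII character is treated as a sign of a foreign word.
find_non_ascii_tokens <- function(s) {
  tokens <- strsplit(s, "\\s+", perl = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop the empty token left by leading whitespace
  tokens[grepl("[^\\x01-\\x7F]", tokens, perl = TRUE)]
}

# "\u8c61\u5f62\u5b57" is the Chinese word from the example above
find_non_ascii_tokens("English \u8c61\u5f62\u5b57 Chinese")

# Pure ASCII input yields an empty character vector
find_non_ascii_tokens("plain English text")
```

The first call should return only the Chinese token, and the second should return character(0). Words from another language that happen to be written entirely in ASCII letters are not detected by this check.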

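It's trickier if you are trying to remove words from a language like Spanish, because only some letters carry an accent, so many words consist entirely of plain a-z letters. For example, the following only catches words that have a non-ASCII character in the middle, so it doesn't work completely, but maybe it's a start.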

gsub(pattern = '[a-zA-Z0-9]+[^a-zA-Z0-9\\s]+[a-zA-Z0-9]+',
     x = 'El jalapeño es caliente',
     replacement = "",
     ignore.case = TRUE,
     perl = TRUE)

Output: "El  es caliente" (the two spaces around the removed word remain)
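
One way to also catch an accented character at the start or end of a word is to split the text into tokens first and drop every token that contains a non-ASCII character. This is only a sketch under the same assumption as the regex above (non-ASCII means foreign), and the function name drop_non_ascii_tokens is hypothetical.

```r
# Hypothetical helper (not part of tm): for each string in `s`, drop every
# whitespace-separated token that contains a non-ASCII character and
# rejoin the remaining tokens with single spaces.
drop_non_ascii_tokens <- function(s) {
  vapply(s, function(one) {
    tokens <- strsplit(one, "\\s+", perl = TRUE)[[1]]
    tokens <- tokens[nzchar(tokens)]  # ignore empty tokens from leading whitespace
    keep <- !grepl("[^\\x01-\\x7F]", tokens, perl = TRUE)
    paste(tokens[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# "jalape\u00f1o" is "jalapeño"; the accent sits in the middle of the word
drop_non_ascii_tokens("El jalape\u00f1o es caliente")

# "caf\u00e9" and "ol\u00e9" end in an accented letter, which the regex above misses
drop_non_ascii_tokens("caf\u00e9 ol\u00e9 amigo")
```

The first call should give "El es caliente" and the second "amigo". Because the function takes and returns a character vector, it could presumably be wrapped in content_transformer() and passed to tm_map() like the transformation shown earlier. Unaccented Spanish words such as "caliente" still pass through, so this remains a partial solution.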

Hope this helps!
