Remove special characters in R from .docx

Question

I've seen various posts about removing special characters in R (such as this one: Remove all special characters from a string in R? ), but none of the strategies have worked for my issue.

I have a transcript that I am reading in with qdap's read.transcript(). When I read in the document, it makes lines with special characters look like this:

If anyone knows how to simply change these special characters (i.e <e1><b8><9d> to e), again please feel free to update!

I have tried:

     ATL1$X2 <- gsub("[^0-9A-Za-z///,.?()' ]", "", ATL1$X2)
     If anyone knows how to simply change these special characters (i.e e1b89d to e), again please feel free to update

But that only strips the angle brackets, leaving e1b89d behind, and it also removes the "!".

I have also tried:

 str_replace_all(ATL1$X2, "[^[:alnum:]]", " ")
 If anyone knows how to simply change these special characters  i e  e1  b8  9d  to e   again please feel free to update

But that is even worse and removes all punctuation and still doesn't fix my issue.

Last, I have also tried:

 iconv(ATL1$X2, from = 'UTF-8', to = 'ASCII//TRANSLIT')
 If anyone knows how to simply change these special characters (i.e <e1><b8><9d> to e), again please feel free to update!

But nothing was changed here either.

In an ideal world, the output would look like:

 If anyone knows how to simply change these special characters (i.e e e e to e), again please feel free to update!

Thus, the special characters are read in as what they "should" be. If this is not possible, I'd honestly be okay if it just removed the special characters (but not the other characters, like the exclamation points) and looked like this:

 If anyone knows how to simply change these special characters (i.e to e), again please feel free to update!

Thank you!

Answer 1

There are several things that make this hard:

You want to replace characters with something that merely looks similar, which is more than converting between encodings. In your example, "<e1><b8><9d>" does not stand for a plain "e"; it stands for an "e" with diacritics (these are the UTF-8 bytes of "ḝ"), so R won't change it on its own. There are functions that do this, though.
It looks like qdap's read.transcript tries to be helpful. What you show here, and your results, are consistent with these not being special characters at all but the literal text "<e1><b8><9d>". So when you try to remove special characters, gsub happily complies and removes the "<" and ">", leaving "e1" and so on alone.
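
A quick way to check this, and to get the fallback output from the question (placeholders simply removed), is sketched below. It assumes the text really contains the literal characters "<e1><b8><9d>"; the variable names are only for illustration.

```r
x <- "If anyone knows how to simply change these special characters (i.e <e1><b8><9d> to e), again please feel free to update!"

# TRUE means the angle brackets and hex digits are ordinary ASCII text,
# not an encoded special character
has_literal <- grepl("<e1><b8><9d>", x, fixed = TRUE)

# Drop each run of <xx> placeholders plus one trailing space;
# all other punctuation, including "!", is left alone
stripped <- gsub("(<[0-9A-Fa-f]{2}>)+ ?", "", x)
```

Because the pattern only matches two hex digits wrapped in angle brackets, ordinary text such as "e1" without the brackets is not touched.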

To solve your problem, I think you want to convert the placeholders back to special characters and then use stri_trans_general from the stringi package. I'm sure there are other similar functions out there, but this one works for me. It turns out that converting back to the special characters is the hard part, but I've got some working code:

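library(stringi)
mystring <- 'If anyone knows how to simply change these special characters (i.e <e1><b8><9d> to e), again please feel free to update!'
# locate every run of "<xx>" hex placeholders
pos <- gregexpr('(<[A-Fa-f0-9]{2}>)+', mystring)[[1]]

# extract each run, e.g. "<e1><b8><9d>"
replace <- substring(mystring, pos, pos+attr(pos, 'match.length')-1)
# rewrite each run as the escaped literal '\xe1\xb8\x9d' and let R parse it into real bytes
replace <- sapply(replace, function(r) {
  eval(parse(text=paste0('\'', gsub('>', '', gsub('<', '\\\\x', r)), '\'')))
})
# substitute the runs one at a time, in order of appearance
for(i in seq_along(replace)) {
  mystring <- sub('(<[A-Fa-f0-9]{2}>)+', replace[i], mystring)
}
# transliterate the restored characters to their closest ASCII equivalents
mystring <- stri_trans_general(mystring, 'latin-ascii')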

We first extract everything that looks like hexadecimal digits between "<" and ">", convert each run to an escaped literal such as '\xe1\xb8\x9d', ask R to evaluate that literal, and replace the old placeholders with the results.
Only the last line replaces the special characters with (in this example) "e".
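
If you would rather avoid eval(parse()), the decoding step can also be sketched in base R by turning the hex pairs into raw bytes. This is only a sketch: decode_hex_tags is a hypothetical helper name (it is not part of qdap or stringi), and it assumes the placeholder bytes form valid UTF-8 and never include <00>.

```r
# Hypothetical helper: turn runs like "<e1><b8><9d>" back into the
# character those bytes encode, leaving the rest of the string untouched.
decode_hex_tags <- function(x) {
  m <- gregexpr("(<[0-9A-Fa-f]{2}>)+", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tags) {
    vapply(tags, function(tag) {
      # pull out the two-digit hex values, e.g. c("e1", "b8", "9d")
      hex <- regmatches(tag, gregexpr("[0-9A-Fa-f]{2}", tag))[[1]]
      ch <- rawToChar(as.raw(strtoi(hex, base = 16L)))
      Encoding(ch) <- "UTF-8"  # assumption: the bytes are UTF-8
      ch
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- decode_hex_tags("special characters (i.e <e1><b8><9d> to e), update!")
```

The decoded string can then be passed to stri_trans_general(decoded, 'latin-ascii') exactly as in the code above.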
