
Keeping Turkish characters with the text mining package for R

Let me start by saying that I'm still pretty much a beginner with R. I am currently trying out basic text mining techniques on Turkish texts using the tm package. However, I have run into a problem with how R displays Turkish characters.

Here's what I did:

docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")

My thinking was that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively, and the file is saved in an ANSI encoding, which cannot represent these characters.

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))

also saves the file without the characters in ANSI encoding.

The problem does not seem to be limited to the output file.

writeLines(as.character(docs[[1]]))

for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".

After reading the question "UTF-8 file output in R", I also tried the following code:

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)

which didn't change the results.

All of this is on Windows 7 with both the most recent version of R and RStudio.

Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.

Here is how I keep the Turkish characters intact:

  1. Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
  2. Copy and Paste your text containing Turkish characters.
  3. Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding... -> UTF-8)
  4. yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8" )
  5. yourdocument <- paste(yourdocument, collapse = " ")
  6. After this step you can create your corpus.
  7. For example, start from VectorSource() in the tm package.
  8. The Turkish characters will then appear as they should.
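
The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the temporary files stand in for your own documents, write_utf8() is a hypothetical helper (not part of tm), and the sample sentence is written with \u escapes so the script does not depend on its own file encoding. Writing with useBytes = TRUE on UTF-8 strings is the approach suggested in "UTF-8 file output in R"; it should matter mostly on Windows builds of R where the native encoding is not UTF-8.

```r
# Sketch only: temp files are placeholders for your own documents.
sample_text <- "Okul ve cami a\u00e7\u0131l\u0131\u015flar\u0131 umutlar\u0131 art\u0131rd\u0131"

# Hypothetical helper: write the UTF-8 bytes as-is.
# useBytes = TRUE skips re-encoding to the native code page.
write_utf8 <- function(text, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(text), con, useBytes = TRUE)
}

in_path <- tempfile(fileext = ".txt")
write_utf8(sample_text, in_path)

# Steps 4 and 5: read the lines as UTF-8 and collapse them into one string.
yourdocument <- readLines(in_path, encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")

# Steps 6 and 7: build the corpus from a VectorSource (requires tm).
if (requireNamespace("tm", quietly = TRUE)) {
  docs <- tm::VCorpus(tm::VectorSource(yourdocument))
  # Convert one document at a time, not the whole corpus object.
  corpus_text <- as.character(docs[[1]])
} else {
  corpus_text <- yourdocument
}

# Write the text back out as UTF-8 and read it again to check it.
out_path <- tempfile(fileext = ".txt")
write_utf8(corpus_text, out_path)
round_trip <- readLines(out_path, encoding = "UTF-8")
```

Note that the corpus text is extracted with as.character(docs[[1]]), one document at a time; calling as.character() on the whole corpus object, as in the question, does not return the plain document text.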
