Mapping Unicode characters to a language in R

Question

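I am extracting data from a .pdf file written in Tamil (a regional Indian language). After extracting the text from the PDF in R, I get text made up of junk or Unicode-looking characters. I cannot map it back to the proper text, that is, the same text that appears in the PDF file. Here is the code: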

library(tm)
library(pdftools)
library(qdapRegex)
library(stringr)
library(textreadr)

if(!require("ghit")){
  install.packages("ghit")
}
# on 64-bit Windows
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere, use this call instead:
# ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))

library(tabulizer)  # provides extract_tables()
text <- extract_tables("D:/first.pdf")
# third entry in the second column of the first extracted table
text[[1]][, 2][3]

This gives me some garbage characters:

"Â«Ã®Ã¹Â£Ã±Â¢Â«Ã°Ã¬Â¢Â¬Ã¬  , Ã¢Ã´Â¢Ã¬Â£Ã±Â¢ÃºÂ¢ Â«Ã³Â£ Ì"

I tried changing the Unicode type:

library(stringi)
stri_trans_toupper("ÃªÂ¶Ã³Â®", locale = "Tamil")

But it did not work. Any suggestions would be appreciated.

Thanks.

Answer 1

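If your text was extracted successfully and the only problem is converting the encoding, I think the iconv function should work. Here is an example with text encoded in "cp932" (an East Asian encoding, used for Japanese).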

# text file written in cp932
x <- readLines("test-cp932.txt", encoding="utf-8")  

x
## [1] "\x82\xa0\x82肪\x82Ƃ\xa4"
# this is garbled because the file has been read
# in a wrong encoding

iconv(x, "cp932", "utf-8")
## [1] "ありがとう"
# this means 'thank you'

If this still does not solve the problem, your text may have been corrupted during parsing.
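Before giving up on iconv, it may be worth checking whether the output is simply wrapped in an extra encoding layer. The sketch below assumes (this is a guess from the look of the sample, not something the question confirms) that the garbled string is UTF-8 that was misread as Latin-1. It uses only the first few characters of the question's output, and the variable names are illustrative.

```r
# First few characters of the garbled output from the question,
# written as \u escapes so this script stays pure ASCII.
garbled <- "\u00c2\u00ab\u00c3\u00ae\u00c3\u00b9"

# Undo one layer: turn each character back into the single byte it came from.
bytes <- iconv(garbled, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
print(bytes)
## [1] c2 ab c3 ae c3 b9

# Reinterpret those bytes as UTF-8.
unwrapped <- rawToChar(bytes)
Encoding(unwrapped) <- "UTF-8"

# Code points of the unwrapped text.
code_points <- utf8ToInt(unwrapped)
print(code_points)
## [1] 171 238 249
```

All three code points are below 256, so even after unwrapping, the text consists of single-byte values rather than Tamil Unicode characters. Under the stated assumption, that would point to a font-specific encoding inside the PDF rather than a plain iconv problem.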

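Another possibility is to convert the string to a raw object (its byte codes) and rebuild the original text with a code mapping, starting from something like this: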

charToRaw(x)
##  [1] 82 a0 82 e8 82 aa 82 c6 82 a4
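The mapping step itself could look roughly like the sketch below. The lookup table is hypothetical: its three entries are placeholders chosen for illustration, not the real mapping of any Tamil font, and `byte_map` and `map_bytes` are names invented here. A real table would have to come from the documentation of the font or encoding used in the PDF.

```r
# HYPOTHETICAL lookup table: byte value (two-digit hex) -> Unicode text.
# These pairs are placeholders, not the real mapping of any Tamil font.
byte_map <- c("ab" = "\u0ba4", "ee" = "\u0bae", "f9" = "\u0bbf")

# Translate a raw vector byte by byte; unmapped bytes become U+FFFD.
map_bytes <- function(bytes, table) {
  keys <- as.character(bytes)      # raw -> "ab", "ee", ...
  out <- unname(table[keys])       # NA where a byte has no entry
  out[is.na(out)] <- "\ufffd"      # Unicode replacement character
  paste(out, collapse = "")
}

# Two mapped bytes followed by one byte that is missing from the table.
mapped <- map_bytes(as.raw(c(0xab, 0xee, 0x41)), byte_map)
print(utf8ToInt(mapped))
## [1]  2980  2990 65533
```

A one-byte-to-one-string table is the simplest possible design. Legacy encodings for Indic scripts may also need glyphs to be reordered, which this sketch does not attempt.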

Answer 2

This PDF is not in Unicode format, and I could not find its encoding scheme: http://dev.neechalkaran.com/p/oovan.html

You will have to find a way to decode its encoding, or use a Unicode PDF.
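One way to test whether extracted text is already Unicode Tamil is to look for code points in the Unicode Tamil block (U+0B80 to U+0BFF). The helper below is a minimal sketch; `has_tamil` is a name invented here, and it expects a single string.

```r
# TRUE if the string contains at least one code point from the Unicode
# Tamil block (U+0B80 to U+0BFF).
has_tamil <- function(x) {
  cp <- utf8ToInt(enc2utf8(x))
  any(cp >= 0x0B80 & cp <= 0x0BFF, na.rm = TRUE)
}

# The word "Tamil" written in Tamil script, as \u escapes.
has_tamil("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd")
## [1] TRUE

# The start of the garbled output from the question.
has_tamil("\u00c2\u00ab\u00c3\u00ae")
## [1] FALSE
```

If the check returns FALSE for every extracted cell, the text layer of the PDF probably does not store Tamil as Unicode, which would be consistent with this answer.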

Answer 1
2 2017-09-16 13:32:33

Answer 2
0 2017-10-23 10:43:25
