I've got a data frame f in R with one column called utterance, which contains character strings like:
~↑I don't think I can↑~
and ↓carrying↓
Whenever I use
write.table(f, "C:/Users/...txt", sep="\t", quote=F, row.names=F, fileEncoding = "UTF-8")
to create a table in a .txt file, the up and down arrows appear like this in the created file:
<U+2191> instead of the actual ↑
<U+2193> instead of the actual ↓
~<U+2191>I don't think I can<U+2191>~
<U+2193>carrying<U+2193>
How can I fix this problem and get the actual ↑ and ↓ in the .txt files by using the correct settings for write.table in R? I'm using the standard text editor of Windows 10 and Notepad++.
There is some advice in "Escaping from character encoding hell in R on Windows" (and in all other known articles on this topic); however, it does not seem to be useful in this particular case, as the ↑ and ↓ characters do not belong to any natural language.
Write file as UTF-8 encoding in R for Windows
… when the R writes a UTF-8 text into a file on Windows, characters of unsupported language are modified. In contrast, all characters are written correctly in Mac OS.
Using binary
There is a solution for this problem. Writing a binary file instead of a text file solves this. All applications handling a UTF-8 file in Windows are using the same trick.
BOM
The BOM should not be used in UTF-8 files. This is what the Linux and the Mac OS are doing. But the Windows Notepad and some applications use the BOM. So, handling the BOM is needed, in spite of grammatically wrong.
…
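To tie the quoted advice back to the data frame f from the question, here is a minimal base-R sketch of the binary-write approach. The helper name write_utf8_tsv and its bom argument are hypothetical (they do not come from the quoted article), and the sketch assumes that the columns are plain atomic vectors and that no field contains a tab or a line break. The arrows are written as \u escapes so that the script itself stays ASCII.

```r
# Hypothetical helper (an illustration, not the quoted article's code):
# write a data frame as tab-separated UTF-8 text through a binary connection,
# so that no locale-dependent re-encoding happens on the way to disk.
write_utf8_tsv <- function(df, path, bom = FALSE) {
  header <- paste(names(df), collapse = "\t")
  # unname() keeps column names from clashing with paste()'s own arguments
  rows <- do.call(paste, c(unname(as.list(df)), sep = "\t"))
  text <- paste0(paste(c(header, rows), collapse = "\r\n"), "\r\n")
  bytes <- charToRaw(enc2utf8(text))

  con <- file(path, open = "wb")
  on.exit(close(con))
  # optional UTF-8 byte order mark (EF BB BF)
  if (bom) writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
  writeBin(bytes, con)
  invisible(path)
}

# Sample data modelled on the question; \u2191 is the up arrow, \u2193 the down arrow
f <- data.frame(
  utterance = c("~\u2191I don't think I can\u2191~", "\u2193carrying\u2193"),
  stringsAsFactors = FALSE
)
out_path <- tempfile(fileext = ".txt")
write_utf8_tsv(f, out_path)
```

Passing bom = TRUE makes the file start with the bytes EF BB BF, which may help editors such as Windows Notepad recognise the file as UTF-8.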
arrows.html (a sample UTF-8 file, used later in 70166451.r):
<!DOCTYPE html>
<html>
<head> <meta charset="utf-8"> </head>
<body>up=↑ ↑↓ down=↓</body>
</html>
70166451.r (partially commented script):
### my circumstances
setwd("D:\\BAT\\R")
filepath = '70166451.txt'
### ↓↓↓ adapted from https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
BOM <- charToRaw('\xEF\xBB\xBF')
writeUtf8 <- function(xstr, filepath, forappend=FALSE, bom=FALSE) {
  # open in binary mode ('ab' = append, 'wb' = overwrite) so the bytes are written unchanged
  openmode <- if (forappend) 'ab' else 'wb'
  con <- file(filepath, open=openmode)
  # the BOM only makes sense at the very start of a new file
  if (!forappend && bom) writeBin(BOM, con, endian="little")
  # If the connection is open it is written from its current position:
  writeBin(charToRaw(xstr), con, endian="little")
  close(con)
}
### ↑↑↑ adapted from https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
### hard-coded characters ↑ and ↓
aa <- "up ↑ (↑↓) ↓ down" # unworkable? (not solved)
aa <- "up \u2191 (↑↓) \u2193 down" # unworkable! (unsolvable?)
aa <- "up \u2191 (\u2191\u2193) \u2193 down" # workable! (solved here)
# print( c( 'aa ', Encoding(aa), aa ))
# "aa " "UTF-8" "up <U+2191> (<U+2191><U+2193>) <U+2193> down"
xx <- data.frame( myword = c(aa,toupper(aa)), word = c(toupper(aa),aa))
yy <- readr::format_tsv( xx, append = F, quote_escape = "none", eol = "\r\n")
writeUtf8( yy, filepath)
### characters read from a file
library(xml2)
rawHTML <- paste(readLines("arrows.html", encoding='utf-8'), collapse=" ")
aaa <- xml_text(read_html(charToRaw(rawHTML)))
# print( c( 'aaa', Encoding(aaa), aaa ))
# "aaa" "UTF-8" "up=<U+2191> <U+2191><U+2193> down=<U+2193>"
xxx <- data.frame( myword = c(aaa,toupper(aaa)), word = c(toupper(aaa),aaa))
yyy <- readr::format_tsv( xxx, append = T, quote_escape = "none", eol = "\r\n")
writeUtf8( yyy, filepath, forappend=T)
Result (one can copy and paste the above code snippet into an open R Console window, or save it and run it using Rscript.exe as shown below):
pushd D:\bat\R & del 70166451*.txt & rscript 70166451.r & type 70166451*.txt & popd
70166451.txt
myword	word
up ↑ (↑↓) ↓ down	UP ↑ (↑↓) ↓ DOWN
UP ↑ (↑↓) ↓ DOWN	up ↑ (↑↓) ↓ down
up=↑ ↑↓ down=↓	UP=↑ ↑↓ DOWN=↓
UP=↑ ↑↓ DOWN=↓	up=↑ ↑↓ down=↓
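Because a console or an editor can itself display the arrows incorrectly, it may be worth checking the bytes on disk rather than trusting what is shown. The following is a small sketch of such a check; the function name check_arrows is hypothetical, and the two temporary files are created here only to make the example self-contained.

```r
# Hypothetical check (not part of the original script): look at the raw bytes
# of a file for real UTF-8 arrows versus "<U+2191>"-style escape text.
check_arrows <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  txt <- rawToChar(bytes)
  list(
    # literal text such as <U+2191> means the arrow was lost on writing
    has_escapes = grepl("<U\\+[0-9A-Fa-f]{4,6}>", txt, useBytes = TRUE),
    # UTF-8 encodes the up arrow as E2 86 91 and the down arrow as E2 86 93
    has_up   = grepl("\u2191", txt, fixed = TRUE, useBytes = TRUE),
    has_down = grepl("\u2193", txt, fixed = TRUE, useBytes = TRUE)
  )
}

# A file written as raw UTF-8 bytes (the approach used in this answer)
good_file <- tempfile(fileext = ".txt")
con <- file(good_file, open = "wb")
writeBin(charToRaw(enc2utf8("\u2193carrying\u2193\r\n")), con)
close(con)

# A file that contains the escape text described in the question
bad_file <- tempfile(fileext = ".txt")
writeLines("<U+2193>carrying<U+2193>", bad_file)

good_result <- check_arrows(good_file)
bad_result  <- check_arrows(bad_file)
```

Here good_result reports real arrows and no escapes, while bad_result reports escapes and no arrows.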