How to use write.table in R with up/down-arrows?

I've got a dataframe f in R with one column called utterance which contains lines with character strings like:

  • ~↑I don't think I can↑~ and
  • ↓carrying↓

Whenever I'm using

write.table(f, "C:/Users/...txt", sep="\t", quote=F, row.names=F, fileEncoding = "UTF-8")

to create a table in a .txt file, the up and down arrows appear like this in the created file:

  • <U+2191> instead of the actual ↑

  • <U+2193> instead of the actual ↓

  • ~<U+2191>I don't think I can<U+2191>~

  • <U+2193>carrying<U+2193>

How can I fix this problem and get the actual ↑ and ↓ in the txt files by using the correct settings for write.table in R? I'm using the standard Windows 10 text editor and Notepad++.

There is some advice in "Escaping from character encoding hell in R on Windows" (and in all the other known articles on this topic); however, it does not seem to help in this particular case, as the ↑ and ↓ characters do not belong to any natural language.

Good news

Write file as UTF-8 encoding in R for Windows

… when the R writes a UTF-8 text into a file on Windows, characters of unsupported language are modified. In contrast, all characters are written correctly in Mac OS.

Using binary

There is a solution for this problem. Writing a binary file instead of a text file solves this. All applications handling a UTF-8 file in Windows are using the same trick.
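As a minimal sketch of the binary-write idea applied to a data frame like the one in the question, the following uses base R only. The helper name write_tsv_utf8, its eol parameter, and the temporary output path are assumptions made for illustration; they are not part of the quoted article or of write.table.

```r
# Hypothetical helper: write a data frame as tab-separated UTF-8 text
# through a binary connection, so the bytes are not re-encoded on Windows.
write_tsv_utf8 <- function(df, path, eol = "\r\n") {
  # Convert every column to character and make sure it is UTF-8.
  cols <- lapply(unname(as.list(df)),
                 function(col) enc2utf8(as.character(col)))
  header <- paste(enc2utf8(names(df)), collapse = "\t")
  # Join the columns row by row with tabs.
  rows <- do.call(paste, c(cols, sep = "\t"))
  txt <- paste0(paste(c(header, rows), collapse = eol), eol)
  con <- file(path, open = "wb")
  on.exit(close(con))
  # charToRaw() exposes the UTF-8 bytes; writeBin() writes them unchanged.
  writeBin(charToRaw(enc2utf8(txt)), con)
  invisible(path)
}

# Sample data modelled on the question (arrows given as \u escapes).
f <- data.frame(
  utterance = c("~\u2191I don't think I can\u2191~", "\u2193carrying\u2193"),
  stringsAsFactors = FALSE
)
out <- file.path(tempdir(), "utterances.txt")
write_tsv_utf8(f, out)
```

Unlike write.table, this sketch does no quoting or escaping, so it only suits fields that contain no tabs or line breaks.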

BOM

The BOM should not be used in UTF-8 files, and Linux and Mac OS do not use it. However, Windows Notepad and some other applications do use the BOM, so handling it is necessary even though it is technically incorrect.
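To see whether a given file starts with a BOM, one can inspect its first three bytes. The sketch below is only an illustration; the names UTF8_BOM and has_utf8_bom are hypothetical and do not come from the quoted article.

```r
# The UTF-8 byte order mark is the three-byte sequence EF BB BF.
UTF8_BOM <- as.raw(c(0xEF, 0xBB, 0xBF))

# Hypothetical helper: TRUE if the file at `path` begins with a UTF-8 BOM.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, UTF8_BOM)
}
```

When reading such a file back, opening the connection with file(path, encoding = "UTF-8-BOM") lets R drop the mark.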

Solution

arrows.html (a sample UTF-8 file, used later in 70166451.r )

<!DOCTYPE html>
<html>
    <head> <meta charset="utf-8"> </head>
    <body>up=&#x2191;  ↑↓  down=&#x2193;</body>
</html>

70166451.r (partially commented script):

###                  my circumstances
setwd("D:\\BAT\\R")    
filepath = '70166451.txt'

### ↓↓↓ adapted from https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
BOM <- charToRaw('\xEF\xBB\xBF')
writeUtf8 <- function(xstr, filepath, forappend=FALSE, bom=FALSE) {
  # open in binary mode so that R does not re-encode the bytes on Windows
  openmode <- if (forappend) 'ab' else 'wb'
  con <- file( filepath, open=openmode)
  on.exit(close(con))
  # write the BOM only at the start of a new file
  if( !forappend && bom ) writeBin(BOM, con)
  # the connection is open, so writing continues from its current position
  writeBin(charToRaw(xstr), con)
}
###  ↑↑↑ adapted from https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

### hard-coded characters ↑ and ↓
aa <- "up ↑ (↑↓) ↓ down"                      # unworkable? (not solved)
aa <- "up \u2191 (↑↓) \u2193 down"            # unworkable! (unsolvable?)
aa <- "up \u2191 (\u2191\u2193) \u2193 down"  #   workable! (solved here)
# print( c( 'aa ', Encoding(aa), aa ))
# "aa "   "UTF-8"   "up <U+2191> (<U+2191><U+2193>) <U+2193> down"

xx <- data.frame( myword = c(aa,toupper(aa)), word = c(toupper(aa),aa))
yy <- readr::format_tsv( xx, append = F, quote_escape = "none", eol = "\r\n")
writeUtf8( yy, filepath)

### characters read from a file 
library(xml2)
rawHTML <- paste(readLines("arrows.html", encoding='utf-8'), collapse=" ")
aaa <- xml_text(read_html(charToRaw(rawHTML)))
# print( c( 'aaa', Encoding(aaa), aaa ))
# "aaa"   "UTF-8"   "up=<U+2191>  <U+2191><U+2193>  down=<U+2193>"

xxx <- data.frame( myword = c(aaa,toupper(aaa)), word = c(toupper(aaa),aaa))
yyy <- readr::format_tsv( xxx, append = T, quote_escape = "none", eol = "\r\n")
writeUtf8( yyy, filepath, forappend=T)

Result (copy and paste the code snippet above into an open R console window, or save it and run it with Rscript.exe as shown below):

pushd D:\bat\R & del 70166451*.txt & rscript 70166451.r & type 70166451*.txt & popd

70166451.txt

myword	word
up ↑ (↑↓) ↓ down	UP ↑ (↑↓) ↓ DOWN
UP ↑ (↑↓) ↓ DOWN	up ↑ (↑↓) ↓ down
up=↑ ↑↓ down=↓	UP=↑ ↑↓ DOWN=↓
UP=↑ ↑↓ DOWN=↓	up=↑ ↑↓ down=↓
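
If the console code page makes the type output hard to judge, the file can also be checked from R itself. This is a rough sketch under the assumption that the file is UTF-8 without embedded NUL bytes; read_utf8_text and has_escaped_codepoints are hypothetical helper names.

```r
# Hypothetical helper: read a whole file as bytes and mark the text as UTF-8.
read_utf8_text <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  txt <- rawToChar(bytes)
  Encoding(txt) <- "UTF-8"
  txt
}

# Hypothetical helper: TRUE if the text still holds literal escapes
# such as "<U+2191>" instead of the characters themselves.
has_escaped_codepoints <- function(txt) {
  grepl("<U\\+[0-9A-F]{4,6}>", txt)
}
```

Calling has_escaped_codepoints(read_utf8_text(filepath)) on the generated file should give FALSE if the arrows were stored as real characters.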
