
R: Reading a UCS-2 LE BOM file from GitHub

I have a program which automatically creates and stores files on GitHub. An example is https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master/test-999-666.txt

However, the files are encoded on a DOS/Windows machine as UCS-2 LE BOM (according to Notepad++).

I am trying to read this text file into R, but to no avail:

repo <- "https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master"
file <- "test-999-666.txt"
myurl  <- paste(repo, file, sep="/")
library(RCurl)
cnt <- getURL(myurl)

I get an error (the message is in French; it says there is a nul character in the middle of the string):

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
 caractère nul au milieu de la chaîne : '<ff><fe>*'

How can I configure getURL to read this file? I also tried httr::GET, but received empty content.

This seems to be a relatively common pain point when working with files produced on Windows. To be honest, the solution I'm presenting is probably not the best one, because it sidesteps converting everything into the right encoding and instead works on the binary content directly.

Using the same variables as you:

cnt <- getURLContent(myurl, binary = TRUE)  # fetch the file as a raw vector
cnt <- rawToChar(cnt[cnt != as.raw(0)])     # drop nul bytes, then convert to a string

This should produce a parsable string.

The idea is that instead of having curl interpret the file as text, we let it treat the file as binary and deal with the encoding later. This gives us a vector of type raw. UCS-2 LE stores each character in two bytes, so every ASCII character is followed by a nul byte (00), and those nul bytes are what caused the error. We therefore exclude them from cnt before coercing it from raw to character. Note that this only works cleanly when the content is plain ASCII; for any other character the second byte is not 00, and the result would be garbled.
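As a hedged alternative to dropping the nul bytes, the raw vector can be decoded with base R's iconv, which accepts a list of raw vectors. The sketch below does not touch the network: the bytes vector is a hypothetical stand-in for what getURLContent(myurl, binary = TRUE) returns, and it assumes the platform's iconv knows the name "UTF-16LE" (UCS-2 LE text is valid UTF-16LE).

```r
# Hypothetical stand-in for the downloaded content:
# BOM (ff fe) followed by "Hi" and a euro sign in UCS-2 LE.
bytes <- as.raw(c(0xff, 0xfe,   # byte order mark
                  0x48, 0x00,   # "H"
                  0x69, 0x00,   # "i"
                  0xac, 0x20))  # U+20AC, the euro sign

# Approach from the answer: drop nul bytes, then convert.
# The BOM and the two euro-sign bytes survive as stray bytes.
stripped <- rawToChar(bytes[bytes != as.raw(0)])

# Alternative: skip the 2-byte BOM and let iconv decode the rest.
decoded <- iconv(list(bytes[-(1:2)]), from = "UTF-16LE", to = "UTF-8")

# Inspect the decoded code points: H, i, euro sign.
print(utf8ToInt(decoded))
```

With the real file, the same iconv call on the downloaded raw vector should give a string without the leading BOM bytes, and it would keep any non-ASCII characters intact.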

In the end, from your example, I get

"ÿþ*** Header Start ***\r\nVersionPersist: 1\r\nLevelName: Session\r\nLevelName: Block\r\nLevelName: Trial\r\nLevelName: SubTrial\r\nLevelName: LogLevel5\r\nLevelName: LogLevel6\r\nLevelName: LogLevel7\r\nLevelName: LogLevel8\r\nLevelName: LogLevel9\r\nLevelName: LogLevel10\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\n*** Header End ***\r\nLevel: 1\r\n*** LogFrame Start ***\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\nClock.Information: <?xml version=\"1.0\"?>\\n<Clock xmlns:dt=\"urn:schemas-microsoft-com:datatypes\"><Description dt:dt=\"string\">E-Prime Primary Realtime Clock</Description><StartTime><Timestamp dt:dt=\"int\">0</Timestamp><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></StartTime><FrequencyChanges><FrequencyChange><Frequency dt:dt=\"r8\">2742255</Frequency><Timestamp dt:dt=\"r8\">492902384024</Timestamp><Current dt:dt=\"r8\">0</Current><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></FrequencyChange></FrequencyChanges></Clock>\\n\r\nStudioVersion: 2.0.10.252\r\nRuntimeVersion: 2.0.10.356\r\nRuntimeVersionExpected: 2.0.10.356\r\nRuntimeCapabilities: Professional\r\nExperimentVersion: 1.0.0.543\r\nExperimentStuff.RT: 2555\r\n*** LogFrame End ***\r\n"

This seems to contain all the right content. The leading ÿþ is the byte order mark (bytes ff fe), which is not removed by this approach.

If you want, you can try adding options(encoding = "UCS-2LE-BOM") before this code. I don't know whether it changes anything, but it seems to affect rawToChar.
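Another option, not tested against the file in the question, is to declare the encoding on a connection rather than globally. The sketch below writes a small hypothetical UCS-2 LE sample to a temporary file and reads it back with file(..., encoding = "UTF-16LE"); it assumes the platform's iconv supports that name. The temporary file and its contents are illustrative only.

```r
# Hypothetical sample: BOM followed by "Hi\r\nBye\r\n" in UCS-2 LE.
bytes <- as.raw(c(0xff, 0xfe,
                  0x48, 0x00, 0x69, 0x00,              # "Hi"
                  0x0d, 0x00, 0x0a, 0x00,              # CR LF
                  0x42, 0x00, 0x79, 0x00, 0x65, 0x00,  # "Bye"
                  0x0d, 0x00, 0x0a, 0x00))             # CR LF

# Write the bytes unchanged to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeBin(bytes, tmp)

# Open the file with an explicit encoding so R re-encodes while reading.
con <- file(tmp, open = "r", encoding = "UTF-16LE")
lines <- readLines(con)
close(con)
unlink(tmp)

print(lines)
```

For the GitHub file, one way to reuse this would be to save the raw vector from getURLContent with writeBin and then read the saved file through such a connection.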

