R - Cannot read a file containing the control character [SUB]

Question

I've had this issue before, but my previous solution doesn't fix it.

In my text data, when I show all characters in Notepad++, a character listed as [SUB] appears.

Previously, I deleted these by doing this...

## Read the file in as Binary
r = readBin( curFile, raw(), file.info(curFile)$size)

## Convert the pesky characters
if ((r[1]==as.raw(0x1a)))
{
    ## Find it
    spot = which(r == as.raw(0x1a) )
    r[r == as.raw(0x1a)] = as.raw(0x20)
}

However, this isn't working. It seems like every time I manage to escape an invisible character, another one causes me a problem within a week. Is there a way to effectively "clean" a file of all invisible control characters other than the newlines separating my data entries?

Please let me know. This is maddening already.

Thanks!

I can make a limited CSV file for you all to try. It's the second line, fourth column that causes the crash.

http://www.megafileupload.com/6ead/stackOverflow.csv

The entire code I was using to do this is below.

library(stringr)
############# DO THIS FIRST 
folder = "C:\\Twitter_TimeSeries\\Bernie_Practice\\"

## Get the file name of every file in the directory 
file.names = dir(folder, pattern="\\.csv$")

## Figure out how many files there are
numFiles = length(file.names)

## Loop through every file 
for( i in 1:length(file.names))
{
    ## Which file are we on?
    curFile = paste( folder, file.names[i], sep="" )

    ## Read the file in as Binary
    r = readBin( curFile, raw(), file.info(curFile)$size)

    ## Convert the pesky characters
    if ((r[1]==as.raw(0x1a)))
    {
        ## Find it
        spot = which(r == as.raw(0x1a) )
        r[r == as.raw(0x1a)] = as.raw(0x20)
    } 
    if ((r[1]==as.raw(0x0a))) {
        ## Find it
        spot = which(r == as.raw(0x0a) )
        r[r == as.raw(0x1a)] = as.raw(0x20)
    } ## If 
    ## Re-write the file
    writeBin(r, curFile)
} ## For

curFile = "stackOverflow.csv"
rawData = read.csv(curFile, stringsAsFactors=FALSE)

Answer 1

Try using a regular expression to restrict the data to only the allowed characters.

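x = read.csv("foo.csv", colClasses="character")
## Keep just digits and '.'; applied per column, because gsub() on a whole data frame would deparse it
x[] = lapply(x, function(col) gsub("[^0-9.]", "", col))
## Assuming your file really represents numeric data
x[] = lapply(x, as.numeric)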
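If the data is not purely numeric, the same "allow-list" idea can be applied to the raw bytes before parsing. Note that the snippet in the question only tests r[1], so it does nothing unless the [SUB] byte (0x1a) happens to be the first byte of the file. Below is a minimal sketch, not a definitive fix: strip_control_bytes is a hypothetical helper name, and it assumes that tab, line feed and carriage return are the only control bytes worth keeping and that a space is an acceptable replacement.

```r
## Hypothetical helper (not from the original post): replace every ASCII
## control byte except tab, LF and CR with a space.
strip_control_bytes = function(r, replacement = as.raw(0x20)) {
    b = as.integer(r)                      # byte values 0..255
    keep = c(9L, 10L, 13L)                 # tab, line feed, carriage return
    ctrl = (b < 32L | b == 127L) & !(b %in% keep)
    r[ctrl] = replacement
    r
}

## Demo on an in-memory vector: "a", SUB, "b", LF, NUL, "c"
demo = as.raw(c(0x61, 0x1a, 0x62, 0x0a, 0x00, 0x63))
cleaned = strip_control_bytes(demo)
rawToChar(cleaned)

## For a real file (curFile is whatever path you are reading):
## r = readBin(curFile, raw(), file.info(curFile)$size)
## rawData = read.csv(text = rawToChar(strip_control_bytes(r)), stringsAsFactors = FALSE)
```

Bytes at 0x80 and above are left alone, so multi-byte UTF-8 characters should pass through unchanged. Parsing from the cleaned text rather than rewriting the file keeps the original data intact in case the assumptions above turn out to be wrong for your files.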

Solution 1
0 2016-03-06 17:00:31
