R data.table fread command: how to read large files with irregular separators?

Question

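I have to work with a collection of 120 files of ~2 GB each (525600 rows × 302 columns). The goal is to compute some statistics and put the results in a clean SQLite database.

Everything works fine when my script imports the data with read.table(), but it is slow. So I tried fread, from the data.table package (version 1.9.2), but it gives me this error: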

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 rows and 7 columns of my data look like this:

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

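So there is a leading space at the start of each line, exactly one space between the date columns, and an arbitrary number of spaces between the other columns.

I tried a command like this to convert the runs of spaces into commas: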

DT <- fread(
            paste("sed 's/\\s\\+/,/g'", txt),
            header=TRUE,
            select=c('YYYY','MM','DD','HH')   # column is named YYYY in the file
)

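No success: the problem remains, and the sed command seems slow.

fread does not seem to like "an arbitrary number of spaces" as a separator, or the empty column at the start of each line. Any ideas?

Here is a (maybe) minimal reproducible example (there is a newline character after 40790):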

txt <- " YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00"

testDT<-fread(txt,
              header=TRUE,
              select=c("YYYY","MM","DD","HH")
)

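Thanks for your help!

UPDATE: the error does not occur with data.table 1.8.*. With that version, the table is read as a single row, which is no better.

UPDATE 2: as mentioned in the comments, I can use sed to format the table and then read it with fread. I have put a script in an answer, where I create a sample dataset and then compare some system.time() results.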

Answer 1

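Just committed to devel, v1.9.5. fread() gained a strip.white argument with default TRUE (as opposed to base::read.table(), because it is more desirable). The example data has now been added to the tests.

With this recent commit: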

require(data.table) # v1.9.5, commit 0e7a835 or more recent
ans <- fread(" YYYY MM DD HH mm             19490             40790\n   1991 10  1  1  0      1.046465E+00      1.568405E+00")
#      V1 V2 V3 V4 V5           V6           V7
# 1: YYYY MM DD HH mm 19490.000000 40790.000000
# 2: 1991 10  1  1  0     1.046465     1.568405
sapply(ans, class)
#          V1          V2          V3          V4          V5          V6          V7 
# "character" "character" "character" "character" "character"   "numeric"   "numeric"
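As a rough illustration of how this might be used on the question's data, the sketch below forces the header and selects the date columns. It assumes a data.table release that includes the strip.white change (v1.9.6 or later); the second data row is made up for illustration and is not from the original file.

```r
library(data.table)  # assumption: v1.9.6 or later

# Text in the question's format: leading space, irregular runs of spaces.
# The second data row is invented for this sketch.
txt <- paste(
  " YYYY MM DD HH mm             19490             40790",
  " 1991 10  1  1  0      1.046465E+00      1.568405E+00",
  " 1991 10  1  1  1      1.051234E+00      1.570001E+00",
  sep = "\n"
)

# header = TRUE forces the first line to be used as column names;
# select keeps only the date columns, as in the question.
DT <- fread(txt, header = TRUE, select = c("YYYY", "MM", "DD", "HH"))
print(DT)
```

Passing header = TRUE explicitly avoids relying on header auto-detection, which in the output above treated the first line as data.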

Answer 2

sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g'

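That is for your sed command: it strips the leading blanks, then replaces each run of blanks with a single comma.

Is it not possible to collect all the fread results into one (temporary) file (adding a source reference) and process that file with sed (or another tool), to avoid forking the tool at each iteration?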
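To check what the sed expression produces without leaving R, the same two substitutions can be sketched with base R's sub() and gsub(). This is only an illustration on two sample lines from the question, not a replacement for running sed on a 2 GB file.

```r
# Two sample lines in the question's format
lines <- c(" YYYY MM DD HH mm             19490             40790",
           " 1991 10  1  1  0      1.046465E+00      1.568405E+00")

# Step 1: drop leading blanks (sed: s/^[[:blank:]]*//)
stripped <- sub("^[[:blank:]]*", "", lines)

# Step 2: replace each run of blanks with one comma (sed: s/[[:blank:]]\{1,\}/,/g)
csv_lines <- gsub("[[:blank:]]+", ",", stripped)

print(csv_lines)
# [1] "YYYY,MM,DD,HH,mm,19490,40790"
# [2] "1991,10,1,1,0,1.046465E+00,1.568405E+00"
```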

Answer 3

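Following the answers from NeronLeVelu and Clayton Stanley, I completed the answer with a custom function, sample data, and some system.time() calls for comparison. These tests were run on Mac OS 10.9 with R 3.0.2. However, I ran the same tests on a Linux machine, and there the sed command ran very slowly compared with read.table() with nrows and colClasses computed in advance. The fread part itself is very fast: about 5 seconds for 5e6 rows on both systems.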

library(data.table)


# create path to new temporary file
origData <- tempfile(pattern="origData",fileext=".txt")
# write table with irregular blank spaces separators.
write(paste0(" YYYY MM DD HH mm             19490             40790","\n",
                 paste(rep(" 1991 10  1  1  0      1.046465E+00      1.568405E+00", 5e6), 
                       collapse="\n"),"\n"),
      file=origData
)

# define column classes for read.table() optimization
colClasses <- c(rep('integer',5),rep('numeric',2))

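# Function to count rows with the wc command, for read.table() optimization.
# Note: the count includes the header line, so it is a safe upper bound for nrows.
fileRowsCount <- function(file){
    if(file.exists(file)){
        sysCmd <- paste("wc -l", shQuote(file))
        rowCount <- system(sysCmd, intern=TRUE)
        # strip all leading whitespace (some wc implementations pad the count)
        rowCount <- sub('^\\s+', '', rowCount)
        # the count is the first whitespace-separated field
        as.numeric(strsplit(rowCount, '\\s+')[[1]][1])
    }
}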

# Function to sed data into a temp file before importing with fread
sedFread<-function(file, sedCmd=NULL, ...){
    require(data.table)
    if(is.null(sedCmd)){
        # default: sed expression converting a blank-separated table to csv. Thanks NeronLeVelu!
        sedCmd <- "'s/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g'"
    }
    # sed into temp file
    tmpPath<-tempfile(pattern='tmp',fileext='.txt')
    # remove the temp file on exit, even if fread fails
    on.exit(unlink(tmpPath))
    sysCmd<-paste('sed',sedCmd, file, '>',tmpPath)
    try(system(sysCmd))
    DT<-fread(tmpPath,...)
    return(DT)
}

Mac OS results:

# First sed into temp file and then fread.
system.time(
DT<-sedFread(origData, header=TRUE)
)
> user  system elapsed
> 23.847   0.628  24.514

# Sed directly in fread command :
system.time(
DT <- fread(paste("sed 's/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g'", origData),
            header=T)
)
> user  system elapsed
> 23.606   0.515  24.219


# read.table without nrows and colclasses
system.time(
DF<-read.table(origData, header=TRUE)
)
> user  system elapsed
> 38.053   0.512  38.565

# read.table with nrows and colClasses
system.time(
DF<-read.table(origData, header=TRUE, nrows=fileRowsCount(origData), colClasses=colClasses)
)
> user  system elapsed
> 33.813   0.309  34.125

Linux results:

# First sed into temp file and then fread.
system.time(
  DT<-sedFread(origData, header=TRUE)
)
> Read 5000000 rows and 7 (of 7) columns from 0.186 GB file in 00:00:05
> user  system elapsed 
> 47.055   0.724  47.789 

# Sed directly in fread command :
system.time(
DT <- fread(paste("sed 's/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g'", origData),
            header=T)
)
> Read 5000000 rows and 7 (of 7) columns from 0.186 GB file in 00:00:05
> user  system elapsed 
> 46.088   0.532  46.623 

# read.table without nrows and colclasses
system.time(
DF<-read.table(origData, header=TRUE)
)
> user  system elapsed 
> 32.478   0.436  32.912 

# read.table with nrows and colClasses
system.time(
DF<-read.table(origData,
               header=TRUE, 
               nrows=fileRowsCount(origData),
               colClasses=colClasses)
 )
> user  system elapsed 
> 21.665   0.524  22.192 

# Check whether DT and DF are identical:
setnames(DT, old=names(DT), new=names(DF))
identical(as.data.frame(DT), DF)                                                              
>[1] TRUE

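Well: in this case (on the Linux machine), the method I used in the first place, read.table() with nrows and colClasses, is the most efficient.

Thanks to NeronLeVelu, Matt Dowle and Clayton Stanley!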

Answer 4

I found another way to do it, much faster, with awk instead of sed. Here is another example:

library(data.table)

# create path to new temporary file
origData <- tempfile(pattern="origData",fileext=".txt")

# write table with irregular blank spaces separators.
write(paste0(" YYYY MM DD HH mm             19490             40790","\n",
            paste(rep(" 1991 10  1  1  0      1.046465E+00      1.568405E+00", 5e6),
            collapse="\n"),"\n"),
            file=origData
  )


# function awkFread: first awk, then fread. Argument: colNums = selection of columns (by position).
awkFread<-function(file, colNums, ...){
        require(data.table)
        if(is.vector(colNums)){
            tmpPath<-tempfile(pattern='tmp',fileext='.txt')
            # remove the temp file on exit, even if fread fails
            on.exit(unlink(tmpPath))
            # build the awk print list, e.g. $1",",$2",",$3 for colNums = 1:3
            colGen<-paste0("$",colNums,"\",\"", collapse=",")
            # drop the trailing "," left after the last column
            colGen<-substr(colGen,1,nchar(colGen)-3)
            cmdAwk<-paste("awk '{print",colGen,"}'", file, '>', tmpPath)
            try(system(cmdAwk))
            DT<-fread(tmpPath,...)
            return(DT)
        }
}

# check read time :
system.time(
            DT3 <- awkFread(origData,c(1:5),header=T)
            )

> user  system elapsed 
> 6.230   0.408   6.644

Answer 5

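If peak memory is not an issue, or you can stream the file in manageable chunks, the following gsub() / fread() hybrid should work. It converts every run of consecutive whitespace characters into a single delimiter of your choice (e.g. "\t") before the text is parsed by fread():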

fread_blank = function(inputFile, spaceReplace = "\t", n = -1, ...){
  fread(
    input = paste0(
      # trimws() drops leading/trailing whitespace so no empty first column is created
      gsub(pattern = "[[:space:]]+",
           replacement = spaceReplace,
           x = trimws(readLines(inputFile, n = n))),
      collapse = "\n"),
    ...)
}

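I have to agree with the others that space-delimited files are not an ideal choice, but I run into them quite often, whether I like it or not.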
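The "manageable chunks" option mentioned above could look roughly like the following base R sketch. The helper name read_blank_chunks and its handle_chunk callback are hypothetical (they are not part of the answer or of data.table), and each chunk is parsed with read.table() so the sketch has no package dependency; swapping in fread() per chunk would be a natural variation.

```r
# Hypothetical helper: read a blank-separated file chunk by chunk,
# normalise whitespace, and hand each parsed chunk to a callback.
read_blank_chunks <- function(path, chunk_size = 1e5, handle_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # The first line holds the column names
  header <- strsplit(trimws(readLines(con, n = 1)), "[[:space:]]+")[[1]]
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Trim each line, then collapse every run of whitespace to one tab
    clean <- gsub("[[:space:]]+", "\t", trimws(lines))
    chunk <- read.table(text = clean, sep = "\t", header = FALSE,
                        col.names = header, check.names = FALSE)
    handle_chunk(chunk)
  }
  invisible(NULL)
}

# Usage on a small temporary file in the question's format
tmp <- tempfile(fileext = ".txt")
writeLines(c(" YYYY MM DD HH mm             19490             40790",
             rep(" 1991 10  1  1  0      1.046465E+00      1.568405E+00", 5)),
           tmp)
total_rows <- 0
read_blank_chunks(tmp, chunk_size = 2, handle_chunk = function(chunk) {
  total_rows <<- total_rows + nrow(chunk)   # e.g. accumulate statistics here
})
unlink(tmp)
print(total_rows)  # 5
```

Only one chunk is held in memory at a time, so peak memory is bounded by chunk_size rather than by the file size; the callback is where per-chunk statistics would be accumulated or written out.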


Answer 1
5 2015-09-16 00:42:37

Answer 2
4 accepted 2014-03-06 15:50:44

Answer 3
3 2014-03-10 11:01:12

Answer 4
2 2014-03-10 16:18:18

Answer 5
1 2016-01-26 00:22:21
