R - Reading a binary matrix that has no delimiters

Question

I am trying to read a large (~100 MB) binary matrix into R. This is what the plain text looks like:

10001010
10010100
00101101

Expected output:

  V1 V2 V3 V4 V5 V6 V7 V8
r1  1  0  0  0  1  0  1  0
r2  1  0  0  1  0  1  0  0
r3  0  0  1  0  1  1  0  1

Right now I read each line and split it into individual bits. Is there a more efficient way to do this?

Answer 1

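A base R option (which may be slow) is to scan the .txt file, split each element on the empty string "", convert the pieces to numeric/integer, and rbind the list elements to create a matrix.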

 m1 <- do.call(rbind,lapply(strsplit(scan("inpfile.txt", 
                 what=""), ""), as.numeric))
 m1
 #      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 #[1,]    1    0    0    0    1    0    1    0
 #[2,]    1    0    0    1    0    1    0    0
 #[3,]    0    0    1    0    1    1    0    1
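
Another base R route, not part of the original answer, avoids splitting strings altogether by converting characters to their code points. The sketch below assumes every line has the same length and contains only the characters 0 and 1; the temporary file is a stand-in for inpfile.txt.

```r
# Write the three example rows to a temporary file (stand-in for inpfile.txt).
path <- tempfile(fileext = ".txt")
writeLines(c("10001010", "10010100", "00101101"), path)

lines <- readLines(path)

# Concatenate all rows, map each character to its code point, and
# subtract 48 (the code point of "0") to obtain 0/1 integers.
bits <- utf8ToInt(paste(lines, collapse = "")) - 48L

# Refill row by row; this assumes every line has the same width.
m2 <- matrix(bits, nrow = length(lines), byrow = TRUE)
m2
```

Because the result is built from a single integer vector, no intermediate list of per-row vectors is created, which is where the strsplit version spends much of its time.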

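A slightly faster version reads the file with fread and then splits each row with tstrsplit. Note that tstrsplit returns character columns unless type.convert=TRUE is passed: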

library(data.table)
fread("inpfile.txt", colClasses="character")[, tstrsplit(V1, "")]
#    V1 V2 V3 V4 V5 V6 V7 V8
#1:  1  0  0  0  1  0  1  0
#2:  1  0  0  1  0  1  0  0
#3:  0  0  1  0  1  1  0  1

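I would also consider changing the delimiter by using awk to insert a space between every character (if the OP is on Linux) and then reading the result with fread. (I could not test this because I am on a Windows system.)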
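
The same idea can be tried without awk by inserting the spaces inside R. This is only a sketch of the delimiter-insertion approach, not the untested awk plus fread pipeline itself; it uses base R's gsub and scan, and the temporary file stands in for inpfile.txt.

```r
# Write the three example rows to a temporary file (stand-in for inpfile.txt).
path <- tempfile(fileext = ".txt")
writeLines(c("10001010", "10010100", "00101101"), path)

lines <- readLines(path)

# Append a space to every 0/1 character: "10001010" becomes "1 0 0 0 1 0 1 0 ".
spaced <- gsub("([01])", "\\1 ", lines)

# scan() splits on whitespace and parses each token as an integer.
values <- scan(text = spaced, what = integer(), quiet = TRUE)

# Refill row by row; this assumes every line has the same width.
m_spaced <- matrix(values, nrow = length(lines), byrow = TRUE)
m_spaced
```

For a 100 MB file the spaced copy roughly doubles the text held in memory, so this is more of an illustration than a recommendation.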

A faster option may be library(iotools):

n <- nchar(scan("inpfile.txt", what="", n=1))  # number of columns, taken from the first line
library(iotools)
input.file("inpfile.txt", formatter=dstrfw, 
           col_types=rep("integer",n), widths=rep(1,n))
#  V1 V2 V3 V4 V5 V6 V7 V8
#1  1  0  0  0  1  0  1  0
#2  1  0  0  1  0  1  0  0
#3  0  0  1  0  1  1  0  1

Benchmarks

On a slightly larger dataset, the timings for readr, iotools, and LaF are as follows.

n <-100000
cat(gsub("([[:alnum:]]{8})", "\\1\n", paste(sample(0:1, 
                n*8, TRUE), collapse="")), 
              file="dat2.txt")
library(readr)
tic <- Sys.time()
read_fwf("dat2.txt", fwf_widths(rep(1, 8)))
difftime(Sys.time(), tic)
#Time difference of 1.142145 secs

tic <- Sys.time()
input.file("dat2.txt", formatter=dstrfw, 
  col_types=rep("integer",8), widths=rep(1,8))
difftime(Sys.time(), tic)
#Time difference of 0.7440939 secs

library(LaF)
tic <- Sys.time()
laf <- laf_open_fwf("dat2.txt", column_widths = rep(1, 
    8),  column_types=rep("integer", 8))
## further processing (larger in memory)
dat <- laf[,]
difftime(Sys.time(), tic)
#Time difference of 0.1285172 secs

By far the most efficient is library(LaF), posted by @Tyler Rinker, followed by library(iotools).
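
The Sys.time() pattern used in these benchmarks can also be wrapped in a small helper built on system.time(). The sketch below is an assumption-laden illustration: time_reader, read_strsplit, and read_codepoints are hypothetical names, only base R readers are timed so that it runs without extra packages, and the file is much smaller than the one in the benchmark above.

```r
set.seed(1)
n <- 1000L

# Random 0/1 matrix written as n lines of 8 undelimited characters.
bits <- matrix(sample(0:1, n * 8L, replace = TRUE), nrow = n)
path <- tempfile(fileext = ".txt")
writeLines(apply(bits, 1, paste, collapse = ""), path)

# Hypothetical helper: elapsed seconds for one call of reader(path).
time_reader <- function(reader) {
  system.time(reader(path))[["elapsed"]]
}

# Reader 1: split every line into characters and bind the rows.
read_strsplit <- function(p) {
  do.call(rbind, lapply(strsplit(readLines(p), ""), as.integer))
}

# Reader 2: convert characters to code points and subtract 48 ("0").
read_codepoints <- function(p) {
  lines <- readLines(p)
  matrix(utf8ToInt(paste(lines, collapse = "")) - 48L,
         nrow = length(lines), byrow = TRUE)
}

timings <- c(strsplit = time_reader(read_strsplit),
             codepoints = time_reader(read_codepoints))
timings
```

A single run per reader is noisy; repeating each call several times and comparing medians gives a fairer picture.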

Answer 2

readr's fixed-width file reader can be very fast on large files:

library(readr)
read_fwf("dat.txt", fwf_widths(rep(1, 8)))

##      X1    X2    X3    X4    X5    X6    X7    X8
##   (int) (int) (int) (int) (int) (int) (int) (int)
## 1     1     0     0     0     1     0     1     0
## 2     1     0     0     1     0     1     0     0
## 3     0     0     1     0     1     1     0     1
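
If readr is not available, the same fixed-width idea can be illustrated with base R's read.fwf. This is a sketch only: read.fwf is generally much slower than read_fwf, the column count is inferred from the first line rather than hard-coded, and the temporary file stands in for dat.txt.

```r
# Write the three example rows to a temporary file (stand-in for dat.txt).
path <- tempfile(fileext = ".txt")
writeLines(c("10001010", "10010100", "00101101"), path)

# Infer the number of columns from the width of the first line.
n_cols <- nchar(readLines(path, n = 1))

# One width-1 field per bit; colClasses is passed on to read.table().
df <- read.fwf(path, widths = rep(1, n_cols), colClasses = "integer")
df
```

Inferring the width this way assumes that every line in the file has the same length as the first one.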

I wanted to scale this up and time it. In the run below, readr takes ~7.5 seconds to read a file comparable in size to the one you describe.

n <-10000000
cat(gsub("([[:alnum:]]{8})", "\\1\n", paste(sample(0:1, n*8, TRUE), collapse="")), file="dat2.txt")

file.size('dat2.txt')  #100000000

tic <- Sys.time()
read_fwf("dat2.txt", fwf_widths(rep(1, 8)))
difftime(Sys.time(), tic)
## Time difference of 7.41096 secs

You may also want to consider the LaF package for reading large fixed-width files. Something like:

library(LaF)
cols <- 8
laf <- laf_open_fwf("dat2.txt", column_widths = rep(1, cols), 
  column_types=rep("integer", cols))
## further processing (larger in memory)
dat <- laf[,]
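
These readers return a data frame, while the question shows a matrix with row labels r1, r2, and so on. A minimal base R sketch of that last step follows; dat_example is a hypothetical stand-in for the object returned by laf[,] or read_fwf, built by hand so the snippet runs on its own.

```r
# Hypothetical stand-in for the data frame returned by a fixed-width reader.
example_bits <- c(1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L,
                  1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L,
                  0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L)
dat_example <- as.data.frame(matrix(example_bits, nrow = 3, byrow = TRUE))

# All columns are integer, so as.matrix() yields an integer matrix.
mat <- as.matrix(dat_example)

# Label the rows r1, r2, ... as in the expected output of the question.
rownames(mat) <- paste0("r", seq_len(nrow(mat)))
mat
```

For a 100 MB input the conversion creates a second copy of the data, so it may be worth removing the data frame afterwards with rm().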

Answer 1 (4, accepted): 2016-01-17 04:06:01
Answer 2 (4): 2016-01-17 04:16:07
