
Breaking a binary file into smaller parts in R

I have several large files that I would like to convert to binary. Once a file is in binary form, I want to break it into pieces smaller than 5 GB, with each piece (however many there turn out to be) existing as an object in R.

I'm not exactly sure where to begin, but roughly what I have in mind is this pseudocode:

file <- ***FILE PATH***

binFile <- writeBin(file,con)

# Loop through 'binFile'; each time the accumulated piece reaches 5,000,000 bytes,
# write it to a list, then continue with the rest until the whole file is processed.

# Then each item in the list can be accessed.

If it is easier to write them to disk as smaller binary files, that would also work.

You can split a large file into several chunks without loading the whole file into memory. Here's a function that will do that.

You supply it with the path to your big file, the path to the directory in which you would like the chunks to be saved, and the maximum chunk size.

The chunks will all be saved under the name of the big file plus the chunk number, with the extension .bin.

The original file is left unchanged.

If you want to read the chunks back into R as binary data, you can simply read them with readBin().
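For instance, a minimal sketch of reading a chunk back as a raw vector (using a temporary file as a stand-in for a real chunk file):

```r
# Stand-in for a real chunk file: write 100 bytes to a temp file
chunk_path <- tempfile(fileext = ".bin")
writeBin(as.raw(1:100), chunk_path)

# readBin() reads at most `n` bytes, so passing file.size()
# guarantees the whole chunk is read in one call
chunk <- readBin(chunk_path, "raw", n = file.size(chunk_path))
length(chunk)
#> [1] 100
```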

chop_file <- function(bigfile, save_path, chunk_size)
{
  con <- file(bigfile, "rb")
  pos <- 0
  file_size <- file.size(bigfile)
  chunk_no <- 1

  # Take the file name without its path and strip the dots,
  # so e.g. "bigfile.pdf" becomes "bigfilepdf"
  filename <- gsub("[.]", "", basename(bigfile))

  while(pos < file_size)
  {
    seek(con, pos)                           # jump to the start of the next chunk
    data <- readBin(con, "raw", chunk_size)  # read at most chunk_size bytes
    pos  <- seek(con, 0)                     # seek() returns the position *before* moving
    writeBin(data, paste0(save_path, filename, chunk_no, ".bin"))
    chunk_no <- chunk_no + 1
  }

  close(con)
  message(paste("  File", bigfile, "split into", chunk_no - 1, "chunks"))
}

For example, if I have a single large binary file in the following directory:

dir("C:/Users/Me/pdfs/")
#> [1] "bigfile.pdf"

And I wish to chunk it into 1 MB pieces in the empty directory C:/Users/Me/chunks/, I do:

chop_file("C:/Users/Me/pdfs/bigfile.pdf", "C:/Users/Me/chunks/", 1e6)
#>  File C:/Users/Me/pdfs/bigfile.pdf split into 10 chunks

And now

dir("C:/Users/Me/chunks/")
#> [1] "bigfilepdf1.bin"  "bigfilepdf10.bin" "bigfilepdf2.bin"  "bigfilepdf3.bin" 
#> [5] "bigfilepdf4.bin"  "bigfilepdf5.bin"  "bigfilepdf6.bin"  "bigfilepdf7.bin" 
#> [9] "bigfilepdf8.bin"  "bigfilepdf9.bin"

If you wanted to stitch all these back together again in memory, you could do this. Note that dir() sorts names alphabetically, so "bigfilepdf10.bin" would come before "bigfilepdf2.bin"; the files need to be put into numeric order first, and read with their full paths:

data  <- list()
files <- dir("C:/Users/Me/chunks/", full.names = TRUE)
files <- files[order(as.numeric(gsub("\\D", "", basename(files))))]
for(i in seq_along(files)) data[[i]] <- readBin(files[i], "raw", file.size(files[i]))
data <- do.call("c", data)

Then data will contain all the bytes of the original file as a raw vector.

identical(data, readBin("C:/Users/Me/pdfs/bigfile.pdf", "raw", 1e7))
#> [1] TRUE

If you would rather process the chunks in memory instead of writing them to disk, you can simplify things a bit:

chop_file_to_list <- function(bigfile, chunk_size)
{
  con <- file(bigfile, "rb")
  pos <- 0
  file_size <- file.size(bigfile)
  chunk_no <- 1
  data <- list()

  while(pos < file_size)
  {
    seek(con, pos)                           # jump to the start of the next chunk
    data[[chunk_no]] <- readBin(con, "raw", chunk_size)
    pos  <- seek(con, 0)                     # seek() returns the position *before* moving
    chunk_no <- chunk_no + 1
  }

  close(con)
  return(data)
}
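As a quick check, here is a self-contained demo (the function definition is repeated so the snippet runs on its own): it chops 95 bytes into 10-byte pieces and verifies they reassemble to the original exactly.

```r
# (chop_file_to_list() repeated from above so this snippet is standalone)
chop_file_to_list <- function(bigfile, chunk_size)
{
  con <- file(bigfile, "rb")
  pos <- 0
  file_size <- file.size(bigfile)
  chunk_no <- 1
  data <- list()

  while(pos < file_size)
  {
    seek(con, pos)                           # jump to the start of the next chunk
    data[[chunk_no]] <- readBin(con, "raw", chunk_size)
    pos  <- seek(con, 0)                     # seek() returns the position *before* moving
    chunk_no <- chunk_no + 1
  }

  close(con)
  return(data)
}

# Write 95 bytes to a temporary file and chop it into 10-byte pieces
demo_file <- tempfile()
writeBin(as.raw(0:94), demo_file)
chunks <- chop_file_to_list(demo_file, 10)

length(chunks)   # 10 chunks: nine of 10 bytes, one of 5
identical(do.call(c, chunks), as.raw(0:94))
#> [1] TRUE
```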
