
Read vector from file

I have a large vector of length 650 million. I wish to store this vector on disk (5 GB), then load the entire vector into memory so that various functions can quickly access its elements.

Here is my attempt to do this in Rcpp on a smaller scale. The following code simply causes my R session to crash, with no error messages. What am I doing wrong?

R code:

output_file = file(description="test.bin",open="a+b")
writeBin(runif(10), output_file,size=8)
close(output_file)

Rcpp code:

#include <Rcpp.h>
#include <fstream>
using namespace Rcpp;

std::vector<double> read_vector_from_file(std::string filename)
{
  std::vector<char> buffer{};
  std::ifstream ifs(filename, std::ios::in | std::ifstream::binary);
  std::istreambuf_iterator<char> iter(ifs);
  std::istreambuf_iterator<char> end{};
  std::copy(iter, end, std::back_inserter(buffer));
  std::vector<double> newVector(buffer.size() / sizeof(double));
  memcpy(&newVector[0], &buffer[0], buffer.size());
  return newVector;
}

std::vector<double> LT = read_vector_from_file("test.bin");

// [[Rcpp::export]]
double Rcpp_test() {
  return LT[3];
}

Over the years I have implemented something like the above a few times for quick and dirty data storage. These days I no longer recommend it, as we have fabulous packages such as fst and qs which do this better, with parallelisation, compression, and other bells and whistles.

But as you asked, an answer follows. I have found the C API for files to be simpler, and closer to what you do in R. So here we just open the file and read 10 items of size 8 (for double), as that is what we know you wrote. I have at times generalized that by first writing two int values, for an enum of types as well as the number of items.
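The header idea mentioned above could look roughly like the following sketch in plain C++ (no Rcpp). It reads the "two int values" as a type tag plus an element count, which is an interpretation. The `ElemType` enum and the names `write_doubles` and `read_doubles` are hypothetical, and a 32-bit count caps the vector at about 2.1 billion elements.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type tags; names and values are illustrative only.
enum class ElemType : std::int32_t { Int32 = 1, Float64 = 2 };

// Write a two-int header (type tag, element count), then the raw doubles.
void write_doubles(const std::string& path, const std::vector<double>& x) {
    FILE* out = std::fopen(path.c_str(), "wb");
    if (out == nullptr) throw std::runtime_error("cannot open " + path);
    const std::int32_t header[2] = {
        static_cast<std::int32_t>(ElemType::Float64),
        static_cast<std::int32_t>(x.size())  // assumes size fits in int32
    };
    const bool ok =
        std::fwrite(header, sizeof(std::int32_t), 2, out) == 2 &&
        std::fwrite(x.data(), sizeof(double), x.size(), out) == x.size();
    std::fclose(out);
    if (!ok) throw std::runtime_error("short write to " + path);
}

// Read the header back, check it, then read exactly `count` doubles.
std::vector<double> read_doubles(const std::string& path) {
    FILE* in = std::fopen(path.c_str(), "rb");
    if (in == nullptr) throw std::runtime_error("cannot open " + path);
    std::int32_t header[2] = {0, 0};
    if (std::fread(header, sizeof(std::int32_t), 2, in) != 2 ||
        header[0] != static_cast<std::int32_t>(ElemType::Float64) ||
        header[1] < 0) {
        std::fclose(in);
        throw std::runtime_error("bad header in " + path);
    }
    std::vector<double> x(static_cast<std::size_t>(header[1]));
    const std::size_t nr = std::fread(x.data(), sizeof(double), x.size(), in);
    std::fclose(in);
    if (nr != x.size()) throw std::runtime_error("bad payload in " + path);
    return x;
}
```

With such a header the reader no longer needs the element count passed in. The trade-off is that a file produced by a bare writeBin() call has no header, so the writer and the reader have to agree on this layout.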

Code

#include <Rcpp.h>
#include <cstdio>   // fopen, fread, fclose
using namespace Rcpp;

// Read `size` doubles from the binary file `filename` into a numeric vector.
// [[Rcpp::export]]
Rcpp::NumericVector Rcpp_test(std::string filename, size_t size) {
    Rcpp::NumericVector v(size);
    FILE *in = fopen(filename.c_str(), "rb");
    if (in == nullptr) Rcpp::stop("Cannot open file '%s'", filename);
    auto nr = fread(&v[0], sizeof(double), size, in);
    fclose(in);                       // close before any early exit
    if (nr != size) Rcpp::stop("Bad payload");
    Rcpp::Rcout << nr << std::endl;   // report the number of items read
    return v;
}

/*** R
set.seed(123)
rv <- runif(10)
filename <- "test.bin"
if (!file.exists(filename)) {
  output_file <- file(description="test.bin",open="a+b")
  writeBin(rv, output_file, size=8)
  close(output_file)
}
nv <- Rcpp_test(filename, 10)
data.frame(rv, nv)
all.equal(rv,nv)
*/

Output

The code generalizes the question's example slightly: it fixes the seed and compares the data written with the data read back.

> Rcpp::sourceCpp("answer.cpp")

> set.seed(123)

> rv <- runif(10)

> filename <- "test.bin"

> if (!file.exists(filename)) {
+   output_file <- file(description="test.bin",open="a+b")
+   writeBin(rv, output_file, size=8)
+   close(output_file .... [TRUNCATED] 

> nv <- Rcpp_test(filename, 10)
10

> data.frame(rv, nv)
          rv        nv
1  0.2875775 0.2875775
2  0.7883051 0.7883051
3  0.4089769 0.4089769
4  0.8830174 0.8830174
5  0.9404673 0.9404673
6  0.0455565 0.0455565
7  0.5281055 0.5281055
8  0.8924190 0.8924190
9  0.5514350 0.5514350
10 0.4566147 0.4566147

> all.equal(rv,nv)
[1] TRUE
> 
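For the full-size case in the question, the element count can also be derived from the file size instead of being passed in. The following is a minimal sketch in plain C++ (no Rcpp), assuming the file holds nothing but raw 8-byte doubles in the machine's native byte order, as a default writeBin(x, con, size=8) call on the same machine would produce. The function name `read_all_doubles` is hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read an entire file of raw 8-byte doubles into a vector.
std::vector<double> read_all_doubles(const std::string& path) {
    // Open positioned at the end so tellg() reports the size in bytes.
    std::ifstream ifs(path, std::ios::binary | std::ios::ate);
    if (!ifs) throw std::runtime_error("cannot open " + path);

    const std::streamoff bytes = ifs.tellg();
    if (bytes < 0 ||
        bytes % static_cast<std::streamoff>(sizeof(double)) != 0)
        throw std::runtime_error("unexpected size for " + path);

    std::vector<double> x(static_cast<std::size_t>(bytes) / sizeof(double));
    ifs.seekg(0, std::ios::beg);

    // Read straight into the vector's storage; no intermediate char buffer.
    ifs.read(reinterpret_cast<char*>(x.data()),
             static_cast<std::streamsize>(bytes));
    if (ifs.gcount() != static_cast<std::streamsize>(bytes))
        throw std::runtime_error("short read from " + path);
    return x;
}
```

Because a missing or unreadable file throws instead of returning an empty vector, a later access such as element 3 cannot silently run past the end of the data. Reading directly into the destination also avoids holding a second copy of a 5 GB payload in a char buffer.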
