Optimizing reading the data in Matlab

Question

I have a large data file containing text formatted as a single column with n rows. Each row is either a real number or the string No Data . I have imported this text as an nx1 cell array named Data . Now I want to filter the data and create an nx1 array from it, with NaN values in place of No Data . I have managed to do it using a simple loop (see below), but it is quite slow.

z = zeros(n,1);
for i = 1:n
    if Data{i}(1) ~= 'N'
        z(i) = str2double(Data{i});
    else
        z(i) = NaN;
    end
end

Is there a way to optimize it?

Answer 1

Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc...):

data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');

Here is the content of the text file I used as a template for my test:

9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data

The result can be retrieved as a vector of doubles using data.Var1 . The parameters work as follows:

Delimiter : specified as a line break since you are working with a single column. This prevents No data from producing two columns because of the whitespace.
Format : you want numerical values.
TreatAsEmpty : this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
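
As a minimal sketch of the call end to end, the snippet below writes a three-line sample file, reads it with the same options, and builds a mask of the missing rows. The scratch file name and the variable names z and missing are assumptions made for illustration only.

```matlab
% Sketch: write a small sample file, read it back, and inspect the result.
filename = [tempname '.txt'];      % hypothetical scratch file
fid = fopen(filename, 'wt');
fprintf(fid, '9.343410\nNo data\n-135.210\n');
fclose(fid);

% Same options as in the one-liner above.
data = readtable(filename, 'Delimiter', '\n', 'Format', '%f', ...
    'ReadVariableNames', false, 'TreatAsEmpty', 'No data');

z = data.Var1;                     % column vector of doubles
missing = isnan(z);                % true where the file said "No data"
fprintf('%d of %d rows are missing\n', nnz(missing), numel(z));
delete(filename);
```

Note that the TreatAsEmpty string has to match the marker used in the file ( No data here, No Data in the question).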

Answer 2

If you run this you can find out which approach is faster. It creates an 11MB text file and reads it with the various approaches.

filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val<0.01
        fwrite(fid,sprintf('%s\n','No Data'));
    else
        fwrite(fid,sprintf('%f\n',val*1000));
    end
end
fclose(fid);

%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc

%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc

%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char'); 
z=z{1};
fclose(fid);
toc

Result:

Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.

txt2mat is slow: even without converting the resulting char matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some of the internal sanity checks and does not convert the resulting data to a table.
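
A single tic/toc run is sensitive to disk caching and warm-up, so differences this small are hard to interpret. The sketch below repeats the textscan timing several times and reports the median; the scratch file, the number of runs and the variable names are assumptions, not part of the measurements above.

```matlab
% Sketch: repeat the timing several times and keep the median.
filename = [tempname '.txt'];          % hypothetical scratch file
fid = fopen(filename, 'wt');
fprintf(fid, '%f\n', rand(500, 1) * 1000);
fprintf(fid, 'No Data\n');             % one missing value in the middle
fprintf(fid, '%f\n', rand(500, 1) * 1000);
fclose(fid);

nRuns = 5;                             % assumed number of repetitions
t = zeros(nRuns, 1);
for k = 1:nRuns
    tic;
    fid = fopen(filename, 'rt');
    c = textscan(fid, '%f', 'Delimiter', '\n', 'whitespace', ' ', ...
        'TreatAsEmpty', 'No Data', 'EndOfLine', '\n');
    fclose(fid);
    t(k) = toc;
end
z = c{1};
fprintf('textscan: median %.6f s over %d runs\n', median(t), nRuns);
delete(filename);
```

The same loop can be wrapped around the readtable and txt2mat variants to compare them on equal terms.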

Answer 3

Depending on how big your files are and how often you read them, you might want to go beyond readtable, which can be quite slow.

EDIT: After testing, with a file this simple the method below provides no advantage. The method was developed to read RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.

The most efficient way I've found is to read the whole file as a char matrix; then you can easily find your "No data" lines. And if your real numbers are formatted with a fixed width, you can convert them from char to numbers much more efficiently than with str2double or similar functions.

The function I wrote to read a text file into a char matrix is:

function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
%   Read all the content of a text file to a matrix as wide as the longest
%   line on the file. Shorter lines are padded with blank spaces. New lines
%   are not included in the output.
%   New lines are identified by new line \n characters.

    % Reading the whole file in a string
    fid=fopen(filename,'r');
    fileData = char(fread(fid));
    fclose(fid);
    % Finding new lines positions
    newLines= fileData==sprintf('\n');
    linesEndPos=find(newLines)-1;

    % Calculating number of lines
    nLines=length(linesEndPos);
    % Calculating the width (number of characters) of each line
    linesWidth=diff([-1; linesEndPos])-1;
    % Number of characters per row including new lines
    charsPerRow=max(linesWidth)+1;

    % Initializing output var with blank spaces
    txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');

    % Computing a logical index to all characters of the input string to
    % their final positions
    charIdx=false(charsPerRow,nLines);
    % Indexes of all new lines
    linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
    charIdx(linearInd)=true;
    charIdx=cumsum(charIdx)==0;

    % Filling output matrix
    txtMat(charIdx)=fileData(~newLines);
    % Cropping the last row, which corresponds to the newline characters, and transposing
    txtMat=txtMat(1:end-1,:)';
end

Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:

NoData=txtMat(:,1)=='N';

And if your number fields have a fixed width, you can convert them to numbers far more efficiently than with str2num, using something like

values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);

Here I've assumed that each number has 7 integer digits, a decimal point and two decimal places (10 characters in total, with the decimal point column weighted by 0), but you can easily adapt it to your case.

And to finish you need to set the NaN values with:

values(NoData)=NaN;
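
To make the steps above concrete, here is a small worked sketch on a hand-written char matrix. It assumes every numeric row is zero-padded to exactly 10 characters and is non-negative; a minus sign or space padding would not map to a digit and would give wrong values. The variable weights is introduced only for illustration.

```matlab
% Worked example: four fixed-width rows, 7 integer digits + '.' + 2 decimals.
txtMat = ['0001234.56'; ...
          'No Data   '; ...
          '0000000.07'; ...
          '1234567.89'];

% Rows holding the "No Data" marker start with 'N'.
NoData = txtMat(:,1) == 'N';

% Place value of each column; the decimal point column gets weight 0.
weights = [10.^(6:-1:0), 0, 10.^(-1:-1:-2)]';

% Subtracting '0' maps the characters '0'..'9' to the numbers 0..9.
values = (txtMat(:,1:10) - '0') * weights;

% The "No Data" rows produced meaningless numbers; overwrite them.
values(NoData) = NaN;
```

Building the weights from powers of ten makes it easier to adapt the column layout than typing the vector by hand.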

This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, this is WAY faster. And if you don't have fixed-width numbers, you can still do it this way by adding a couple of lines to count the number of digits and find the position of the decimal point before doing the conversion, but that will slow things down a little. However, I think it will still be faster.
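
One possible way to handle numbers without a fixed width is sketched below. It is only an illustration under stated assumptions: rows are left-aligned and padded with spaces, every number is non-negative and contains a decimal point, and no timing claim is made for it. The names isDigit, dotCol and expo are hypothetical.

```matlab
% Sketch: per-row decimal point position instead of a fixed weight vector.
txtMat = ['12.5   '; ...
          '3.14159'; ...
          'No Data'; ...
          '1000.0 '];

NoData  = txtMat(:,1) == 'N';
isDigit = txtMat >= '0' & txtMat <= '9';

% Column of the decimal point in each row (1 for rows without one).
[~, dotCol] = max(double(txtMat == '.'), [], 2);

% Exponent of 10 for each character: dotCol - j - 1 to the left of the
% point, dotCol - j to the right of it.
cols = 1:size(txtMat, 2);
expo = bsxfun(@minus, dotCol, cols);
expo(expo > 0) = expo(expo > 0) - 1;

% Digit values, with every non-digit character contributing 0.
digits = double(txtMat) - '0';
digits(~isDigit) = 0;

values = sum(digits .* 10.^expo, 2);
values(NoData) = NaN;
```

Integers written without a decimal point would need an extra step, since the sketch relies on the point to locate the units digit.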


solution1
2 ACCEPTED 2018-02-22 20:45:06

solution2
0 2018-02-22 15:52:27

solution3
0 2018-02-23 03:23:23
