
Matlab: Speed-up reading of ascii file

I wrote this piece of code, which works fine but is way too slow for my purposes:

%%% load nodal data %%% 
path = sprintf('%sfile.dat',directory);
fid = fopen(path);

num_nodes = textscan(fid,'%s %s %s %s %d',1,'delimiter', ' ');
num_nodes = num_nodes{5};
header = textscan(fid,'%s',7,'delimiter', '\t');

k = 0;
while ~feof(fid)

    line        = fgetl(fid);
    [head,rem]  = strtok(line,[' ',char(9)]); 

    if head == '#'
        k = k+1;
        j = 1;
        time_steps(k)  = sscanf(rem, ' Output at t = %f');   % %f, since t can be fractional (e.g. 36000.788)
    end

    if ~isempty(head)
        if head ~= '#'
            data(j,:,k)  = str2num([head rem]); 
            j = j+1;
        end
    end

end
fclose(fid);

nodal_data = struct('header',header,'num_nodes',num_nodes,'time_steps',time_steps,'data',data);

The ASCII file I am reading into Matlab looks something like this:

# Number of Nodes: 120453
#X                  Y                   Z                   depth               vel_x               vel_y               wse             
# Output at t = 0
       76456.003              184726             3815.75                   0                   0                   0             3815.75
       76636.003              184726             3728.25                   0                   0                   0             3728.25
       76816.003              184726                3627                   0                   0                   0                3627
       76996.003              184726             3527.75                   0                   0                   0             3527.75
       77176.003              184726              3371.5                   0                   0                   0              3371.5
# Output at t = 36000.788
       76456.003              184726             3815.75                   0                   0                   0             3815.75
       76636.003              184726             3728.25                   0                   0                   0             3728.25
       76816.003              184726                3627                   0                   0                   0                3627
       76996.003              184726             3527.75                   0                   0                   0             3527.75
       77176.003              184726              3371.5                   0                   0                   0              3371.5

While the code I wrote works for very small files, it blows up on me for larger ASCII files. I already had to abort loading a ~25 MB ASCII file (approximately 240k lines), which was just a test file. Later versions of the file will be ~500 MB. Is there a way of speeding up the process of loading the file?

I am not happy with the three if-statements, but I did not know how to separate '#' from numbers with a switch on head, especially because I was not able to distinguish head by class: I tried checking with ischar or isnumeric, but as the variable head is read as a string, ischar is always true and isnumeric never is. I am also not very happy with using a tokenizer at all just to be able to use the if-cases, and then putting the line back together with str2num([head rem]);, as this probably consumes a lot of time. However, I did not know how else to do it. So if you have any useful suggestions on how to adapt my code, I would highly appreciate them!

Have a good Sunday and thank you in advance!

The code below reads approx. 70000 time steps with 5 nodes per step in around 7 seconds. It does most of what your code does, and it should be easy enough to add the extra features of your code. There will be other ways of doing this faster, but hopefully this should be adequate.

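filename = 'd:\temp\input.txt';

filetext = fileread(filename);            % read the whole file into memory at once
headerLines = 2;                          % '# Number of Nodes' line and column-name line
valuesPerLine = 7;
expr = '[^\n]+';                          % one match per non-empty line
lines = regexp(filetext, expr, 'match');
isTimeStep = strncmp(lines, '#', 1);      % true for every line starting with '#', header lines included
numTimeSteps = sum(isTimeStep)-headerLines;
nodesPerStep = ((length(lines)-headerLines) / numTimeSteps ) - 1;
data = zeros(nodesPerStep, valuesPerLine, numTimeSteps);   % pre-allocated: node x value x time step

for timeStep = 1:numTimeSteps
    % first data line of this step: skip the headers, all earlier steps and this step's '# Output' line
    lineIndex = headerLines + (timeStep-1) * (nodesPerStep + 1) + 2;
    for node = 1:nodesPerStep
        data(node, :, timeStep ) = sscanf(lines{lineIndex},'%f');
        lineIndex = lineIndex + 1;
    end
end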

Just tried it on a 2 million line file (340000 time steps with 5 nodes per step) and it took approx. 36 seconds to run.
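One of the extra features left out of the code above is the list of time stamps. A minimal sketch of how they might be recovered from the lines already flagged by isTimeStep, assuming the '# Output at t = ...' wording of the sample file; the small lines cell array here is made-up stand-in data, not the real file:

```matlab
% Stand-in for the cell array produced by regexp(filetext, expr, 'match') (hypothetical data)
lines = {'# Number of Nodes: 5', ...
         '#X Y Z depth vel_x vel_y wse', ...
         '# Output at t = 0', ...
         '1 2 3 4 5 6 7', ...
         '# Output at t = 36000.788', ...
         '8 9 10 11 12 13 14'};
headerLines = 2;

isTimeStep = strncmp(lines, '#', 1);          % all '#' lines, header lines included
stepLines  = lines(isTimeStep);
stepLines  = stepLines(headerLines+1:end);    % drop the two header lines

% %f rather than %d, because t can be fractional
time_steps = cellfun(@(s) sscanf(s, '# Output at t = %f'), stepLines);
```

time_steps ends up as a row vector with one entry per time step, in file order.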

If you want a solution that doesn't have coded loops, you could replace the code starting at

data = zeros(....

with

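values = cellfun(@(x) sscanf(x,'%f'),lines(~isTimeStep),'uniformoutput',false);
% cell2mat gives valuesPerLine x (nodesPerStep*numTimeSteps); split the columns, then reorder to node x value x time step
data = permute(reshape(cell2mat(values), valuesPerLine, nodesPerStep, numTimeSteps), [2 1 3]);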

but it takes about 50% longer to run.
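As one possible "other way", here is a sketch that lets textscan skip the '#' lines itself through its 'CommentStyle' option and reads all numbers in a single call. This is not what the answer above measured, so no timing is claimed for it. The temporary file and its contents are made up so that the snippet runs on its own, and the layout (7 values per line, every step with the same number of nodes) is assumed from the sample in the question:

```matlab
% Write a tiny file in the same layout (hypothetical data)
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fprintf(fid, '# Number of Nodes: 2\n');
fprintf(fid, '#X Y Z depth vel_x vel_y wse\n');
fprintf(fid, '# Output at t = 0\n');
fprintf(fid, '1 2 3 4 5 6 7\n');
fprintf(fid, '8 9 10 11 12 13 14\n');
fprintf(fid, '# Output at t = 36000.788\n');
fprintf(fid, '15 16 17 18 19 20 21\n');
fprintf(fid, '22 23 24 25 26 27 28\n');
fclose(fid);

valuesPerLine = 7;

% Time stamps: one regexp over the whole text
filetext   = fileread(fname);
tok        = regexp(filetext, '# Output at t = ([^\r\n]+)', 'tokens');
time_steps = cellfun(@(c) str2double(c{1}), tok);

% Numeric block: textscan ignores everything from '#' to the end of that line
fid  = fopen(fname, 'r');
cols = textscan(fid, repmat('%f', 1, valuesPerLine), ...
                'CommentStyle', '#', 'CollectOutput', true);
fclose(fid);
flat = cols{1};                               % (nodes*steps) x valuesPerLine

numTimeSteps = numel(time_steps);
nodesPerStep = size(flat, 1) / numTimeSteps;

% flat.' is valuesPerLine x (nodes*steps); split the columns, then reorder to node x value x step
data = permute(reshape(flat.', valuesPerLine, nodesPerStep, numTimeSteps), [2 1 3]);

delete(fname);
```

With this layout, data(j,:,k) holds the seven values of node j at time step k, the same indexing the question's code uses.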

The first thing to do before you change anything is to PRE-ALLOCATE all output arrays:
Your code outputs time_steps and data, both growing inside the loop. This can kill your performance.

Assuming there are always five lines in each time step.

Add the following lines before the loop

max_steps  = ceil( double(num_nodes) / 5 );   % upper bound on the number of time steps, treating num_nodes as the total number of data lines
data       = NaN( 5, 7, max_steps );          % assuming 7 columns and 5 lines for each time step, indexed as data(j,:,k)
time_steps = NaN( 1, max_steps );

After the loop, just discard the remaining NaNs

data       = data(:, :, 1:k);                 % k is the number of time steps actually read
time_steps = time_steps(1:k);
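To get a feel for what pre-allocation buys, a small self-contained comparison can be run. This is only an illustrative sketch: the array size n and the row contents are arbitrary choices, and the measured times depend on the MATLAB version and machine, so none are quoted here.

```matlab
n = 2e4;                          % arbitrary number of rows for the experiment

% Variant 1: the array grows on every iteration
tic
grown = [];
for i = 1:n
    grown(i, :) = [i, 2*i, 3*i];
end
tGrown = toc;

% Variant 2: the array is allocated once, then filled in place
tic
pre = NaN(n, 3);
for i = 1:n
    pre(i, :) = [i, 2*i, 3*i];
end
tPre = toc;

fprintf('growing: %.3f s, pre-allocated: %.3f s\n', tGrown, tPre);
```

Both variants produce the same matrix; only the allocation strategy differs.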

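Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.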

 