Best way to read in a large amount of data

Question

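I am just wondering what the best way is to read large data files into MATLAB. I am currently reading large .txt files in as tables and combining them based on their date. The problem I am running into is that MATLAB runs out of memory, and I am not sure of the best way to get around this.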

The files I am reading have a structured header and look like this:

Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient
2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;
2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;
2021-03-04T19:58:47.131;536601230983586560;-332214;-316982;-322224;-641979;
2021-03-04T19:58:47.139;536601230991034368;-332200;-316973;-322191;-641939;
2021-03-04T19:58:47.146;536601230998482176;-332160;-316958;-322216;-641963;

Answer 1

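Well, first of all, loading the entire file content into memory is indeed a bad idea, especially if the file is very large (its content may not even fit in the available memory at all). People do this to limit disk access, or because processing a file chunk by chunk is more complex to code than fetching everything first and then processing it, which is fine as long as the file has a reasonable size.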

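Another question is whether, regardless of whether the file content is read all at once or in chunks, all the values in the file also need to be kept in memory as individual values. If they do, memory will run out either way. If you can instead "forget" data, or reload parts of it only when needed, the code becomes more complex but large files can be handled.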

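Let's assume that in your case the file content is a real-time acquisition of sensor values and that you only need to average them to reduce the memory footprint. You can use fopen and fgetl to get the content line by line. Note that although fgetl works line by line, the operating system maintains a buffer in memory for you, so there is no disk access for every single line.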
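As a small illustration of the idea, here is a minimal sketch (not the full solution) that streams a file with fopen and fgetl and keeps only a running sum per channel, so memory use does not grow with the file size. It assumes the semicolon-separated layout shown in the question; the temporary file and the names tmpFile, runningSum, lineCount and channelMeans are made up for this example.

```matlab
% Write a tiny sample file (header plus the first two rows from the question)
tmpFile = [tempname, '.txt'];
fid = fopen(tmpFile, 'wt');
fprintf(fid, 'Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient\n');
fprintf(fid, '2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;\n');
fprintf(fid, '2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;\n');
fclose(fid);

% Lines of interest: ignored text; timestamp; four signed integers; trailing ';'
pattern = '^[^;]*;([0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);\s*$';

fid = fopen(tmpFile, 'rt');
if (fid < 0), error('Failed to open file: %s', tmpFile); end

runningSum = zeros(1, 4); % one accumulator per channel, nothing else is kept
lineCount = 0;
while (true)
    tline = fgetl(fid);
    if (~ischar(tline)), break; end       % fgetl returns -1 at end of file
    tokens = regexp(tline, pattern, 'tokens');
    if (~isscalar(tokens)), continue; end % header or malformed line, skip it
    runningSum = runningSum + str2double(tokens{1}(2:5));
    lineCount = lineCount + 1;
end
fclose(fid);
delete(tmpFile);

channelMeans = runningSum / lineCount; % 1x4: channel0, channel1, channel2, ambient
```

Only one line of text and four accumulators live in memory at any time, which is the property the complete example relies on as well.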

Here is a complete example:

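I am using a regular expression to describe the kind of line I am looking for in the file.
In the preallocateData subfunction:
- I move to the beginning of the file
- I read line by line until I find the first interesting line
- Then I use fseek to jump quickly to almost the end of the file
- I read line by line until I find the last interesting line
- Based on the first and last timestamps read and the averaging duration I want, I can determine how big my matrices will end up being and preallocate them to optimize speed.
In the doAveraging subfunction:
- I move to the beginning of the file
- I read line by line and accumulate values until the timestamp difference is greater than the averaging duration I chose
- I store the average in the dataset I preallocated
- I trim the unused part of the preallocated data if required
Notes:
- The code should work even if the sensor stops sending values for a while (i.e. there are large gaps in the timestamps)
- The code can average data either in fixed time slots counted from the origin, or over a period that starts when data is received after the last recorded block (see fixBinAveragingMode)
- The final list of timestamps may not be evenly spaced (especially if there are gaps in the original timestamps or if you do not use fixBinAveragingMode). If it is more practical, you can add a call to interp1 at the end of the main function to remap the averaged sensor values onto an evenly spaced time scale.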

%
% PURPOSE:
%
%   Averages real-time sensor values.
%
% SYNTAX:
%
%   [timeNs, data] = AveragingSensorData(filename, avgDurationNs);
%
% INPUTS:
%
%   - 'filename': Text file containing real-time sensor data
%   - 'avgDurationNs': Averaging duration (in nanoseconds)
%
% OUTPUT:
%
%   - 'timeNs': (1 x ncount) vector representing time (in nanoseconds)
%   - 'data': (ncount x 4) representing averaged sensor values
%             * First column is channel0
%             * Second column is channel1
%             * Third column is channel2
%             * Fourth column is ambient
%

%% ---
function [timeNs, data] = AveragingSensorData(filename, avgDurationNs)
%[
    if (nargin < 2), avgDurationNs = 1e9; end % Default averaging duration: 1 second
    if (nargin < 1), filename = '.\data.txt'; end

    % Regular expression pattern describing the type of line we are looking for in the file.
    % It means:
    %   - start of line
    %   - anything except ';' (i.e. phone timestamp, ignored)
    %   - ';'
    %   - one or more digits (i.e. sensor timestamp)
    %   - ';'
    %   - one or more digits, optionally prefixed with '+' or '-' (i.e. channel0)
    %   - ... etc ...
    pattern = '^[^;]*;([0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);\s*$';

    % Minimal check of the averaging duration
    if (~(isnumeric(avgDurationNs) && isscalar(avgDurationNs) && isreal(avgDurationNs) && (avgDurationNs > 0)))
        error('avgDurationNs must be a real, positive, numeric scalar.');
    end

    % First, try to open the file so we can read it line by line later
    [fid, err] = fopen(filename, 'rt');
    if (fid <= 0), error('Failed to open file: %s', err); end
    cuo = onCleanup(@()fclose(fid)); % This closes the file in all cases (normal termination, exception, or even ctrl+c)

    % Based on the first and last timestamps in the file and the averaging
    % duration, estimate the final size of the data and preallocate it.
    % NB: We can exit early in the easy cases where there are 0 or 1 data lines
    [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs);
    if (canQuickExit), return; end

    % Do the actual averaging
    fixBinAveragingMode = false; % Average in fixed time bins from the origin (true) or in bins starting when data is received (false)?
    [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data);
%]
end

%% ---
function [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for first and last interesting lines in the file
    % NB: This assumes timestamps are sorted in increasing order
    nothingYet = true;
    fastAndFurious = true;
    firstReadTokens = []; lastReadTokens = [];
    while(true)

        % Read line-by-line until finding something interesting or eof
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        if (nothingYet) % It is the first time we found some interesting line
            nothingYet = false;
            firstReadTokens = tokens;
            lastReadTokens = tokens;            
            if (fastAndFurious)
                % Ok, don't bother reading each line, move almost to the
                % end of file directly. NB: This can be risky if there are
                % many empty lines at the end of the file, or if the lines
                % are not all of the same length
                fseek(fid, -3 * numel(tline), 'eof');                
            end            
        else % This is not the first time
            lastReadTokens = tokens;
        end

    end

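    % Conversion of matched token timestamps
    % NB: str2double returns a double, which cannot hold 18-digit nanosecond
    % timestamps exactly (resolution is 64 ns near 5.4e17). This is negligible
    % compared to the sample spacing in the example data (about 7 ms).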
    firstReadTimestamp = []; lastReadTimestamp = [];
    if (~isempty(firstReadTokens)), firstReadTimestamp = str2double(firstReadTokens{1}{1}); end
    if (~isempty(lastReadTokens)), lastReadTimestamp = str2double(lastReadTokens{1}{1}); end     

    % Compute preallocation
    if (isempty(firstReadTimestamp))
        % Easy, not a single line of data in the whole file
        timeNs = zeros(1, 0);
        data = zeros(0, 4);        
        canQuickExit = true;
    elseif (isempty(lastReadTimestamp) || (abs(lastReadTimestamp - firstReadTimestamp) < 0.1))
        % Easy again, just one line of data in the whole file
        timeNs = zeros(1, 1);
        data = [str2double(firstReadTokens{1}{2}), str2double(firstReadTokens{1}{3}), str2double(firstReadTokens{1}{4}), str2double(firstReadTokens{1}{5})];
        canQuickExit = true;
    else
        % Ok, lets allocate
        estimateBlockCount = ceil((lastReadTimestamp - firstReadTimestamp) / avgDurationNs);
        timeNs = zeros(1, estimateBlockCount);
        data = zeros(estimateBlockCount, 4);
        canQuickExit = false;
    end
%]
end

%% ---
function [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for interesting lines till the end
    % NB: We assume timestamps are sorted in increasing order in the file
    idx = 0;
    nothingYet = true;    
    while(true)

        % Read line-by-line until finding something interesting
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        lastReadTimestamp = str2double(tokens{1}{1});
        if (nothingYet) 
            nothingYet = false;

            idx = idx+1;
            avgCount = 1;
            timeNs(idx) = lastReadTimestamp;
            nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            avg = [str2double(tokens{1}{2}), str2double(tokens{1}{3}), str2double(tokens{1}{4}), str2double(tokens{1}{5})];
        elseif (lastReadTimestamp > nextStopTimestamp)    
            data(idx, :) = avg / avgCount;

            idx = idx+1;
            avgCount = 1;
            if (fixBinAveragingMode)
                % Fixed time slots from origin
                offset = mod(lastReadTimestamp - timeNs(1), avgDurationNs);
                timeNs(idx) = (lastReadTimestamp - offset);
                nextStopTimestamp = timeNs(idx) + avgDurationNs;
            else
                % Run timer for averaging immediately after receiving data
                timeNs(idx) = lastReadTimestamp;
                nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            end            
            avg = [str2double(tokens{1}{2}), str2double(tokens{1}{3}), str2double(tokens{1}{4}), str2double(tokens{1}{5})];

        else
            avgCount = avgCount + 1;
            avg = avg + [str2double(tokens{1}{2}), str2double(tokens{1}{3}), str2double(tokens{1}{4}), str2double(tokens{1}{5})];
        end
    end
    if (~nothingYet)
        timeNs = timeNs - timeNs(1); 
        data(idx, :) = avg / avgCount;
    end

    % Trim unused preallocated data if required
    timeNs((idx+1):end) = [];
    data((idx+1):end, :) = [];           
%]
end

The code is also stored on GitHub: Averaging sensor values collected in a very large file
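
Regarding the note about interp1, remapping the averaged values onto an evenly spaced time scale could look like the sketch below. This is only an illustration: the timeNs and data values are made up (they stand in for the output of AveragingSensorData), and stepNs, uniformTimeNs and uniformData are hypothetical names.

```matlab
% Hypothetical averaged output with an uneven time axis (values made up)
timeNs = [0, 1.0e9, 2.1e9, 5.0e9];   % 1 x ncount, in nanoseconds
data = [10 20 30 40; ...
        12 22 32 42; ...
        14 24 34 44; ...
        20 30 40 50];                % ncount x 4 (channel0, channel1, channel2, ambient)

% Evenly spaced time scale covering the same range
stepNs = 1e9;
uniformTimeNs = timeNs(1):stepNs:timeNs(end);

% interp1 interpolates each column of 'data' independently
uniformData = interp1(timeNs, data, uniformTimeNs, 'linear');
```

Keep in mind that linear interpolation fills large timestamp gaps with values the sensor never produced, so whether this remapping is appropriate depends on what you do with the data afterwards.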
