Best way to read in large amounts of data
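I was just wondering what the best way is to read large data files into MATLAB. I am currently reading large .txt files in as tables and combining them according to their dates. The problem I have is that MATLAB runs out of memory, and I am not sure of the best way around this.
The files I am reading have a structured header and are formatted as follows: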
Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient
2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;
2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;
2021-03-04T19:58:47.131;536601230983586560;-332214;-316982;-322224;-641979;
2021-03-04T19:58:47.139;536601230991034368;-332200;-316973;-322191;-641939;
2021-03-04T19:58:47.146;536601230998482176;-332160;-316958;-322216;-641963;
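Well, first of all, loading the whole file content into memory is a bad idea, especially for a very large file (its content may not even fit in the available memory at all). Reading everything at once is normally done to limit disk accesses, or because processing the file chunk by chunk is complex to code, and it only works as long as the file has a reasonable size.
The other question is whether, regardless of the file content being read all at once or in chunks, every "value" in the file also has to be kept in memory as an individual "value". If they all have to be kept individually, memory will run out either way. If you can "forget" data, or reload only part of it when needed, the code becomes more complex but very large files can be handled.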
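Let's assume that in your case the file content is a real-time acquisition of sensor values and that you only need to average them to reduce the memory footprint. You can use fopen and fgetl to get the content line by line. Note that although fgetl works line by line, the operating system maintains a buffer in memory for you, so there is no disk access for each individual line.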
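As a minimal sketch of the fopen/fgetl idea, assuming the semicolon-separated layout shown in the question: the script below writes two of the sample lines to a temporary file (the file name is only a placeholder created with tempname), then streams it back while keeping nothing in memory except the current line and a 1-by-4 running sum. It uses strsplit rather than the regular expression of the full example, purely to keep the sketch short.

```matlab
% Write a tiny sample file so the sketch is self-contained (placeholder name)
sampleFile = [tempname, '.txt'];
fid = fopen(sampleFile, 'wt');
fprintf(fid, 'Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient\n');
fprintf(fid, '2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;\n');
fprintf(fid, '2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;\n');
fclose(fid);

% Stream the file: only the current line and the running sum are kept
fid = fopen(sampleFile, 'rt');
fgetl(fid);                         % skip the header line
runningSum = zeros(1, 4);           % channel 0, channel 1, channel 2, ambient
lineCount = 0;
tline = fgetl(fid);
while ischar(tline)                 % fgetl returns -1 (not char) at end of file
    parts = strsplit(tline, ';', 'CollapseDelimiters', false);
    if (numel(parts) >= 6)
        runningSum = runningSum + str2double(parts(3:6));
        lineCount = lineCount + 1;
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(sampleFile);

meanValues = runningSum / lineCount; % one mean per column
```

For these two lines, meanValues is [-332226, -317002.5, -322285.5, -641927]. The memory used does not depend on the file length, which is the point of reading line by line.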
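Here is a complete example:
In the preallocateData subfunction: the first and last data lines of the file are located, using fseek to move quickly to almost the end of the file, so that the size of the result can be estimated and preallocated.
In the doAveraging subfunction: values are averaged either in fixed time slots counted from the first timestamp (fixBinAveragingMode) or in windows that start with the first sample received after the previous window (~fixBinAveragingMode).
If more practical, you can add a call to interp1 at the end of the main function to remap the averaged sensor values onto a linearly spaced time scale.
%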
% PURPOSE:
%
% Averages real-time sensor values.
%
% SYNTAX:
%
% [timeNs, data] = AveragingSensorData(filename, avgDurationNs);
%
% INPUTS:
%
% - 'filename': Text file containing real-time sensor data
% - 'avgDurationNs': Averaging duration (in nanoseconds)
%
% OUTPUT:
%
% - 'timeNs': (1 x ncount) vector representing time (in nanoseconds)
% - 'data': (ncount x 4) representing averaged sensor values
% * First column is channel0
% * Second column is channel1
% * Third column is channel2
% * Fourth column is ambient
%
%% ---
function [timeNs, data] = AveragingSensorData(filename, avgDurationNs)
%[
    if (nargin < 2), avgDurationNs = 1e9; end % Default: 1 second
    if (nargin < 1), filename = '.\data.txt'; end

    % Regular expression pattern describing the type of line we are looking for in the file
    % Means:
    % - Start of line
    % - Anything except ';' (i.e. phone timestamp)
    % - ';'
    % - One or more digits (i.e. sensor timestamp)
    % - ';'
    % - One or more digits, optionally prefixed with '+' or '-' (i.e. channel0)
    % - ... etc ...
    pattern = '^[^;]*;([0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);([+\-]?[0-9]+);\s*$';

    % Minimal check
    if (~(isnumeric(avgDurationNs) && isscalar(avgDurationNs) && isreal(avgDurationNs) && (avgDurationNs > 0)))
        error('avgDurationNs must be a real, positive, numeric scalar.');
    end

    % First, open the file for later line-by-line reading
    [fid, err] = fopen(filename, 'rt');
    if (fid <= 0), error('Failed to open file: %s', err); end
    cuo = onCleanup(@()fclose(fid)); % Closes the file in all cases (normal termination, exception, or even ctrl+c)

    % Based on the time span covered by the file and the averaging duration,
    % estimate the final size of the data and preallocate it.
    % NB: Quick exit for the easy cases where there is zero or a single data line
    [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs);
    if (canQuickExit), return; end

    % Do the actual averaging
    fixBinAveragingMode = false; % Is averaging done at fixed or floating time positions?
    [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data);
%]
end
%% ---
function [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for first and last interesting lines in the file
    % NB: This assumes timestamps are sorted in increasing order
    nothingYet = true;
    fastAndFurious = true;
    firstReadTokens = []; lastReadTokens = [];
    while(true)

        % Read line-by-line until finding something interesting or eof
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        if (nothingYet) % It is the first time we found an interesting line
            nothingYet = false;
            firstReadTokens = tokens;
            lastReadTokens = tokens;
            if (fastAndFurious)
                % Don't bother reading each line, move almost to the
                % end of the file directly. NB: This can be risky if there
                % are many empty lines at the end of the file, or if the
                % lines are not all of the same length
                fseek(fid, -3 * numel(tline), 'eof');
            end
        else % This is not the first time
            lastReadTokens = tokens;
        end

    end

    % Conversion of matched tokens timestamps
    firstReadTimestamp = []; lastReadTimestamp = [];
    if (~isempty(firstReadTokens)), firstReadTimestamp = str2double(firstReadTokens{1}{1}); end
    if (~isempty(lastReadTokens)), lastReadTimestamp = str2double(lastReadTokens{1}{1}); end

    % Compute preallocation
    if (isempty(firstReadTimestamp))
        % Easy, not a single line of data in the whole file
        timeNs = zeros(1, 0);
        data = zeros(0, 4);
        canQuickExit = true;
    elseif (isempty(lastReadTimestamp) || (abs(lastReadTimestamp - firstReadTimestamp) < 0.1))
        % Easy again, just one line of data in the whole file
        timeNs = zeros(1, 1);
        data = [str2double(firstReadTokens{1}{2}), str2double(firstReadTokens{1}{3}), str2double(firstReadTokens{1}{4}), str2double(firstReadTokens{1}{5})];
        canQuickExit = true;
    else
        % Ok, let's allocate
        estimateBlockCount = ceil((lastReadTimestamp - firstReadTimestamp) / avgDurationNs);
        timeNs = zeros(1, estimateBlockCount);
        data = zeros(estimateBlockCount, 4);
        canQuickExit = false;
    end
%]
end
%% ---
function [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for interesting lines till the end
    % NB: We assume timestamps are sorted in increasing order in the file
    idx = 0;
    nothingYet = true;
    while(true)

        % Read line-by-line until finding something interesting
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        lastReadTimestamp = str2double(tokens{1}{1});

        if (nothingYet)
            % Very first data line: open the first averaging window
            nothingYet = false;
            idx = idx+1;
            avgCount = 1;
            timeNs(idx) = lastReadTimestamp;
            nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            avg = [str2double(tokens{1}{2}), str2double(tokens{1}{3}), str2double(tokens{1}{4}), str2double(tokens{1}{5})];
        elseif (lastReadTimestamp > nextStopTimestamp)
            % Current window is over: store its average and open a new one
            data(idx, :) = avg / avgCount;
            idx = idx+1;
            avgCount = 1;
            if (fixBinAveragingMode)
                % Fixed time slots from origin
                offset = mod(lastReadTimestamp - timeNs(1), avgDurationNs);
                timeNs(idx) = (lastReadTimestamp - offset);
                nextStopTimestamp = timeNs(idx) + avgDurationNs;
            else
                % Run timer for averaging immediately after receiving data
                timeNs(idx) = lastReadTimestamp;
                nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            end
            avg = [str2double(tokens{1}{2}), str2double(tokens{1}{3}), str2double(tokens{1}{4}), str2double(tokens{1}{5})];
        else
            % Still inside the current window: accumulate
            avgCount = avgCount + 1;
            avg = avg + [str2double(tokens{1}{2}), str2double(tokens{1}{3}), str2double(tokens{1}{4}), str2double(tokens{1}{5})];
        end

    end

    if (~nothingYet)
        % Make time relative to the first sample and store the last window
        timeNs = timeNs - timeNs(1);
        data(idx, :) = avg / avgCount;
    end

    % Trim unused preallocated data if required
    timeNs((idx+1):end) = [];
    data((idx+1):end, :) = [];
%]
end
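The interp1 remapping mentioned before the listing could look roughly like the sketch below. The timeNs and data values here are made-up placeholders with the same shapes as the outputs of AveragingSensorData (1-by-n time vector, n-by-4 averages), not real results, and stepNs is an assumed resampling step.

```matlab
% Placeholder inputs shaped like the outputs of AveragingSensorData
timeNs = [0, 1.0e9, 2.5e9, 4.0e9];      % unevenly spaced window start times
data = [10 20 30 40; ...
        12 22 32 42; ...
        15 25 35 45; ...
        18 28 38 48];                    % one row per window, one column per channel

% Linearly spaced time scale covering the same span (assumed step: 1 second)
stepNs = 1e9;
uniformTimeNs = 0:stepNs:timeNs(end);

% interp1 interpolates each column of data independently
uniformData = interp1(timeNs(:), data, uniformTimeNs(:), 'linear');
```

With these placeholder values, uniformData has one row per entry of uniformTimeNs and its first column is 10, 12, 14, 16, 18. interp1 requires the sample points to be distinct, which holds here as long as the timestamps in the file are strictly increasing.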
The code is also stored on GitHub: Averaging of sensor values collected in a very large file