MATLAB: How do I fix the loop in my code so that it runs specific code on every table in a cell array?

I have code that does the following:

  1. reads my .xlsx files from a directory
  2. creates rows for the dates that are missing from the date column
  3. sets the variables to NaN on those dates
  4. fills some columns with the next good value, and fills one column with the average of two other columns.

Here is the complete code:

clear
close all
clc
D = 'C:\Users\Behzad\Desktop\New folder (2)';
filePattern = fullfile(D, '*.xlsx');
file = dir(filePattern);
x = {};
for k = 1 : numel(file)
    baseFileName = file(k).name;
    fullFileName = fullfile(D, baseFileName);
    x{k} = readtable(fullFileName);
    fprintf('read file %s\n', fullFileName);
end
% allDates should be out of the loop because it's not necessary to be in the loop
dt1 = datetime([1982 01 01]);
dt2 = datetime([2018 12 31]);
allDates = (dt1 : calmonths(1) : dt2).';
allDates.Format = 'MM/dd/yyyy';
% 1) pre-allocate a cell array that will store
%   your tables (see note #3)
T2 = cell(size(x)); % this should work, I don't know what x is
% the x is xlsx files and have different sizes, so I think it should be in
% a loop?
% creating loop
for idx = 1:numel(x)
    T = x(idx);
    % 2) This line should probably be T = readtable(x(idx));
    sort = sortrows(T, 8);
    selected_table = sort (:, 8:9);
    tempTable = table(allDates(~ismember(allDates,selected_table.data)), NaN(sum(~ismember(allDates,selected_table.data)),size(selected_table,2)-1),'VariableNames',selected_table.Properties.VariableNames);
    T2 = outerjoin(sort,tempTable,'MergeKeys', 1);
    % 3) You're overwriting the variable T2 on each iteration of the loop.
    % to save each table, do this
    T2(idx) = fillmissing(T2, 'next', 'DataVariables', {'lat', 'lon', 'station_elevation'});
    T2.tm_m(isnan(T2.tm_m)) = mean([T2.tmax_m(isnan(T2.tm_m)), T2.tmin_m(isnan(T2.tm_m))],2);

end

and here is the error:

Error using matlab.internal.math.sortrowsParseInputs>legacyParseCOL (line 106)
Column sorting vector must contain integers with absolute value between 1 and the number of columns in the first argument.
Error in matlab.internal.math.sortrowsParseInputs (line 29)
[col,colProvided] = legacyParseCOL(col,n,in2);
Error in sortrows (line 60)
[col, nanflag, compareflag] = matlab.internal.math.sortrowsParseInputs(A,varargin{:});

I want to thank Stack Overflow for giving me the opportunity to share my problem with MATLAB professionals. I am a researcher in another field and only need MATLAB temporarily to solve this problem.

Here is a link to download a small sample from Google Drive: T2.mat and two of my .xlsx files.

Update (more information): This is what I want from this code. I have many Excel (.xlsx) files that I want to process. The number of columns is the same in all of them, but the number of rows differs from file to file. The first point is that some dates do not exist in my data (.xlsx files), for example (MM/DD/YYYY):

1/1/2010
2/1/2010
3/1/2010
5/1/2010 (you can see 4/1/2010 not exist)

First, I want to create rows for the missing dates and set the corresponding variables to NaN on those dates. Second, I want to sort every .xlsx file by date (ascending). Third, I want to fill the blank cells in the lat, lon, and station_elevation columns with the next good value in each column. Finally, I want to fill the NaN values in the tm_m column by averaging the tmax_m and tmin_m columns.

Your question is multilayered, and from the comments in your code I can see that somebody else has already tried to help you. One tip for a problem with this many levels: solve each of the sub-problems in turn. Once you have found all the small solutions, you can bring everything together.

The problems you think you want to solve are the following:

  1. read my .xlsx files from my directory

  2. create rows for the dates missing from the date column

  3. set the variables to NaN on these dates

  4. fill some columns with the next good value and fill some columns with an average of some other columns.

The actual problem, as I see it, should be ordered as follows:

  1. Read a single .xlsx file from a given directory and load it into a table

  2. Fill in missing values (NaN) in the lat, lon, and station_elevation columns

  3. Fill in NaN values in the tm_m column by averaging the tmax_m and tmin_m columns. If tmax_m or tmin_m is NaN, leave tm_m as NaN

  4. Add the missing dates to the table by appending a row of NaN values for each date absent between [1982 01 01] and [2018 12 31]

  5. Sort this table by date

  6. Repeat steps 1 to 5 for all the files in the directory

This last step is actually separate from everything else: once you have solved steps 1 to 5, you can bundle them into a function and simply call it for every .xlsx file in your directory.

Here is a step-by-step construction of your solution:

Step 1: Read a single .xlsx file from a given directory and load it inside a table

filename = "Qaen.xlsx";
current_table = readtable(filename);

This is very simple, which means that when we bundle steps 1 to 5 into a function we won't have to worry about going through all the files: the function works with only one file at a time. We can abstract that part out of the problem.

Step 2: Fill in missing values (NaN) in the lat, lon, and station_elevation columns

current_table = fillmissing(current_table, 'next', 'DataVariables', {'lat', 'lon', 'station_elevation'});

This one is straightforward. Also note that the x from your code is no longer there; it is replaced by the variable current_table, which refers to the table we are currently working with. That makes the code a bit easier to read.
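As a minimal sketch of what the 'next' method does, here is a made-up two-column table (the values and the name demo_table are assumptions for illustration, not your data). Only the variables listed in 'DataVariables' are filled.

```matlab
% Hypothetical station data: the first two lat values are blank (NaN).
lat  = [NaN; NaN; 33.7; 33.7];
tm_m = [NaN; 12.0; 14.5; NaN];
demo_table = table(lat, tm_m);

% Only 'lat' is listed in DataVariables, so tm_m keeps its NaN values.
demo_filled = fillmissing(demo_table, 'next', 'DataVariables', {'lat'});
disp(demo_filled)
```

Note that 'next' has nothing to copy from when the NaN is in the last rows of a column, so trailing gaps would stay NaN.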

Step 3: Fill in NaN values in the tm_m column by averaging the tmax_m and tmin_m columns. If tmax_m or tmin_m is NaN, leave tm_m as NaN

% If either tmax_m or tmin_m is NaN, the mean stays NaN
mean_tm =  mean([current_table.tmax_m(isnan(current_table.tm_m)), current_table.tmin_m(isnan(current_table.tm_m))],2);
current_table.tm_m(isnan(current_table.tm_m)) = mean_tm;

Some of the values may still be NaN afterwards: we take the row-wise mean of a two-column matrix, and the mean of a row is NaN whenever either entry is NaN. That is exactly the intended behavior.
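A small sketch with made-up numbers shows both cases; the vectors below are assumptions standing in for the tm_m, tmax_m, and tmin_m columns.

```matlab
% Hypothetical columns: rows 2 to 4 have no tm value yet.
tm   = [5; NaN; NaN; NaN];
tmax = [10; 20; NaN; 30];
tmin = [0; 10; 5; NaN];

% Average tmax and tmin row by row, only where tm is missing.
needs_fill = isnan(tm);
tm(needs_fill) = mean([tmax(needs_fill), tmin(needs_fill)], 2);
disp(tm)
```

Row 2 becomes 15, while rows 3 and 4 stay NaN because one of their two inputs is NaN.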

Step 4: Add the missing dates to the table by appending a row of NaN values for each date absent between [1982 01 01] and [2018 12 31]

start_date = datetime([1982 01 01]);
end_date = datetime([2018 12 31]);
allDates = (start_date : calmonths(1) : end_date).';
allDates.Format = 'yyyy-dd-MM';

missing_dates = allDates(~ismember(allDates,current_table.data));
nan_values = NaN(sum(~ismember(allDates,current_table.data)),1);
variable_names = {'data','value'};
tempTable = table(missing_dates, nan_values,'VariableNames',variable_names);
current_table = outerjoin(current_table,tempTable,'MergeKeys', 1);

I broke the one-liner you had into multiple statements and used their results as variables when creating the table. A huge one-liner is rarely a good recipe for bug-free code.
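Here is a minimal sketch of the same idea on a made-up six-month range, assuming a two-column table. The names small_table, filler, and filled are hypothetical. As a variation on the code above, the filler table carries only the key column, so outerjoin itself supplies NaN for the other variable.

```matlab
% Hypothetical monthly data with April and June missing.
data  = datetime(2010, [1 2 3 5], 1).';
value = [1.5; 2.5; 3.5; 5.5];
small_table = table(data, value);

% Full monthly calendar for the range of interest.
allDates = (datetime(2010, 1, 1) : calmonths(1) : datetime(2010, 6, 1)).';

% Dates in the calendar that the table does not contain.
missing_dates = allDates(~ismember(allDates, small_table.data));

% Filler table holding only the key column 'data'.
filler = table(missing_dates, 'VariableNames', {'data'});

% Rows that exist only in filler get NaN in 'value'.
filled = outerjoin(small_table, filler, 'Keys', 'data', 'MergeKeys', true);
disp(filled)
```

The result has one row per calendar month, with NaN in value for the two months that were absent from the original table.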

Step 5: Sort this table by date

current_table = sortrows(current_table, 8); 

Finally, we simply sort by the date column (column 8), as you did before.
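If you would rather not depend on the date being in column 8, sortrows also accepts a variable name for tables. The sketch below uses a made-up table in which 'data' is deliberately the second column; it is an alternative to the index-based call, not a required change.

```matlab
% Hypothetical table whose date column 'data' is in position 2.
data  = datetime(2010, [3 1 2], 1).';
value = [3; 1; 2];
unsorted = table(value, data);

% Sort by variable name instead of by column position.
sorted_by_name = sortrows(unsorted, 'data');
disp(sorted_by_name)
```

Sorting by name keeps working even if a join or an added column shifts the position of the date variable.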

Now, if we want to repeat this process across many files, we need to bundle steps 1 to 5 into an easy-to-reuse function. If you need some help with functions, go here.

Bundling steps 1 to 5 into a function:

function [current_table] = process_xlsx(filename)
    current_table = readtable(filename);

    current_table = fillmissing(current_table, 'next', 'DataVariables', {'lat', 'lon', 'station_elevation'});

    % If either tmax_m or tmin_m is NaN, the mean stays NaN
    mean_tm =  mean([current_table.tmax_m(isnan(current_table.tm_m)), current_table.tmin_m(isnan(current_table.tm_m))],2);
    current_table.tm_m(isnan(current_table.tm_m)) = mean_tm;


    % Create a list of all the dates between start date and end date
    start_date = datetime([1982 01 01]);
    end_date = datetime([2018 12 31]);
    allDates = (start_date : calmonths(1) : end_date).';
    allDates.Format = 'yyyy-dd-MM';

    missing_dates = allDates(~ismember(allDates,current_table.data));
    nan_values = NaN(sum(~ismember(allDates,current_table.data)),1);
    variable_names = {'data','value'};
    tempTable = table(missing_dates, nan_values,'VariableNames',variable_names);
    current_table = outerjoin(current_table,tempTable,'MergeKeys', 1);

    current_table = sortrows(current_table, 8); 
end

Now we can reuse this with step 6!

Step 6: Repeat steps 1 to 5 for all the files in the directory

directory = 'C:\Users\Behzad\Desktop\New folder (2)';
filePattern = fullfile(directory, '*.xlsx');
file = dir(filePattern);
all_tables = cell(1,length(file));
for k = 1 : numel(file)
    baseFileName = file(k).name;
    fullFilename = fullfile(directory, baseFileName);

    fprintf('Processing file %s\n', fullFilename);
    all_tables{k} = process_xlsx(fullFilename);
end
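Each processed table ends up in one element of the cell array, so take care how you read it back. The sketch below uses two made-up tables (demo_tables is a hypothetical stand-in for all_tables) to show the difference between parentheses and braces. In your original loop, T = x(idx) produced a 1x1 cell rather than a table, which is consistent with sortrows complaining that column 8 does not exist.

```matlab
% Hypothetical stand-in for the cell array of processed tables.
demo_tables = {table([1; 2],    'VariableNames', {'tm_m'}), ...
               table([3; 4; 5], 'VariableNames', {'tm_m'})};

% Parentheses return a smaller cell array, not the table itself.
wrapped = demo_tables(1);
disp(class(wrapped))        % cell

% Braces return the content of the cell, which is the table.
first_table = demo_tables{1};
disp(class(first_table))    % table

% Number of rows in each stored table.
row_counts = cellfun(@height, demo_tables);
disp(row_counts)
```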

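If we put all of this together, we get the following single script. Note that a local function in a script file must come at the end of the file, which requires R2016b or later: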

% Goal of the script:
% 1) read my .xlsx files from my directory
% 2) create rows for the dates missing from the date column
% 3) set the variables to NaN on these dates
% 4) fill some columns with the next good value and fill some columns with an average of some other columns.

% Clear up the workspace, the figures and the command window feed
clear
close all
clc

directory = 'C:\Users\Behzad\Desktop\New folder (2)';
filePattern = fullfile(directory, '*.xlsx');
file = dir(filePattern);
all_tables = cell(1,length(file));
for k = 1 : numel(file)
    baseFileName = file(k).name;
    fullFilename = fullfile(directory, baseFileName);

    fprintf('Processing file %s\n', fullFilename);
    all_tables{k} = process_xlsx(fullFilename);
end



function [current_table] = process_xlsx(filename)
    current_table = readtable(filename);

    current_table = fillmissing(current_table, 'next', 'DataVariables', {'lat', 'lon', 'station_elevation'});

    % If either tmax_m or tmin_m is NaN, the mean stays NaN
    mean_tm =  mean([current_table.tmax_m(isnan(current_table.tm_m)), current_table.tmin_m(isnan(current_table.tm_m))],2);
    current_table.tm_m(isnan(current_table.tm_m)) = mean_tm;


    % Create a list of all the dates between start date and end date
    start_date = datetime([1982 01 01]);
    end_date = datetime([2018 12 31]);
    allDates = (start_date : calmonths(1) : end_date).';
    allDates.Format = 'yyyy-dd-MM';

    missing_dates = allDates(~ismember(allDates,current_table.data));
    nan_values = NaN(sum(~ismember(allDates,current_table.data)),1);
    variable_names = {'data','value'};
    tempTable = table(missing_dates, nan_values,'VariableNames',variable_names);
    current_table = outerjoin(current_table,tempTable,'MergeKeys', 1);

    current_table = sortrows(current_table, 8); 
end

Hope it helps!
