
A variation of the windy gridworld problem in reinforcement learning, with my MATLAB code

In reinforcement learning, a typical example is the windy gridworld.

I am facing a new variation of the windy gridworld, which additionally has a wall and a stochastic wind, and I am stuck on these two new features.

Figure 1 shows a standard gridworld, with start (S) and goal (G) cells, but with two differences: there is a wall the agent cannot cross (indicated by the black cells), and there is a crosswind blowing down and to the left at the right edge of the grid. The available actions in each cell are the king's moves: 8 actions in total for each cell. If any action would bring you outside the gridworld or collide with the wall, you end up in the nearest cell (e.g. going northeast in the top left cell will bring you one cell to the right). In the right region the resultant next cells are shifted down-left by a stochastic "wind", the mean strength of which varies column by column. The mean strength of the wind is given below each column, in number of cells shifted down-left.

Due to stochasticity, the wind sometimes varies by 1 from the mean values given for each column (except if the mean is 0). That is, a third of the time you are shifted down-left exactly according to the value indicated below the column, a third of the time you are shifted one cell further down and left than that, and another third of the time you are shifted one cell less than the mean. For example, if you are in the row of the wall and in the middle of the opening and you move up, then one-third of the time you end up one column west of that cell, one-third of the time you end up two columns west and one row south of that cell, and one-third of the time you end up in the same column, one cell north of that cell. The wind affects the cell you are in, not the cell you are going to.

Implement the Q-learning algorithm for the above problem with α = 0.1, γ = 0.9 and initial Q(s, a) = 0 for all s, a. Each action generates a reward of r_s = -1, except for the actions that lead immediately to the goal cell (r_g = 10). Use: ε-greedy action selection with ε = 0.2; and greedy action selection with initial Q(s, a) > 0 and with initial Q(s, a) < 0.

My MATLAB code below runs.

My real problem is with the function nextPos = GiveNextPos(curPos, actionIndex, windpowers, gridCols, gridRows), in which the agent takes the chosen action and moves to the next cell. Several factors influence the next cell, such as the stochastic wind and the wall.

So my first question is about the stochastic wind: how can I program in MATLAB that with probability 1/3 the wind is 3, with another 1/3 it is 1, and so on?

My second question is about colliding with the wall: should I first calculate the next cell from the king's move and the wind, and then use that cell to check whether I hit the wall?

function WindyGridWorldQLearning()

    fprintf('WindyGridWorldQLearning\n'); 

    gamma = 0.9;
    alpha = 0.1;
    epsilon = 0.2;

    gridcols = 10; 
    gridrows = 7;
    windpowers = [0 0 0 0 1 1 2 2 1 1];
    fontsize = 16;
    showTitle = 1;

    episodeCount = 900;
    selectedEpisodes = [900];

    isKing = 1; 
    canHold = 0;

    start.row = 7;
    start.col = 1;
    goal.row = 1;
    goal.col = 1;

selectedEpIndex = 1;
actionCount = 8;

% initialize Q with zeros
Q = zeros(gridrows, gridcols, actionCount);

a = 0; % an invalid action
% loop through episodes
for ei = 1:episodeCount,
    %disp(sprintf('Running episode %d', ei));
    curpos = start;
    nextpos = start;

    %epsilon or greedy
    if(rand > epsilon) % greedy
        [qmax, a] = max(Q(curpos.row,curpos.col,:));
    else
        a = IntRand(1, actionCount);
    end

    while(PosCmp(curpos, goal) ~= 0)
        % take action a, observe r, and nextpos
        nextpos = GiveNextPos(curpos, a, windpowers, gridcols, gridrows);
        if(PosCmp(nextpos, goal) ~= 0), r = -1; else r = 10; end

        % choose a_next from nextpos
        [qmax, a_next] = max(Q(nextpos.row,nextpos.col,:));
        if(rand <= epsilon) % explore
            a_next = IntRand(1, actionCount);
        end

        % update Q:
        curQ = Q(curpos.row, curpos.col, a);
        nextQ = qmax; %Q(nextpos.row, nextpos.col, a_next);
        Q(curpos.row, curpos.col, a) = curQ + alpha*(r + gamma*nextQ - curQ);

        curpos = nextpos; a = a_next;
    end % states in each episode

    % if the current state of the world is going to be drawn ...
    if(selectedEpIndex <= length(selectedEpisodes) && ei == selectedEpisodes(selectedEpIndex))
        curpos = start;
        rows = []; cols = []; acts = [];
        for i = 1:(gridrows + gridcols) * 10,
            [qmax, a] = max(Q(curpos.row,curpos.col,:));
            nextpos = GiveNextPos(curpos, a, windpowers, gridcols, gridrows);
            rows = [rows curpos.row];
            cols = [cols curpos.col];
            acts = [acts a];

            if(PosCmp(nextpos, goal) == 0), break; end
            curpos = nextpos;
        end % states in each episode

        %figure;
        figure('Name',sprintf('Episode: %d', ei), 'NumberTitle','off');
        DrawWindyEpisodeState(rows, cols, acts, start.row, start.col, goal.row, goal.col, windpowers, gridrows, gridcols, fontsize);
        if(showTitle == 1),
            title(sprintf('Windy grid-world Q-learning - episode %d - (\\epsilon: %3.3f), (\\alpha = %3.4f), (\\gamma = %1.1f)', ei, epsilon, alpha, gamma));
        end

        selectedEpIndex = selectedEpIndex + 1;
    end

end % episodes loop

function c = PosCmp(pos1, pos2)
c = pos1.row - pos2.row;
if(c == 0)
    c = c + pos1.col - pos2.col;
end

function nextPos = GiveNextPos(curPos, actionIndex, windpowers, gridCols, gridRows)
nextPos = curPos;
switch actionIndex
   case 1 % east
       nextPos.col = curPos.col + 1;
   case 2 % south
       nextPos.row = curPos.row + 1;       
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.row = curPos.row;  end     
   case 3 % west
       nextPos.col = curPos.col - 1;
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.col = curPos.col;  end 
   case 4 % north
       nextPos.row = curPos.row - 1;
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.row = curPos.row;  end 
   case 5 % northeast 
       nextPos.col = curPos.col + 1;
       nextPos.row = curPos.row - 1;
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.row = curPos.row;  end 
   case 6 % southeast 
       nextPos.col = curPos.col + 1;
       nextPos.row = curPos.row + 1;
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.row = curPos.row;  end 
   case 7 % southwest
       nextPos.col = curPos.col - 1;
       nextPos.row = curPos.row + 1;
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.row = curPos.row;  end 
   case 8 % northwest
       nextPos.col = curPos.col - 1;
       nextPos.row = curPos.row - 1;
       if(nextPos.row ==4 && nextPos.col <= 4 )   nextPos.row = curPos.row;  end 
   case 9 % hold
       nextPos = curPos;
   otherwise
      disp(sprintf('invalid action index: %d', actionIndex))
end

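if(curPos.col > 4)
    % the wind of the cell the agent is in applies, not that of the destination;
    % indexing by curPos.col also avoids reading past the end of windpowers
    wind = windpowers(curPos.col);
    nextPos.row = nextPos.row + wind; % down (south) is +row, as in case 2
    nextPos.col = nextPos.col - wind; % left (west) is -col
end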



if(nextPos.col <= 0), nextPos.col = 1; end
if(nextPos.col > gridCols), nextPos.col = gridCols; end

if(nextPos.row <= 0), nextPos.row = 1; end
if(nextPos.row > gridRows), nextPos.row = gridRows; end




function n = IntRand(lowerBound, upperBound)
% uniform integer in [lowerBound, upperBound], both bounds inclusive
n = floor((upperBound - lowerBound + 1) * rand) + lowerBound;




function DrawWindyEpisodeState(rows, cols, acts, SRow, SCol, GRow, GCol, windpowers, gridrows, gridcols, fontsize)
DrawGrid(gridrows, gridcols);
DrawTextOnCell('S', 0, SRow, SCol, gridrows, gridcols, fontsize);
DrawTextOnCell('G', 0, GRow, GCol, gridrows, gridcols, fontsize);

for i=1:length(rows),
    DrawActionOnCell(acts(i), rows(i), cols(i), gridrows, gridcols, fontsize);
end

for i=1:gridcols,
    [xc, yc] = FindColBaseCenter(i, gridrows, gridcols);
    text(xc, yc, sprintf('%d',windpowers(i)), 'FontSize', fontsize, 'Rotation', 0);
end



function DrawEpisodeState(rows, cols, acts, SRow, SCol, GRow, GCol, gridrows, gridcols, fontsize)
DrawGrid(gridrows, gridcols);
DrawTextOnCell('S', 0, SRow, SCol, gridrows, gridcols, fontsize);
DrawTextOnCell('G', 0, GRow, GCol, gridrows, gridcols, fontsize);

for i=1:length(rows),
    DrawActionOnCell(acts(i), rows(i), cols(i), gridrows, gridcols, fontsize);
end



function DrawGrid(gridrows, gridcols)
xsp = 1 / (gridcols + 2);
ysp = 1 / (gridrows + 2);

x = zeros(1, 2*(gridcols + 1));
y = zeros(1, 2*(gridcols + 1));
i = 1;
for xi = xsp:xsp:1 - xsp,
    x(2*i - 1) = xi; x(2*i) = xi;
    if(mod(i , 2) == 0)
        y(2*i - 1) = ysp;y(2*i) = 1-ysp;
    else
        y(2*i - 1) = 1 - ysp;y(2*i) = ysp;
    end
    i = i + 1;
end

x2 = zeros(1, 2*(gridrows + 1));
y2 = zeros(1, 2*(gridrows + 1));
i = 1;
for yi = ysp:ysp:1 - ysp,
    y2(2*i - 1) = yi; y2(2*i) = yi;
    if(mod(i , 2) == 0)
        x2(2*i - 1) = xsp;x2(2*i) = 1-xsp;
    else
        x2(2*i - 1) = 1 - xsp;x2(2*i) = xsp;
    end
    i = i + 1;
end

plot(x, y, '-');
hold on
plot(x2, y2, '-');
axis([0 1 0 1]);
axis off
set(gcf, 'color', 'white');



function DrawTextOnCell(theText, rotation, row, col, gridrows, gridcols, fontsize)
[xc, yc] = FindCellCenter(row, col, gridrows, gridcols);
text(xc, yc, theText,  'FontSize', fontsize, 'Rotation', rotation);







function DrawActionOnCell(actionIndex, row, col, gridrows, gridcols, fontsize)
rotation = 0;
textToDraw = 'o';
switch actionIndex
   case 1 % east
       textToDraw = '\rightarrow';
       rotation = 0;
   case 2 % south
       textToDraw = '\downarrow';
       rotation = 0;
   case 3 % west
       textToDraw = '\leftarrow';
       rotation = 0;
   case 4 % north
       textToDraw = '\uparrow';
       rotation = 0;
   case 5 % northeast 
       textToDraw = '\rightarrow';
       rotation = 45;
   case 6 % southeast 
       textToDraw = '\downarrow';
       rotation = 45;
   case 7 % southwest
       textToDraw = '\leftarrow';
       rotation = 45;
   case 8 % northwest
       textToDraw = '\uparrow';
       rotation = 45;

   otherwise
      disp(sprintf('invalid action index: %d', actionIndex))
end
DrawTextOnCell(textToDraw, rotation,  row, col, gridrows, gridcols, fontsize);




function [x,y] = FindCellCenter(row, col, gridrows, gridcols)
xsp = 1 / (gridcols + 2);
ysp = 1 / (gridrows + 2);
x = ((2*col + 1) / 2) * xsp;
y = 1 - (((2*row + 1) / 2) * ysp);
x = x - xsp/5;



function [x,y] = FindColBaseCenter(col, gridrows, gridcols)
row = gridrows + 1;
xsp = 1 / (gridcols + 2);
ysp = 1 / (gridrows + 2);
x = ((2*col + 1) / 2) * xsp;
y = 1 - (((2*row + 1) / 2) * ysp);
x = x - xsp/5;

For the wind, just generate a uniform random number n between 0 and 1 (rand in MATLAB). If you want 3 different behaviors, each with a 1/3 chance, branch on n < 1/3, 1/3 <= n < 2/3, and n >= 2/3.
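
The three-way split above could be sketched roughly as follows. The helper name SampleWind is hypothetical (it is not in the question's code), and it uses randi to pick the offset directly instead of comparing rand against thresholds; columns with a mean of 0 are left without wind, as the problem statement requires.

```matlab
function w = SampleWind(meanWind)
% SampleWind  Hypothetical helper: draw a stochastic wind strength.
%   meanWind is the mean strength of the column the agent is currently in.
%   Returns meanWind-1, meanWind or meanWind+1, each with probability 1/3,
%   except that a mean of 0 always gives 0.
if meanWind == 0
    w = 0;                           % calm columns stay calm
else
    w = meanWind + randi([-1 1]);    % uniform integer offset in {-1, 0, 1}
end
end
```

In GiveNextPos this would probably be called as SampleWind(windpowers(curPos.col)), since the wind of the current cell is the one that applies.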

I don't quite understand what you're saying with the wall, but you should check the action the agent will take and the effect the wind will have on it and then see if this results in you hitting a wall. If so take the appropriate action.
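
One possible way to arrange that check is sketched below, under stated assumptions: the function name ResolveMove and the isWall matrix are hypothetical, the wall layout in the usage note is read off the question's own checks (row 4, columns 1 to 4), and "take the appropriate action" is interpreted here as staying in the current cell when the combined move lands on a wall cell. The problem statement's "nearest cell" rule may call for something different.

```matlab
function nextPos = ResolveMove(curPos, dRow, dCol, w, isWall)
% ResolveMove  Hypothetical helper: apply move and wind, then check the wall.
%   curPos     struct with fields row and col
%   dRow, dCol king's-move displacement (each -1, 0 or 1)
%   w          sampled wind strength in cells (pushes down and to the left)
%   isWall     logical matrix, true where a cell is part of the wall
[gridRows, gridCols] = size(isWall);

% combined effect of the action and the wind (down = larger row index)
cand.row = curPos.row + dRow + w;
cand.col = curPos.col + dCol - w;

% keep the candidate cell inside the grid
cand.row = min(max(cand.row, 1), gridRows);
cand.col = min(max(cand.col, 1), gridCols);

if isWall(cand.row, cand.col)
    nextPos = curPos;   % assumption: a blocked move leaves the agent in place
else
    nextPos = cand;
end
end
```

For the grid in the question, isWall could be built with isWall = false(7, 10); isWall(4, 1:4) = true;. Note that this sketch only tests the final cell, so a strong wind could carry the agent across the wall; checking the intermediate cells one step at a time would prevent that.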
