0:00 Welcome back. In this video I will discuss how to apply the Q-learning algorithm to the given problem definition. This is solved example number one; the links for the other examples are given in the description below.
0:12 In this case we are given a building with five rooms, numbered 0, 1, 2, 3 and 4. The outside of the building is considered one big room and is represented as 5. Between the rooms there are doors, which means the agent can move, for example, from 0 to 4 or from 4 to 0, and similarly from 1 to 3 or from 3 to 1.
0:38 What we do is convert this building into states and actions: each room is represented as a state, and each door is represented as an action. This is how it looks. The states are 0, 1, 2, 3, 4 and 5. Between 0 and 4 there is a door, and it represents an action, so the agent can go from 0 to 4 or come back from 4 to 0. The other doors work the same way. One more thing: there is also an edge from 5 back to itself, so the agent can go from 5 to 5, and that is represented here as well.
1:20 In this case we assume that 5 is the goal state, so we need to identify an optimal path from each and every state to this goal state.
1:31 One more very important thing to remember: any action that leads directly to the goal state gets an instant reward of 100, and all the remaining actions get a reward of 0. So this action, this second action and this third action, which all lead to the goal state, are each given an instant reward of 100, and the rest are 0.
1:55 Now we will apply the Q-learning algorithm to this state diagram to get the optimal path. The very first thing we need to do is write the reward matrix. The reward matrix has the states as its rows and the actions as its columns; in this case we have six states, 0 to 5, and six actions, again 0 to 5.
2:28 Now I will show you how to fill this reward matrix. Assume you are in state 0. From state 0 you can perform only one action, the one that leads to state 4; its reward is 0 and all the remaining entries in that row are -1. So from this row you can see that in state 0 the only possible action is 4. Similarly, look at the second row: in state 1 you can perform action 3 or action 5. Action 3 gives a reward of 0 and action 5 gives a reward of 100. The rest of the matrix is filled the same way, and -1 indicates that there is no direct edge between those states.
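To make this concrete, here is a minimal sketch of the reward matrix in Python/NumPy. Only some rows are read out explicitly in the video; the others are filled in from the door layout described above, so treat those entries as an assumption.

    import numpy as np

    # Rows = states 0..5, columns = actions 0..5.
    # -1 means no door between those states; 100 marks actions that reach the goal state 5.
    R = np.array([
        [-1, -1, -1, -1,  0, -1],   # state 0: only action 4
        [-1, -1, -1,  0, -1, 100],  # state 1: actions 3 (reward 0) and 5 (reward 100)
        [-1, -1, -1,  0, -1, -1],   # state 2: only action 3 (assumed from the door layout)
        [-1,  0,  0, -1,  0, -1],   # state 3: actions 1, 2, 4
        [ 0, -1, -1,  0, -1, 100],  # state 4: actions 0, 3, 5 (assumed from the door layout)
        [-1,  0, -1, -1,  0, 100],  # state 5: actions 1, 4, 5 (self-loop at the goal)
    ])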
3:31 I discussed this algorithm in detail in the previous video; the link for that video is in the description below. Go through it so that you understand the Q-learning algorithm in detail; that will help you follow this example.
3:49 Coming to the next part of the algorithm: first we need a learning rate (gamma), which I will initialize to 0.8. We also need to start from some initial state; I will take the initial state as 1. Then we need to initialize the Q matrix, which starts at 0, so we put 0 for every state and action.
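Continuing the same sketch, this set-up would look as follows (gamma is what the video calls the learning rate; in standard Q-learning notation it is the discount factor):

    gamma = 0.8              # value used throughout the example
    Q = np.zeros((6, 6))     # rows = states 0..5, columns = actions 0..5, all zero initially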
4:14 Now, as I said earlier, I will take the initial state as 1. Because the initial state is 1, we can perform two actions: action 3 or action 5. If you perform action 3 you get an immediate reward of 0, and if you perform action 5 you get an immediate reward of 100. Between these two we need to select one action; let us assume I select action 5. If I select action 5, I get the immediate reward of 100 and the next state becomes 5. So the current state is 1, the next state is 5, and the immediate reward is 100. Now that the next state is 5, we need to identify which actions we can perform from it. We can perform action 1 (the entry is 0), action 4 (the entry is 0) and action 5 (the entry is 100). We cannot perform any other action; for example, action 0 is not possible because its entry is -1. So from state 5 we can perform action 1, 4 or 5.
5:36 Now we apply the Q-learning equation: Q(current state, action) = R(current state, action) + gamma * max[Q(next state, all actions)]. The current (initial) state is 1 and the selected action is 5, so we need Q(1, 5). R(1, 5) is 100. Gamma, the learning rate, is 0.8, and it multiplies the maximum of Q over the next state and all its actions; the next state is 5 and the possible actions there are 1, 4 and 5, so we take the maximum of Q(5, 1), Q(5, 4) and Q(5, 5). In the Q matrix these are all still 0, so the maximum is 0, and 0 multiplied by 0.8 is 0. R(1, 5) is 100, so Q(1, 5) becomes 100, and that is what you can see in the updated Q matrix.
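Against the R and Q arrays sketched above, this first update is a single line:

    # Current state 1, chosen action 5, next state 5:
    # Q(1, 5) = R(1, 5) + gamma * max(Q(5, 1), Q(5, 4), Q(5, 5)) = 100 + 0.8 * 0 = 100
    Q[1, 5] = R[1, 5] + gamma * max(Q[5, 1], Q[5, 4], Q[5, 5])
    print(Q[1, 5])   # 100.0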
6:53 What has happened here is that we started in the initial state 1, and after applying the Q-learning update we reached the goal state. Because we reached the goal state, one episode has finished. Now we need to run the same kind of episode from each of the other initial states. For the next episode I will take the initial state as 3. If 3 is the initial state, you can see that we can perform action 1, action 2 or action 4, because those entries have values; all the remaining entries are -1, so those actions are not possible, which is what I have written here.
7:36 Between these three actions we need to select one; let us say I select action 1. So the current (initial) state is 3 and the next state is 1, and going from state 3 to state 1 gives an immediate reward of 0; that is one more important point to remember. Once 1 is the next state, the actions we can perform from it are action 3 and action 5, so those two go into the equation. Again, the initial state is 3 and the selected action is 1, so we need Q(3, 1) = R(3, 1) + gamma * max[Q(next state, all actions)]. R(3, 1) is 0, which is what I have written here, and 0.8 is the gamma value. The next state is 1 and the actions we can perform there are 3 and 5, so we take the maximum of Q(1, 3) and Q(1, 5). Q(1, 3) is 0 and Q(1, 5) is 100, so the maximum is 100, and 100 multiplied by 0.8 is 80. R(3, 1) is 0, so this gives 80. The value of Q(3, 1), which was 0 initially, has now become 80.
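The same one-line sketch for this second update:

    # Current state 3, chosen action 1, next state 1:
    # Q(3, 1) = R(3, 1) + gamma * max(Q(1, 3), Q(1, 5)) = 0 + 0.8 * 100 = 80
    Q[3, 1] = R[3, 1] + gamma * max(Q[1, 3], Q[1, 5])
    print(Q[3, 1])   # 80.0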
9:27 Now we need to keep performing episodes, because we have completed only two episodes so far. The same procedure has to be repeated again and again, and once you do that you will end up with this final Q matrix.
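A minimal training-loop sketch of this repetition; the number of episodes and the purely random choice of start state and action are assumptions, since the video does not state them:

    import random

    GOAL = 5
    for _ in range(1000):                                        # assumed episode count
        state = random.randint(0, 5)                             # random initial state
        while state != GOAL:                                     # an episode ends at the goal
            valid = [a for a in range(6) if R[state, a] != -1]   # actions with a door
            action = random.choice(valid)
            next_state = action                                  # the action index is the next room
            Q[state, action] = R[state, action] + gamma * Q[next_state].max()
            state = next_state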
9:44 Once you have this Q matrix, you can trace the optimal path from any state. Let us assume we are in the initial state 2 and want the best path; you can either draw the state diagram or trace it directly in the matrix. When you are in state 2, you select the best value from that row. Here there is only one value, 64, which leads to state 3. In state 3 there are three possible values, 80, 51 and 80; between these, 80 is the best, so you can select either action 1 or action 4. If you select 1 it follows one path, and if you select 4 it follows the other. If you selected 1, the best action from state 1 is 5, so the path is 2 to 3, 3 to 1, 1 to 5. Similarly, if you selected 4, the best action from state 4 is 5, so the path is 2 to 3, 3 to 4 and then 4 to 5.
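A small sketch of this greedy trace, assuming ties are broken by taking the lowest-numbered best action (np.argmax returns the first maximum it finds):

    def trace_path(Q, start, goal=5, max_steps=10):
        path = [start]
        state = start
        while state != goal and len(path) <= max_steps:
            state = int(np.argmax(Q[state]))   # best action from the current row; action = next room
            path.append(state)
        return path

    print(trace_path(Q, 2))   # [2, 3, 1, 5] with this tie-breaking; choosing 4 at state 3 gives [2, 3, 4, 5]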
10:54 So this is how we can apply the Q-learning algorithm to any given problem definition.
11:00 I hope the concept is clear. If you like the video, do like and share it with your friends. Press the subscribe button for more videos and the bell icon for regular updates. Thank you for watching.