0:03 uh no question from my side
0:07 very well thank you thank you
0:11 okay thank you sir so next in line for
0:20 hello
0:23 am I audible and is my video visible
0:26 you're visible and your voice is
0:29 great so I'm just starting my screen
0:31 share let me know if you are able to
0:32 see it
0:54 it's changing
0:57 yes it's changing okay great so
1:09 just to check if the marker is
1:11 visible on the screen as I am writing
1:17 yes the marker is also available great
1:19 so first off thank you everyone
1:21 for giving me this opportunity to
1:23 present my paper titled Generalized Agent
1:25 for Solving Higher Board States of
1:29 Tic-Tac-Toe Using Reinforcement Learning
1:32 so first let me cover the agenda
1:33 for this talk
1:36 that is why this paper was written or
1:37 why we are presenting this paper
1:40 today so there are many techniques for
1:42 the three-by-three board of
1:44 tic-tac-toe for example some of them are
1:46 listed over here as well Q-learning
1:49 and the minimax algorithm which
1:51 were implemented and also explored in the
1:53 past for solving three-by-three states
1:57 of tic-tac-toe but in this study what
2:00 I'm trying to explore is can we make
2:02 these approaches more generic can we have
2:05 a single algorithm with a little bit
2:07 of tweaking here and there and apply it
2:09 to higher board states of tic-tac-toe
2:11 beyond three by three that is four by
2:13 four or five by five so this is a very
2:15 early study which is mostly an
2:18 exploration phase and the results
2:19 will be shared at the end along with
2:21 the techniques used
2:24 for this approach
2:28 so for the existing systems I'd like to take
2:31 you back to 2018 when Google DeepMind
2:34 came in with its state-of-the-art
2:39 AlphaGo so AlphaGo is a very advanced
2:41 reinforcement learning algorithm that
2:43 they made which could beat even the
2:46 strongest of the stock engines that we
2:49 had Stockfish but the limitation was
2:51 that the source code was never
2:53 made public only some 20 games were
2:55 released and the audience is just
2:57 pondering over what the algorithm could
2:59 look like the closest that we have come
3:01 is Leela Chess which is a
3:04 community-based
3:05 you could say reinforcement learning
3:08 board-game-playing simulation but that
3:10 is also very community based and we can
3:12 only make approximations at this point
3:15 so this is also an attempt at that
3:18 so tic-tac-toe is actually
3:20 defined as a well-posed problem so
3:23 whenever we look at board games
3:26 I am trying to define
3:29 them as well-posed problems so well-posed
3:32 problems have three main
3:35 characteristics so there is a task
3:37 so the task is clear playing the game of
3:39 tic-tac-toe and there is a performance
3:41 metric which is the number of games won
3:43 by the agent and there is also
3:45 experience as well which is the number
3:48 of games played by the agent against
3:49 itself
3:51 so this is how the well-posed learning
3:54 problem statement is defined if for a
3:55 task T
3:58 with experience E the
3:59 performance P is going up that is
4:03 called a well-posed problem in the
4:05 world of board games
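The task/performance/experience triple just described can be written down as a small sketch. This is editorial illustration, not the paper's code; the class and field names are assumptions.

```python
# A hedged sketch of the well-posed learning problem described in the
# talk: a task T, a performance metric P, and an experience source E.
from dataclasses import dataclass

@dataclass
class WellPosedProblem:
    task: str         # T: what the agent does
    performance: str  # P: how success is measured
    experience: str   # E: what the agent learns from

tic_tac_toe = WellPosedProblem(
    task="play the game of tic-tac-toe",
    performance="number of games won by the agent",
    experience="games played by the agent against itself",
)
```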
4:09 for defining a board game I'd like to take
4:11 you through five points here on how
4:14 the algorithm is defined firstly
4:16 choosing the training experience that we just
4:19 discussed so we can have
4:22 direct training experience or indirect
4:24 training experience direct training
4:26 experience is when you have a board state
4:28 for example this is a sample board state
4:32 with X and O and you give the best move
4:36 for it alongside as well so X and O and for
4:38 example the best move is here trying to
4:40 place the cross here this is direct
4:42 feedback the problem with direct
4:45 feedback is that you need to encode the
4:48 rules of the game alongside it
4:50 but in this approach indirect feedback
4:53 is employed only the rules of the
4:55 game are given and whether a state
4:57 is winning the agent has to decide for
4:58 itself
5:01 next in the algorithm is
5:03 selecting the target function so we
5:06 say that if the agent wins the
5:09 state that is a won state we are giving it
5:11 a utility value of plus hundred if the
5:13 agent loses the state we are giving it a
5:16 utility value of minus hundred and
5:18 if the game is drawn the utility
5:19 value will be zero that is
5:27 the target function for the final board
5:28 states so what happens if we are in the
5:30 intermediate of one of these board
5:34 states for example this game is neither
5:37 won nor drawn it is in between so how
5:40 do we compute the utility value for this
5:41 board state
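The terminal-state targets just described (+100 for a win, -100 for a loss, 0 for a draw) can be sketched as follows. The function name and the `winner` encoding are illustrative, not taken from the paper's code.

```python
# Terminal-state target values as described in the talk: +100 if the
# agent (playing X) won, -100 if it lost, 0 for a draw.
def terminal_utility(winner):
    """winner: 'X' if the agent won, 'O' if it lost, None for a draw."""
    if winner == 'X':
        return 100
    if winner == 'O':
        return -100
    return 0
```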
5:43 so this is where the variable
5:45 part comes into the picture as well
5:48 some minor tweaks are required for
5:49 going to the higher board states so for
5:51 the three-by-three tic-tac-toe these are
5:53 the board features that we have defined
5:57 these are the categories that we
5:58 have taken
6:02 for determining whether an intermediate
6:04 board state is preferable or not so let me
6:06 just zoom in over here and take you through
6:08 this slide
6:10 so we are taking the features as
6:11 the following
6:15 we are taking X1 as a
6:18 feature where we are checking how many
6:22 single X's are in a row and how many X's
6:24 are in a column so if there is a single X
6:25 in a line we are counting how many of these
6:30 are in the matrix similarly for X2 we are
6:32 checking how many single O's are in a
6:36 line and for X3 we are checking
6:38 how many lines have two X's
6:40 consecutively we are also checking how
6:43 many consecutive O's are in a line so like
6:45 this we have defined six features and among
6:48 these we also have OOO and
6:50 XXX which are the winning
6:51 states for the agent
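A plausible sketch of the six line-count features described: for every line (row, column, or diagonal), count lines that hold exactly one, two, or three X's with no O's, and likewise for O's. The exact feature ordering and definitions in the paper may differ; `lines` and `features` are illustrative names.

```python
# Line-count features for an n x n tic-tac-toe board, as a sketch of
# the six features (x1..x6) described in the talk.
def lines(board):
    """Yield all rows, columns, and both diagonals of an n x n board."""
    n = len(board)
    for row in board:
        yield row
    for c in range(n):
        yield [board[r][c] for r in range(n)]
    yield [board[i][i] for i in range(n)]
    yield [board[i][n - 1 - i] for i in range(n)]

def features(board):
    x = [0] * 6
    for line in lines(board):
        xs, os = line.count('X'), line.count('O')
        if os == 0 and 1 <= xs <= 3:
            x[xs - 1] += 1      # lines with 1, 2 or 3 X's and no O
        if xs == 0 and 1 <= os <= 3:
            x[os + 2] += 1      # lines with 1, 2 or 3 O's and no X
    return x
```

For example, a board whose top row is `X X X` has one three-X line (the row) and five one-X lines (three columns plus two diagonals).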
6:55 using this we are also computing the
6:57 intermediate training values so
7:00 we have defined all the feature vectors
7:01 here and these are the weight
7:03 vectors that will come out of the
7:06 training if you take a dot product of
7:08 them we get a linear function
7:10 that is used for evaluating an
7:12 intermediate board state what is the
7:14 likelihood of a win from that
7:15 state
7:17 mind you this function could also be
7:20 quadratic or cubic as well but to
7:22 keep the algorithm simple
7:23 because this is a very early
7:26 stage we are keeping it linear
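The linear evaluation just described is a dot product of the weight vector with the board's features, plus the bias weight w0 (the talk's slides mention an x0 feature vector). A minimal sketch, with placeholder weight values rather than trained ones:

```python
# V_hat(b) = w0 + w1*x1 + ... + w6*x6: the linear evaluation function
# described in the talk. Weights here are placeholders.
def evaluate(weights, feats):
    """weights: [w0, w1..w6] where w0 is a bias term; feats: [x1..x6]."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feats))

w = [0.5] * 7                              # all weights start at 0.5
print(evaluate(w, [2, 1, 0, 1, 0, 0]))     # 0.5 + 0.5*(2+1+0+1) = 2.5
```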
7:28 so I hope that is clear
7:30 now let's go back to the presentation
7:32 slides
7:34 now the target function is defined and
7:36 we have also defined a bit of the
7:39 algorithm and the feature vector that we
7:40 are going to use
7:43 this is just one of the slides we
7:45 are presenting this column is
7:48 presenting the X0 feature vector and
7:51 this one the X1 feature
7:53 vector and this is the X6 feature vector
7:55 the last one
7:57 it makes sense that the
7:59 agent should try to minimize the state
8:02 which is all O's and maximize the X's
8:05 assuming that the agent is playing X and
8:07 is playing first okay
8:16 yeah so the methodology used in this
8:19 case is that of a well-posed problem and
8:21 there are four components in this first
8:22 an experiment generator in this
8:25 case the experiment generator is what gives us
8:27 the new board state to try out in this
8:30 case we are only having a
8:31 three-by-three tic-tac-toe board which is
8:33 empty you could also have a
8:35 randomized state but for simplicity
8:38 we are having an empty board
8:41 the second
8:44 component of the algorithm or
8:46 the architecture is the performance system
8:48 so what the performance system does is
8:51 whatever state it gets it will make the
8:53 best move according to that state with
8:54 whatever feature vector you have
8:56 currently and will give the solution
8:59 trace of the game tree so what game was
9:01 played and what the end result was that
9:04 is the performance system next is the
9:05 critic
9:07 what the critic does is
9:09 attach the
9:13 target values to those
9:16 board states so now you have a
9:18 training example this also gives me
9:19 the opportunity to explain to you what
9:22 a training example is
9:24 the games the agent plays against itself are
9:27 represented in the form of
9:29 training examples so a training example is
9:32 you could say a pair which will have
9:34 the board state for example this is a
9:36 sample board state
9:39 and it is currently favoring X
9:40 because X is going to win
9:42 from here and alongside you have the utility
9:44 value for it as well for example the
9:47 utility value came out as 52
9:50 in favor of X so this is the
9:52 training experience that the agent is
9:54 using
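A training example as just described is a pair of a board state and its target utility. The value 52 is the talk's own example; the particular board below and the pair representation are illustrative assumptions.

```python
# One training example: (board state, target utility). The board here
# is a hypothetical position favoring X; 52 is the utility value the
# speaker quotes as an example.
board = [['X', 'O', ' '],
         ['X', 'O', ' '],
         [' ', ' ', ' ']]
example = (board, 52)   # the state, and its estimated utility for X
```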
9:57 after you have got all the training
9:58 examples with their utility
10:01 values you will try to hypothesize what
10:03 the ideal
10:04 target function should look like what
10:07 the ideal weight vector should look like
10:10 so this is the role of the generalizer
10:13 once you have some idea of what the
10:15 weight vector should look like there
10:18 is again a new experiment generated then
10:20 there is again a game played and then
10:22 the weight vector is again optimized
10:24 using the least mean squares
10:26 optimization technique
10:28 least mean squares could also be
10:32 termed stochastic
10:34 gradient descent optimization so you
10:36 have stochastic samples guiding the
10:38 gradient descent of the algorithm so
10:41 you can just repeat this as
10:43 many times as you want you can play as many
10:45 games as you want as we'll be sharing
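The generalizer's least-mean-squares step just described is a stochastic-gradient update per example: each weight moves by the learning rate times the prediction error times the feature value. A minimal sketch; the function name and signature are assumptions.

```python
# One LMS step: w_i <- w_i + eta * (V_train - V_hat) * x_i,
# with eta the learning rate (the talk mentions roughly 0.3-0.4).
def lms_update(weights, feats, v_train, eta=0.3):
    """weights: [w0, w1..w6] (w0 is a bias with x0 = 1); feats: [x1..x6]."""
    v_hat = weights[0] + sum(w * x for w, x in zip(weights[1:], feats))
    error = v_train - v_hat
    weights[0] += eta * error            # bias term uses x0 = 1
    for i, x in enumerate(feats):
        weights[i + 1] += eta * error * x
    return weights
```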
10:53 so this is the class diagram for the
10:56 proposed architecture which was
10:58 implemented for this you have the
11:01 experiment generator you have the
11:03 generalizer you have the roles of the
11:05 performance system with its variables defined and
11:08 you have got the global
11:10 player with the moves that it
11:13 will play and the functions it can
11:15 utilize and you have the functions of the
11:19 critic as well because of the lack
11:20 of time we will not be going deep into it
11:23 but we have listed all the
11:24 functions that we are currently utilizing here
11:31 so this is the overall algorithm that we
11:35 are employing mind you the input
11:37 is the number of training samples so the
11:39 number of training samples is
11:40 how many games you want to train the
11:42 agent over do you want to train it over
11:44 100 of course the accuracy will be less
11:47 or do you want to train it over 50,000
11:48 so this is the input that you have to
11:50 give to the agent of course the number of
11:53 games played will increase the accuracy
11:56 of the agent further down the line for this
12:00 algorithm we are initializing the
12:04 weight vector as 0.5 for all the
12:05 features defined
12:07 that is the features that we
12:10 defined earlier the number of X's in a
12:14 line and the number of O's in a line
12:16 those are all set to 0.5 in the
12:18 initial case
12:22 initially the games count is zero
12:30 and the number of wins losses and draws we are
12:32 also counting at the end so this is the
12:35 above architecture just playing
12:37 the game against itself optimizing the
12:40 weight values and at the end you return the
12:42 final optimized
12:44 weight vector and the number of wins losses
12:47 and draws now we'll be comparing the
12:50 results for the number of games played
12:52 and the results that we got
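Before looking at the numbers, the overall self-play procedure just outlined might be sketched as below. This is an editorial skeleton, not the paper's code: the game itself is stubbed with a random outcome, where the real system plays tic-tac-toe through the performance system and updates weights via the critic and generalizer.

```python
import random

def train(n_games, n_features=6, eta=0.3, seed=0):
    """Skeleton of the described training loop: initialise all weights
    to 0.5, play n_games of self-play, count wins/losses/draws, and
    return the weights with the counts. Game play is stubbed here."""
    rng = random.Random(seed)
    weights = [0.5] * (n_features + 1)   # w0 plus one weight per feature
    wins = losses = draws = 0
    for _ in range(n_games):
        outcome = rng.choice(['win', 'loss', 'draw'])  # stub for a game
        if outcome == 'win':
            wins += 1
        elif outcome == 'loss':
            losses += 1
        else:
            draws += 1
        # ...critic labels the game trace, generalizer runs LMS here...
    return weights, wins, losses, draws
```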
12:54 so this is a table explaining the
12:56 results that we got if you train the
12:58 agent over a thousand games the number
13:00 of wins losses and draws we
13:05 got was 761 82 and 157 and
13:07 the win/draw ratio was 4.85 the win/draw
13:10 ratio is the metric by which you can check
13:13 how the agent is performing for you and
13:16 there are similar results for the other
13:18 numbers of games played as well
13:19 going on
13:22 up to 150,000 and the numbers of wins
13:24 losses and draws came in here and the
13:26 thing to note here is that the win/draw ratio
13:29 is continuously improving as you play
13:32 more games and the feature
13:34 weights we can also observe here as
13:37 the games were played mind you you can
13:40 observe that the agent is trying to make
13:44 sense of the W6 feature weight that
13:46 we defined which is the number of O's in one
13:48 line
13:50 so it is trying to minimize it you
13:52 can see that minus 115 was the initial
13:53 value and then it keeps minimizing it
13:56 further so the agent
13:58 is trying to negate that possibility
14:00 altogether it is trying to destroy
14:04 whatever the opposing agent is trying
14:06 to do to it
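The win/draw ratio in the first row of the results table can be reproduced directly: wins divided by draws.

```python
# Results quoted for 1000 training games: 761 wins, 82 losses,
# 157 draws; the win/draw ratio is wins / draws.
wins, losses, draws = 761, 82, 157
assert wins + losses + draws == 1000
ratio = wins / draws
print(round(ratio, 2))   # 4.85, matching the table
```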
14:09 so this is just a plot
14:12 giving an overview of what the
14:14 previous slide showed you can see that
14:16 with the number of games played the
14:19 win/draw ratio is improving you could
14:22 say almost linearly and this is the
14:25 evolution of the
14:28 weight vectors with the number of
14:30 games played you can see that the weight
14:34 defined here W6 which is the
14:36 O's-in-one-line feature stays
14:40 consistently low so it just remains
14:42 down the line throughout the training
14:48 so in conclusion this is a very
14:50 early study and it attempts to
14:52 give a base implementation to the idea of
14:55 generic solvability of tic-tac-toe and
14:57 the win/draw ratio is also very
14:59 plausible in that we can see a
15:00 linear curve improving with
15:03 time so
15:06 how can we extend it to higher board
15:09 states if you have the feature
15:12 vector defined here you only need to
15:15 make a few tweaks in your code
15:17 now let me just bring up that slide again
15:23 yeah you only need to make a few tweaks
15:25 in the feature vector that you are
15:27 considering to extend it to higher
15:29 board states as well and we can observe
15:31 that even with only a linear approximation
15:33 we are getting very plausible
15:36 results so higher versions will
15:38 have more feature vectors and we
15:40 also found in the study though we
15:42 didn't include it in the conclusion that some of
15:43 these feature vectors might not be
15:45 required as we went further down the
15:48 line so this is just a very early
15:49 study on this
15:51 on how we can make the agent
15:55 more generic that is the end
15:57 if you have any questions do let me
16:01 know or if you want any points to be
16:04 explained further do let me know that
16:06 was all for the presentation and I
16:08 think the rest of the slides oh yeah
16:11 there is an appendix as well while
16:13 we were training
16:15 the agent these were some of
16:17 the sample games it played against itself
16:19 so you can see figure 19 on your
16:19 side
16:22 where there was an X played so this is
16:24 just for reference purposes after
16:27 500,000 games played
16:29 the agent resulted in a win in this
16:31 situation and the agents
16:33 played to a draw in the other case so this
16:35 is just the work of the
16:37 performance system and the
16:39 critic on your side so this is just for
16:41 reference purposes which we have added
16:42 here because of the limitation on the
16:45 number of slides allowed so do let me
16:47 know I'm open to questions now thank you
17:03 which optimizer have you used in your
17:06 presentation I mean in its implementation
17:08 so we have used the least mean squares
17:11 optimizer so the least mean squares
17:13 optimization looks something like this
17:15 I have noted it down here as well so
17:18 if we have a weight w_i you can
17:20 optimize it using the existing value
17:22 plus
17:25 eta which is the learning rate it is
17:28 usually 0.3 to 0.4 in our case times
17:32 V_train minus
17:37 V_hat times x_i so x_i is the feature
17:39 value and V_train minus V_hat is the error in
17:41 our case because of the
17:43 least mean squared error it comes out like
17:47 this where V_train is
17:50 the ideal value and V_hat is our
17:52 approximation so this is how
17:53 the
17:56 optimization is happening at the end of
17:59 each stage using least mean squares okay
18:04 have you implemented the Adam optimizer
18:07 that is a possibility sir as I
18:09 already mentioned because this is the
18:11 very beginning
18:13 phase of exploring a generalized
18:16 system we are only exploring least
18:18 mean squares and we have also kept the
18:19 learning function very simple
18:21 which is a linear one the Adam optimizer is
18:23 also one of the ways that we could explore
18:26 it but it hasn't been tried in
18:29 this study yet okay okay no problem okay good
18:37 are you able to hear me uh yes Mr Kunal okay
18:40 so I have a few questions for
18:42 you I want to ask whether least
18:44 squares is the optimizer or the loss
18:46 function
18:48 is there a loss function basically
18:51 sir so least mean squares looks
18:52 something like this
18:55 either it is a loss function you said it is a
18:58 loss function right so what optimizer
19:01 have you used in your research
19:04 so it is a stochastic-based
19:11 you're talking about stochastic
19:13 optimization
19:16 optimizer so can you name the
19:18 optimizer
19:20 that you have used in your work yes sir
19:23 stochastic gradient descent sir
19:27 okay so which package have you
19:29 used for the implementation of this
19:41 the whole code was made from scratch
19:42 only
19:46 I haven't used any package for this
19:48 okay you haven't used you have written
19:51 everything I think only the python
19:54 code mind you even
19:56 the least mean squares and the error
19:58 logic which
20:00 looks something like V_train minus
20:03 the initial value squared is also written
20:04 from scratch
20:06 okay okay so you have written everything
20:08 from scratch so have you used a Markov
20:10 decision process somewhere
20:12 no sir we haven't
20:16 okay no other questions thank you so
20:18 do you have any other further
20:23 questions I think he has very well
20:26 explained all the queries
20:29 thank you no questions
20:31 thank you Mr Kushal and apologies for
20:34 not naming you correctly