YouTube Transcript: IEEE Conference Paper Presentation - Machine Learning Track
Core Theme
This research explores the development of a generalized reinforcement learning agent capable of solving higher-order board states of Tic-Tac-Toe (beyond the standard 3x3) by adapting existing techniques and focusing on a "well-posed problem" framework.
No questions from my side. Very well, thank you.

Okay, thank you, sir. So, next in line...

Hello. Are my video and audio coming through?

Yes, you are visible and your voice is audible.

Great. I am just starting my screen share; let me know if you are able to see it. Is it changing?

Yes, it's changing.

Okay, great. Just to check that the marker is visible on the screen as I write.

Yes, the marker is also visible. Great.
So, first off, thank you everyone for giving me this opportunity to present my paper, titled "Generalized Agent for Solving Higher Board States of Tic-Tac-Toe Using Reinforcement Learning."

Let me first cover the agenda and explain why this paper was written. There are many techniques for the 3x3 board of tic-tac-toe; some of them are listed here as well, for example Q-learning and the minimax algorithm, which have been implemented and explored in the past for solving the 3x3 states of tic-tac-toe. What I am trying to explore in this study is whether we can make these approaches more generic: can we have a single algorithm, with a few small tweaks here and there, that applies to the higher board states of tic-tac-toe, that is, 3x3, 4x4, or 5x5? This is very much a beginning study, more or less an exploration phase, and at the end I will share the results we obtained and the techniques used for this idea.
For the existing systems, I would like to take you back to 2018, when Google DeepMind came out with its state-of-the-art AlphaZero. AlphaZero is a very advanced reinforcement learning system that could beat even the strongest chess engine we had, Stockfish. The limitation, however, was that the source code was never made public; only some twenty games were released, and the audience is still left pondering what the algorithm might look like. The furthest we have gone since is Leela Chess Zero, a community-based reinforcement learning engine for board-game play, but that too is community-driven, and we can only make approximations at this point. So this work is also an attempt in that direction.
Next, the problem itself is framed as a well-posed learning problem. Whenever we look at board games, I try to define them as well-posed problems. A well-posed learning problem has three main components. There is a task T, and the task here is clear: playing the game of tic-tac-toe. There is a performance metric P, which is the number of games won by the agent. And there is the experience E, which is the number of games the agent plays against itself.

So how is a well-posed learning problem defined? If, on a task T, the performance measured by P goes up with the experience E, that is what we call a well-posed problem in the world of board games.
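For reference, this framing can be written compactly as below; it follows the standard well-posed learning problem definition, and the inequality is only one informal way of saying "performance improves with experience" (my notation, not something written on the slide).

```latex
% Well-posed learning problem for the tic-tac-toe agent:
%   T : task         -- playing tic-tac-toe on an n x n board
%   P : performance  -- number (or fraction) of games won by the agent
%   E : experience   -- games the agent plays against itself
%
% The problem is well posed if performance at T, measured by P,
% improves (on average) as the experience E grows:
\[
  P\bigl(T \mid E_k\bigr) \;\le\; P\bigl(T \mid E_{k+1}\bigr),
  \qquad k = 0, 1, 2, \ldots
\]
```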
To design the learner for a board game, I would like to take you through five points; this is how the algorithm is defined. The first is choosing the training experience, which we just discussed. We can have direct training experience or indirect training experience. Direct training experience is when you have a board state (for example, this sample board of X's and O's) and alongside it you are also given the best move for that state; say the best move here is to place the cross in this cell. That is direct feedback. The problem with direct feedback is that you have to supply that expert knowledge alongside the rules of the game. In this work, indirect feedback is employed instead: only the rules of the game are given, and the rest, such as whether a state is winning, the agent has to decide for itself.
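As a rough illustration (not code from the paper, whose source is not shown), the two kinds of training experience might be represented as follows; the board encoding, the move indices, and the field names are assumptions made for this sketch.

```python
# Sketch: direct vs. indirect training experience for 3x3 tic-tac-toe.
# Boards are encoded as flat tuples of 'X', 'O', or '' (empty), row-major.

# Direct experience: each example pairs a board state with the best move,
# so expert knowledge must be supplied alongside the rules of the game.
direct_example = {
    "board": ('X', 'O', '',
              '',  '',  '',
              '',  '',  ''),
    "best_move": 4,   # hypothetical expert label: X should take the centre
}

# Indirect experience: only the trace of a self-play game and its final
# outcome are recorded; the agent must work out for itself which of the
# intermediate states were actually favorable.
indirect_example = {
    "moves": [(0, 'X'), (1, 'O'), (4, 'X'), (2, 'O'), (8, 'X')],
    "outcome": "X wins",   # X completes the 0-4-8 diagonal
}
```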
The next step in the algorithm is selecting the target function. We set it up so that if the agent wins from a state, the final board is given a utility value of +100; if the agent loses, the final board is given a utility value of -100; and if the game is drawn, the final board is assigned a neutral utility value. That defines the target function for final board states. But what happens if we are in the middle of one of these games, at a board state that is neither a win nor a draw but somewhere in between? How do we compute the utility value for such intermediate board states?
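A minimal sketch of how those training values could be assigned is given below. The draw value of 0 and the usual trick of backing up the current estimate of the successor position for intermediate boards are assumptions, since the talk does not spell either out; the `outcome` label is a hypothetical input produced elsewhere by the game logic.

```python
# Target (training) values for board states, as described on the slide.
WIN_VALUE, LOSS_VALUE, DRAW_VALUE = 100, -100, 0   # draw value assumed to be 0

def training_value(outcome, successor_estimate=None):
    """Return V_train for a board.

    `outcome` is 'win', 'loss', 'draw', or 'ongoing'.  For an ongoing board
    the caller passes `successor_estimate`, the current learned estimate
    V_hat of the position reached after the agent's next move, and that
    estimate is backed up as the training value.
    """
    if outcome == 'win':
        return WIN_VALUE       # agent (playing X) won the game
    if outcome == 'loss':
        return LOSS_VALUE      # agent lost the game
    if outcome == 'draw':
        return DRAW_VALUE
    return successor_estimate  # intermediate board: back up V_hat(successor)
```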
This is where the variable part comes into the picture: some minor tweaks are required when going to a higher board state. For the 3x3 tic-tac-toe, these are the board features we have defined, the categories we use for judging whether an intermediate board state is preferable or not. Let me zoom in here and take you through this slide.

We take the features as follows. x1 is a feature where we check how many lines (rows or columns) contain a single X, counting how many such lines there are in the matrix. Similarly, for x2 we check how many lines contain a single O. For x3 we check how many lines contain two consecutive X's, and likewise we count how many lines contain two consecutive O's. In this way we have defined six parameters, and among these six we also have the completed lines, OOO and XXX, the latter being the winning state for the agent.
Using these features we also compute the intermediate training values. We have defined all the feature values here, and these are the weights that come out of the training; if you take a dot product of the two, you get the linear function we use for evaluating an intermediate board state, that is, an estimate of how likely a win is from that state. Mind you, this function could also be quadratic or cubic; but to keep the algorithm simple, this being a very early-stage study, we are keeping it linear.

So I hope that is clear.
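For concreteness, a minimal sketch of that feature extraction and linear evaluation is shown below, assuming the six counting features just described plus a constant term w0. The exact feature definitions (for instance, counting only lines that contain a single player's marks, and ignoring diagonals, since only rows and columns were mentioned) and the board encoding are my reconstruction, not code from the paper.

```python
def lines(board, n=3):
    """Yield every row and column of an n x n board (flat tuple, row-major)."""
    for r in range(n):
        yield [board[r * n + c] for c in range(n)]
    for c in range(n):
        yield [board[r * n + c] for r in range(n)]

def features(board, n=3):
    """x1..x6: number of lines holding exactly 1, 2, or 3 of one player's marks."""
    x = [0] * 6
    for line in lines(board, n):
        xs, os = line.count('X'), line.count('O')
        if xs and not os and xs <= 3:
            x[2 * (xs - 1)] += 1       # x1 (one X), x3 (two X's), x5 (three X's)
        elif os and not xs and os <= 3:
            x[2 * (os - 1) + 1] += 1   # x2 (one O), x4 (two O's), x6 (three O's)
    return x

def v_hat(board, weights):
    """Linear evaluation: V_hat(b) = w0 + w1*x1 + ... + w6*x6."""
    w0, *w = weights
    return w0 + sum(wi * xi for wi, xi in zip(w, features(board)))
```

On an empty board every feature count is zero, so v_hat simply returns the constant term w0.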
Now let me go back to the presentation slides. The target function is now defined, and we have also outlined the algorithm and the feature vector we are going to use. This slide is just presenting those features: this part shows the x0 case, this shows the x1 feature, and this shows the x6 feature, the last one. It is very logical that the agent should try to minimize this last one, the count of lines filled with O's, and maximize the corresponding X count, assuming that the agent is playing X and moving first.
The methodology used in this case is that of a well-posed learning problem, and there are four components in it. The first is the experiment generator: in this case, the experiment generator is what gives us a new board to try out, and here it is simply an empty 3x3 tic-tac-toe board. You could also start from a randomized state, but for simplicity we use an empty board. The second part of the algorithm, or of the architecture, is the performance system. Whatever state the performance system is given, it makes the best move for that state according to the feature weights it currently has, and it produces the solution trace of the game tree: which game was played and what the end result was. That is the performance system. Next is the critic.
What the critic does is attach the utility values to those board states, so that you then have training examples. This also gives me the opportunity to explain what a training example is. The games the agent plays against itself are represented in the form of training examples. A training example is, you could say, a pair: it contains a board state, for example this sample board, which currently favors X because X is going to win from here, and it contains the utility value for that state; for example, the utility value here came out to be 52 in favor of X. That is the training experience the agent is using.
After you have all the training examples with their utility values, you try to hypothesize what the ideal target function, that is, the ideal weight vector, should look like. That is the role of the generalizer. Once you have some idea of what the feature weights should look like, a new experiment is generated, another game is played, and the weights are optimized again using the least mean squares (LMS) optimization technique. Least mean squares can also be viewed as stochastic gradient descent: the stochasticity of the self-play games is what drives the gradient descent of the algorithm. You can repeat this loop as many times as you want, playing as many games as you want, and those results will be shared shortly.
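The LMS weight update the generalizer performs has the standard form sketched below; it reuses the features() and v_hat() helpers from the earlier sketch, and the default learning rate is an assumption (the Q&A later quotes a value of roughly 0.3 to 0.4).

```python
def lms_update(weights, board, v_train, eta=0.3):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i."""
    error = v_train - v_hat(board, weights)   # how far off the current estimate was
    x = [1.0] + features(board)               # x_0 = 1 pairs with the constant w0
    return [w + eta * error * xi for w, xi in zip(weights, x)]
```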
This is the class diagram for the proposed architecture that was implemented. You have the experiment generator, the generalizer, the performance system with its variables defined, and the global player with the moves it can play and the functions it can use, and you have the critic's functions as well. Because of the lack of time I will not go deep into it, but we have listed all the functions that are currently used here.
This is the overall algorithm we are employing. Mind you, the input is the number of training samples, that is, how many games you want to train the agent over. Do you want to train it over 100 games, in which case the accuracy will of course be lower, or over 50,000? That is the input you give to the agent, and naturally a larger number of games played will increase the accuracy of the agent. Further down in the algorithm, we initialize the weight vector to 0.5 for every feature we defined earlier, the counts of X's and O's per line, so all the weights are 0.5 in the initial case. The games counter starts at zero, and the numbers of wins, losses, and draws are also counted at the end. This is just the architecture described above playing the game against itself and optimizing the weight values; at the end it returns the final optimized weight vector and the numbers of wins, losses, and draws.
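Putting the pieces together, the training routine on this slide might look roughly like the self-play loop below. The play_self_game helper is hypothetical (it stands in for the performance system plus critic, returning the list of (board, V_train) pairs for one game and that game's result), and lms_update is the sketch from earlier; the actual code was written from scratch by the authors and is not reproduced here.

```python
def train(num_games, eta=0.3, num_features=6):
    """Self-play training loop: returns final weights and win/loss/draw counts."""
    weights = [0.5] * (num_features + 1)      # w0..w6 all initialized to 0.5
    counts = {'win': 0, 'loss': 0, 'draw': 0}

    for _ in range(num_games):
        # Performance system + critic: play one game against itself and
        # label every board seen with its training value.
        examples, result = play_self_game(weights)
        counts[result] += 1
        # Generalizer: one LMS step per labelled board state.
        for board, v_train in examples:
            weights = lms_update(weights, board, v_train, eta)

    return weights, counts
```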
Now we can compare the results across the numbers of games trained. This table explains the results we got. If you train the agent over a thousand games, the wins, losses, and draws came out as 761, 82, and 157, and the win/draw ratio was 4.85 (wins divided by draws, 761/157). The win/draw ratio is the metric you can use to check how the agent is performing, and similar results are shown for the other training runs as well, going up to 150,000 games, with the corresponding numbers of wins, losses, and draws. The thing to note here is that the win/draw ratio keeps improving as more games are played. You can also observe the feature weights here as the games were played; mind you, the agent is trying to make sense of the w6 feature we defined, which is the number of lines of O's.
So the agent is trying to minimize it: the initial value was around -115, and the agent keeps pushing it further negative, down to around -175 and beyond, trying to negate that possibility altogether. It is trying to destroy whatever the opposing agent is trying to do to it.

This is just a plot giving an overview of what the previous slide showed. You can see that with the number of games played, the win/draw ratio improves, you could say, almost linearly. And this is the evolution of the weight vector with the number of games played: the w6 weight defined here stays consistently low throughout, remaining down the line, together with w0 and the weight for a single O in a line.
In conclusion, this is a very early study, and it attempts to give a base implementation for the idea of generically solving tic-tac-toe. The win/draw ratio is also very plausible in this, in that we can see a roughly linear curve improving with time. As for how we can extend it to future board states: if you have the feature vector defined as shown here, you need to make only a few tweaks in your code. Let me just bring that slide up again. Yes, you need to make only a few tweaks in the feature vector you are considering in order to extend it to higher board states as well. We can also observe that even with only a linear approximation we are getting very plausible results, so a higher-board version would add more feature vectors to this. We also found in the study, though we did not include it in the conclusion, that some of these feature vectors might not be required as we went further down the line. So this is just a very early study on how we can make the agent more generic.
That is the end. If you have any questions, or want any points explained further, do let me know. That was all for the presentation, and I think the rest of the slides... oh yes, there is an appendix as well. While we were training the agent, these were some of the sampled games it played against itself. You can see figure 19 on your side, where X wins: this is just for reference purposes, showing that after 500,000 games played the agent arrived at a win in this situation, and in the other case the agents did not end in a draw either. This is just the performance system and the critic at work. We added it here for reference because of the limit on the number of slides allowed. So, I am open to questions now. Thank you.
Which optimizer have you used in your presentation, I mean in its implementation?

We have used the least mean squares optimizer. The least mean squares optimization looks something like this; I have noted it down here as well. If we have a weight w_i for feature i, it is updated from its existing value as w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i. Here eta is the training rate, usually 0.3 or 0.4 in our case, x_i is the feature value, and the bracketed term is the error: because of the least mean squared error it comes out like this, where V_train is the ideal (training) value and V_hat is our approximation. So this is how the optimization happens at the end of each stage, using least mean squares.

Okay.
Have you implemented the Adam optimizer?

That is a possibility, sir. As I already mentioned, because this is a very early phase of exploring a generalized system, we are only exploring least mean squares, and we have also kept the learned function very simple, a linear one. The Adam optimizer is also one of the ways we could explore this further, but it has not been tried in this study yet.

Okay, okay, no problem. Okay, good.
Are you able to hear me? Yes, Mr. Kunal. Okay.

So I have a few questions for you. I want to ask whether least mean squares is the optimizer, or whether it is the loss function.

Is it a loss function? Basically, sir, least mean squares looks something like this...

You said it is a loss function, right? So what optimizer have you used in your research?

It is stochastic-based...

You are talking about stochastic optimization?

Yes. Can you name the optimizer that you have used in your work?

Yes, sir: stochastic gradient descent.

Okay. So which package have you used for the implementation of this?

The whole code was made from scratch.
I haven't used any package for this.

Okay, you haven't used anything; you have written everything yourself?

Yes, it is only Python code. Mind you, even the least mean squares update and the error logic, which looks something like (V_train minus V_hat) squared, were written from scratch.

Okay, okay, so you have written everything from scratch. Have you used a Markov decision process somewhere?

No, sir, we haven't.
Okay, no other questions. Thank you.

Thank you. So, does the session have any further questions?

No, sir, I think he has explained all the queries very well.

Thank you, no questions.

Thank you, Mr. Kushal, and apologies for not getting your name right.