YouTube Transcript: IEEE Conference Paper Presentation - Machine Learning Track
Core Theme
This research explores the development of a generalized reinforcement learning agent capable of solving higher-order board states of Tic-Tac-Toe (beyond the standard 3x3) by adapting existing techniques and focusing on a "well-posed problem" framework.
No questions from my side. Very well, thank you.

Okay, thank you, sir. So, next in line...

Hello. Are my video and audio coming through?

Yes, you are visible and your voice is audible.

Great. I am just starting my screen share; let me know if you are able to see it. Is it changing?

Yes, it's changing.

Okay, great. Just to check that the marker is visible on the screen as I write.

Yes, the marker is also visible. Great.
So, first off, thank you everyone for giving me this opportunity to present my paper, titled "Generalized Agent for Solving Higher Board States of Tic-Tac-Toe Using Reinforcement Learning."

Let me first cover the agenda and explain why this paper was written. There are many techniques for the 3x3 board of tic-tac-toe; some of them are listed here as well, for example Q-learning and the minimax algorithm, which have been implemented and explored in the past for solving the 3x3 states of tic-tac-toe. What I am trying to explore in this study is whether we can make these approaches more generic: can we have a single algorithm, with a few small tweaks here and there, that applies to the higher board states of tic-tac-toe, that is, 3x3, 4x4, or 5x5? This is very much a beginning study, more or less an exploration phase, and at the end I will share the results we obtained and the techniques used for this idea.
For the existing systems, I would like to take you back to 2018, when Google DeepMind came out with its state-of-the-art AlphaZero. AlphaZero is a very advanced reinforcement learning system that could beat even the strongest chess engine we had, Stockfish. The limitation, however, was that the source code was never made public; only some twenty games were released, and the audience is still left pondering what the algorithm might look like. The furthest we have gone since is Leela Chess Zero, a community-based reinforcement learning engine for board-game play, but that too is community-driven, and we can only make approximations at this point. So this work is also an attempt in that direction.
Next, the problem itself is framed as a well-posed learning problem. Whenever we look at board games, I try to define them as well-posed problems. A well-posed learning problem has three main components. There is a task T, and the task here is clear: playing the game of tic-tac-toe. There is a performance metric P, which is the number of games won by the agent. And there is the experience E, which is the number of games the agent plays against itself.

So how is a well-posed learning problem defined? If, on a task T, the performance measured by P goes up with the experience E, that is what we call a well-posed problem in the world of board games.
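For reference, this framing can be written compactly as below; it follows the standard well-posed learning problem definition, and the inequality is only one informal way of saying "performance improves with experience" (my notation, not something written on the slide).

```latex
% Well-posed learning problem for the tic-tac-toe agent:
%   T : task         -- playing tic-tac-toe on an n x n board
%   P : performance  -- number (or fraction) of games won by the agent
%   E : experience   -- games the agent plays against itself
%
% The problem is well posed if performance at T, measured by P,
% improves (on average) as the experience E grows:
\[
  P\bigl(T \mid E_k\bigr) \;\le\; P\bigl(T \mid E_{k+1}\bigr),
  \qquad k = 0, 1, 2, \ldots
\]
```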
To design the learner for a board game, I would like to take you through five points; this is how the algorithm is defined. The first is choosing the training experience, which we just discussed. We can have direct training experience or indirect training experience. Direct training experience is when you have a board state (for example, this sample board of X's and O's) and alongside it you are also given the best move for that state; say the best move here is to place the cross in this cell. That is direct feedback. The problem with direct feedback is that you have to supply that expert knowledge alongside the rules of the game. In this work, indirect feedback is employed instead: only the rules of the game are given, and the rest, such as whether a state is winning, the agent has to decide for itself.
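As a rough illustration (not code from the paper, whose source is not shown), the two kinds of training experience might be represented as follows; the board encoding, the move indices, and the field names are assumptions made for this sketch.

```python
# Sketch: direct vs. indirect training experience for 3x3 tic-tac-toe.
# Boards are encoded as flat tuples of 'X', 'O', or '' (empty), row-major.

# Direct experience: each example pairs a board state with the best move,
# so expert knowledge must be supplied alongside the rules of the game.
direct_example = {
    "board": ('X', 'O', '',
              '',  '',  '',
              '',  '',  ''),
    "best_move": 4,   # hypothetical expert label: X should take the centre
}

# Indirect experience: only the trace of a self-play game and its final
# outcome are recorded; the agent must work out for itself which of the
# intermediate states were actually favorable.
indirect_example = {
    "moves": [(0, 'X'), (1, 'O'), (4, 'X'), (2, 'O'), (8, 'X')],
    "outcome": "X wins",   # X completes the 0-4-8 diagonal
}
```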
The next step in the algorithm is selecting the target function. We set it up so that if the agent wins from a state, the final board is given a utility value of +100; if the agent loses, the final board is given a utility value of -100; and if the game is drawn, the final board is assigned a neutral utility value. That defines the target function for final board states. But what happens if we are in the middle of one of these games, at a board state that is neither a win nor a draw but somewhere in between? How do we compute the utility value for such intermediate board states?
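A minimal sketch of how those training values could be assigned is given below. The draw value of 0 and the usual trick of backing up the current estimate of the successor position for intermediate boards are assumptions, since the talk does not spell either out; the `outcome` label is a hypothetical input produced elsewhere by the game logic.

```python
# Target (training) values for board states, as described on the slide.
WIN_VALUE, LOSS_VALUE, DRAW_VALUE = 100, -100, 0   # draw value assumed to be 0

def training_value(outcome, successor_estimate=None):
    """Return V_train for a board.

    `outcome` is 'win', 'loss', 'draw', or 'ongoing'.  For an ongoing board
    the caller passes `successor_estimate`, the current learned estimate
    V_hat of the position reached after the agent's next move, and that
    estimate is backed up as the training value.
    """
    if outcome == 'win':
        return WIN_VALUE       # agent (playing X) won the game
    if outcome == 'loss':
        return LOSS_VALUE      # agent lost the game
    if outcome == 'draw':
        return DRAW_VALUE
    return successor_estimate  # intermediate board: back up V_hat(successor)
```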
This is where the variable part comes into the picture: some minor tweaks are required when going to a higher board state. For the 3x3 tic-tac-toe, these are the board features we have defined, the categories we use for judging whether an intermediate board state is preferable or not. Let me zoom in here and take you through this slide.

We take the features as follows. x1 is a feature where we check how many lines (rows or columns) contain a single X, counting how many such lines there are in the matrix. Similarly, for x2 we check how many lines contain a single O. For x3 we check how many lines contain two consecutive X's, and likewise we count how many lines contain two consecutive O's. In this way we have defined six parameters, and among these six we also have the completed lines, OOO and XXX, the latter being the winning state for the agent.
Using these features we also compute the intermediate training values. We have defined all the feature values here, and these are the weights that come out of the training; if you take a dot product of the two, you get the linear function we use for evaluating an intermediate board state, that is, an estimate of how likely a win is from that state. Mind you, this function could also be quadratic or cubic; but to keep the algorithm simple, this being a very early-stage study, we are keeping it linear.

So I hope that is clear.
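For concreteness, a minimal sketch of that feature extraction and linear evaluation is shown below, assuming the six counting features just described plus a constant term w0. The exact feature definitions (for instance, counting only lines that contain a single player's marks, and ignoring diagonals, since only rows and columns were mentioned) and the board encoding are my reconstruction, not code from the paper.

```python
def lines(board, n=3):
    """Yield every row and column of an n x n board (flat tuple, row-major)."""
    for r in range(n):
        yield [board[r * n + c] for c in range(n)]
    for c in range(n):
        yield [board[r * n + c] for r in range(n)]

def features(board, n=3):
    """x1..x6: number of lines holding exactly 1, 2, or 3 of one player's marks."""
    x = [0] * 6
    for line in lines(board, n):
        xs, os = line.count('X'), line.count('O')
        if xs and not os and xs <= 3:
            x[2 * (xs - 1)] += 1       # x1 (one X), x3 (two X's), x5 (three X's)
        elif os and not xs and os <= 3:
            x[2 * (os - 1) + 1] += 1   # x2 (one O), x4 (two O's), x6 (three O's)
    return x

def v_hat(board, weights):
    """Linear evaluation: V_hat(b) = w0 + w1*x1 + ... + w6*x6."""
    w0, *w = weights
    return w0 + sum(wi * xi for wi, xi in zip(w, features(board)))
```

On an empty board every feature count is zero, so v_hat simply returns the constant term w0.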
Now let me go back to the presentation slides. The target function is now defined, and we have also outlined the algorithm and the feature vector we are going to use. This slide is just presenting those features: this part shows the x0 case, this shows the x1 feature, and this shows the x6 feature, the last one. It is very logical that the agent should try to minimize this last one, the count of lines filled with O's, and maximize the corresponding X count, assuming that the agent is playing X and moving first.
The methodology used in this case is that of a well-posed learning problem, and there are four components in it. The first is the experiment generator: in this case, the experiment generator is what gives us a new board to try out, and here it is simply an empty 3x3 tic-tac-toe board. You could also start from a randomized state, but for simplicity we use an empty board. The second part of the algorithm, or of the architecture, is the performance system. Whatever state the performance system is given, it makes the best move for that state according to the feature weights it currently has, and it produces the solution trace of the game tree: which game was played and what the end result was. That is the performance system. Next is the critic.
What the critic does is attach the utility values to those board states, so that you then have training examples. This also gives me the opportunity to explain what a training example is. The games the agent plays against itself are represented in the form of training examples. A training example is, you could say, a pair: it contains a board state, for example this sample board, which currently favors X because X is going to win from here, and it contains the utility value for that state; for example, the utility value here came out to be 52 in favor of X. That is the training experience the agent is using.
After you have all the training examples with their utility values, you try to hypothesize what the ideal target function, that is, the ideal weight vector, should look like. That is the role of the generalizer. Once you have some idea of what the feature weights should look like, a new experiment is generated, another game is played, and the weights are optimized again using the least mean squares (LMS) optimization technique. Least mean squares can also be viewed as stochastic gradient descent: the stochasticity of the self-play games is what drives the gradient descent of the algorithm. You can repeat this loop as many times as you want, playing as many games as you want, and those results will be shared shortly.
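The LMS weight update the generalizer performs has the standard form sketched below; it reuses the features() and v_hat() helpers from the earlier sketch, and the default learning rate is an assumption (the Q&A later quotes a value of roughly 0.3 to 0.4).

```python
def lms_update(weights, board, v_train, eta=0.3):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i."""
    error = v_train - v_hat(board, weights)   # how far off the current estimate was
    x = [1.0] + features(board)               # x_0 = 1 pairs with the constant w0
    return [w + eta * error * xi for w, xi in zip(weights, x)]
```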
This is the class diagram for the proposed architecture that was implemented. You have the experiment generator, the generalizer, the performance system with its variables defined, and the global player with the moves it can play and the functions it can use, and you have the critic's functions as well. Because of the lack of time I will not go deep into it, but we have listed all the functions that are currently used here.
This is the overall algorithm we are employing. Mind you, the input is the number of training samples, that is, how many games you want to train the agent over. Do you want to train it over 100 games, in which case the accuracy will of course be lower, or over 50,000? That is the input you give to the agent, and naturally a larger number of games played will increase the accuracy of the agent. Further down in the algorithm, we initialize the weight vector to 0.5 for every feature we defined earlier, the counts of X's and O's per line, so all the weights are 0.5 in the initial case. The games counter starts at zero, and the numbers of wins, losses, and draws are also counted at the end. This is just the architecture described above playing the game against itself and optimizing the weight values; at the end it returns the final optimized weight vector and the numbers of wins, losses, and draws.
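Putting the pieces together, the training routine on this slide might look roughly like the self-play loop below. The play_self_game helper is hypothetical (it stands in for the performance system plus critic, returning the list of (board, V_train) pairs for one game and that game's result), and lms_update is the sketch from earlier; the actual code was written from scratch by the authors and is not reproduced here.

```python
def train(num_games, eta=0.3, num_features=6):
    """Self-play training loop: returns final weights and win/loss/draw counts."""
    weights = [0.5] * (num_features + 1)      # w0..w6 all initialized to 0.5
    counts = {'win': 0, 'loss': 0, 'draw': 0}

    for _ in range(num_games):
        # Performance system + critic: play one game against itself and
        # label every board seen with its training value.
        examples, result = play_self_game(weights)
        counts[result] += 1
        # Generalizer: one LMS step per labelled board state.
        for board, v_train in examples:
            weights = lms_update(weights, board, v_train, eta)

    return weights, counts
```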
Now we can compare the results across the numbers of games trained. This table explains the results we got. If you train the agent over a thousand games, the wins, losses, and draws came out as 761, 82, and 157, and the win/draw ratio was 4.85 (wins divided by draws, 761/157). The win/draw ratio is the metric you can use to check how the agent is performing, and similar results are shown for the other training runs as well, going up to 150,000 games, with the corresponding numbers of wins, losses, and draws. The thing to note here is that the win/draw ratio keeps improving as more games are played. You can also observe the feature weights here as the games were played; mind you, the agent is trying to make sense of the w6 feature we defined, which is the number of lines of O's.
So the agent is trying to minimize it: the initial value was around -115, and the agent keeps pushing it further negative, down to around -175 and beyond, trying to negate that possibility altogether. It is trying to destroy whatever the opposing agent is trying to do to it.

This is just a plot giving an overview of what the previous slide showed. You can see that with the number of games played, the win/draw ratio improves, you could say, almost linearly. And this is the evolution of the weight vector with the number of games played: the w6 weight defined here stays consistently low throughout, remaining down the line, together with w0 and the weight for a single O in a line.
In conclusion, this is a very early study, and it attempts to give a base implementation for the idea of generically solving tic-tac-toe. The win/draw ratio is also very plausible in this, in that we can see a roughly linear curve improving with time. As for how we can extend it to future board states: if you have the feature vector defined as shown here, you need to make only a few tweaks in your code. Let me just bring that slide up again. Yes, you need to make only a few tweaks in the feature vector you are considering in order to extend it to higher board states as well. We can also observe that even with only a linear approximation we are getting very plausible results, so a higher-board version would add more feature vectors to this. We also found in the study, though we did not include it in the conclusion, that some of these feature vectors might not be required as we went further down the line. So this is just a very early study on how we can make the agent more generic.
That is the end. If you have any questions, or want any points explained further, do let me know. That was all for the presentation, and I think the rest of the slides... oh yes, there is an appendix as well. While we were training the agent, these were some of the sampled games it played against itself. You can see figure 19 on your side, where X wins: this is just for reference purposes, showing that after 500,000 games played the agent arrived at a win in this situation, and in the other case the agents did not end in a draw either. This is just the performance system and the critic at work. We added it here for reference because of the limit on the number of slides allowed. So, I am open to questions now. Thank you.
Which optimizer have you used in your presentation, I mean in its implementation?

We have used the least mean squares optimizer. The least mean squares optimization looks something like this; I have noted it down here as well. If we have a weight w_i for feature i, it is updated from its existing value as w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i. Here eta is the training rate, usually 0.3 or 0.4 in our case, x_i is the feature value, and the bracketed term is the error: because of the least mean squared error it comes out like this, where V_train is the ideal (training) value and V_hat is our approximation. So this is how the optimization happens at the end of each stage, using least mean squares.

Okay.
Have you implemented the Adam optimizer?

That is a possibility, sir. As I already mentioned, because this is a very early phase of exploring a generalized system, we are only exploring least mean squares, and we have also kept the learned function very simple, a linear one. The Adam optimizer is also one of the ways we could explore this further, but it has not been tried in this study yet.

Okay, okay, no problem. Okay, good.
Are you able to hear me? Yes, Mr. Kunal. Okay.

So I have a few questions for you. I want to ask whether least mean squares is the optimizer, or whether it is the loss function.

Is it a loss function? Basically, sir, least mean squares looks something like this...

You said it is a loss function, right? So what optimizer have you used in your research?

It is stochastic-based...

You are talking about stochastic optimization?

Yes. Can you name the optimizer that you have used in your work?

Yes, sir: stochastic gradient descent.

Okay. So which package have you used for the implementation of this?

The whole code was made from scratch.
I haven't used any package for this.

Okay, you haven't used anything; you have written everything yourself?

Yes, it is only Python code. Mind you, even the least mean squares update and the error logic, which looks something like (V_train minus V_hat) squared, were written from scratch.

Okay, okay, so you have written everything from scratch. Have you used a Markov decision process somewhere?

No, sir, we haven't.
Okay, no other questions. Thank you.

Thank you. So, does the session have any further questions?

No, sir, I think he has explained all the queries very well.

Thank you, no questions.

Thank you, Mr. Kushal, and apologies for not getting your name right.