Statistical Machine Learning Part 1 - Machine learning and inductive bias | Tübingen Machine Learning
Summary
Core Theme
This lecture introduces statistical machine learning by presenting motivating examples and defining machine learning as the process of automating inductive inference, emphasizing the necessity of inductive bias for any learning system to function.
Transcript
Good morning everybody. Today is the first lecture in statistical machine learning, and we would like to start with a couple of motivating examples. The first example I want to talk about is handwritten digit recognition. This is one of the founding problems of machine learning, and in the 1990s it was studied quite a lot under the name of pattern recognition. The problem is as follows: assume you're at the postal service and you want to deliver a letter. The letter comes on some moving belt, you have a camera that takes a picture of the letter, and now you want to automatically recognize the address and the zip code of the city where the letter is supposed to go. The problem is that, given this photograph, it's not so easy to write down a handcrafted rule that says: look, these letters are an F, and this digit is a 7, and so on. You need systems that are more flexible: you don't want to design the rule by hand yourself, you want the system to find a rule that can recognize these digits.
Looking at it a bit closer, assume that we have a photograph of a digit; here on the slide we have the digit 3. In this case it's a 16 by 16 grayscale image, so each pixel in this image, each of these little squares, has a grayscale value: a number between 0 and 1, where 0 means white, 1 means black, and every number in between is some shade of gray. Now, what is it that the computer sees about this digit? It doesn't see it as an image; it sees it as a vector of, in this case, 256 numbers between 0 and 1. As shown at the bottom of the slide, what we are supposed to do is learn a function: it takes as input such a vector with 256 entries between 0 and 1, and the output of this function is supposed to be the digit represented by this particular vector. This is one of the founding problems of machine learning, and if you look at this particular slide you can already see that it's not so simple. We have different versions of the digits 5, 9, 7, and 1, and you can already see that it's quite easy to mix up the digits: this 9 here looks a bit atypical, this one might even be a 5, and the difference between some of these sevens and some of these ones is also not so easy. The idea is to use machine learning to solve this problem of handwritten digit recognition, and in one of the first exercises in this class you're going to solve this problem yourself with a very simple algorithm.
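(To make this concrete, here is a minimal sketch of one such simple algorithm: a nearest-neighbor classifier on the raw 256-dimensional pixel vectors. The lecture does not say which algorithm the exercise uses, so the approach and all names below are illustrative assumptions.)

```python
import numpy as np

def nearest_neighbor_predict(train_X, train_y, test_X):
    """Classify each test digit by the label of its closest training digit.

    train_X: (n, 256) array of grayscale vectors with entries in [0, 1]
    train_y: (n,) array of digit labels 0..9
    test_X:  (m, 256) array of test vectors
    """
    preds = np.empty(len(test_X), dtype=train_y.dtype)
    for i, x in enumerate(test_X):
        # Euclidean distance from x to every training image
        dists = np.linalg.norm(train_X - x, axis=1)
        preds[i] = train_y[np.argmin(dists)]
    return preds

# Toy usage with random "images"; real data would be 16x16 scans flattened to 256 values.
rng = np.random.default_rng(0)
train_X = rng.random((100, 256))
train_y = rng.integers(0, 10, size=100)
test_X = rng.random((5, 256))
print(nearest_neighbor_predict(train_X, train_y, test_X))
```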
Another problem that's also quite old is spam filtering. In the 1990s, when email came up, spam maybe was not so much of a problem, but it soon became one: everybody gets spam emails, and you want to design a filter that can automatically tell normal emails apart from spam emails. Again, it might be easy to give a couple of keywords that hint that a particular email is spam, but in general this is not so easy, and handwritten rules often don't work very well. So what all the email programs have internally is a so-called spam filter. The idea is: you get your incoming emails, and whenever you encounter a spam email in your inbox you press the button "spam". In the background there's a machine learning classifier that tries to classify emails into spam and non-spam, and whenever you press this button the classifier is updated in an online fashion. Hopefully, in this way the spam filter is always up to date and can detect and separate the emails which are spam. This is a typical online learning problem, as opposed to the handwritten digit problem, where we train the machine learning classifier once and forever and then hopefully it can classify all the digits. In digit recognition nothing really changes: the digits are there, people have their handwriting, but there is no evolution over time. In spam filtering, by contrast, it's a game between yourself and your opponent, the person who wants to send the spam emails: whenever you have updated your spam filter, that person is trying to invent some new spam emails, and this keeps going on over time. So you want to solve the machine learning problem in an online fashion, always with the most up-to-date tools, and this is called an online machine learning problem.
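(As an illustration of such an online update, here is a minimal sketch of a perceptron-style filter whose weights change each time the user labels an email; the feature encoding and all names are my assumptions, not something specified in the lecture.)

```python
import numpy as np

class OnlineSpamFilter:
    """Tiny linear classifier updated one email at a time (perceptron rule)."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        # +1 = spam, -1 = not spam
        return 1 if x @ self.w + self.b > 0 else -1

    def update(self, x, y):
        # Called whenever the user labels an email (e.g., presses "spam").
        if self.predict(x) != y:
            self.w += self.lr * y * x
            self.b += self.lr * y

# Usage with hypothetical bag-of-words features:
filt = OnlineSpamFilter(n_features=3)
email = np.array([1.0, 0.0, 2.0])  # e.g., counts of three keywords
filt.update(email, y=1)            # user marked this email as spam
print(filt.predict(email))
```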
A very important machine learning application is object detection. Imagine self-driving cars: here you have a scene from some road with traffic, and the self-driving car is supposed to recognize that there are pedestrians, there are other cars, there might be a traffic light, other traffic, a cyclist, and so on. So the problem of object detection is: given a complex image, a scene like this one, you want to recognize what is in the scene. It is a more general version of handwritten digit recognition, but with much more complicated scenes and many, many more types of objects. This problem was one of the important problems that needed to be solved for self-driving cars, and self-driving cars are one of the big applications of machine learning out there. People have been trying to build self-driving cars for many, many years, and the first very important breakthrough happened in 2005, when there was a race in which cars were supposed to drive autonomously through a desert for a hundred kilometers. They all started at the same point and just got the GPS coordinates of the point where they were supposed to go, and for the first time, in 2005, a couple of cars managed to solve this challenge. From then on, self-driving cars became much more prominent and have been developed further, and they are now about to be rolled out: in some cities in the US you can already try self-driving cars. In Germany they are not yet out there; there are some technical problems, but there are also some problems that come from the law and from responsibility.
One big application area of machine learning, already for quite a few years now, is the field of bioinformatics. There are many, many machine learning algorithms that are applied to a wide variety of problems. For example, one of the starting problems was to detect different types of diseases from microarray data. I'm not a biologist, but my understanding of microarray data is: you have a certain cell, there are proteins that might be active or not, and there is some kind of lab experiment that can measure whether a certain protein is active or not. In the little image on this slide, each of the green or red dots stands for a particular protein being active or not active. Now you want to classify different types of cancer cells, for example, based on this pattern. So it's a bit like handwritten digit recognition: you have a matrix consisting of zeros and ones, say green and red dots, and you want to say which of these patterns belongs to a particular disease, because the cell behaves in a certain way. Another application is drug discovery. Say you have a disease and you want to design a drug. To be able to do that, you have some protein, maybe of the virus, and you want to knock it out. So you want to find a small molecule that can bind to this protein and then do certain things to it. You need to first find a molecule that can bind to the protein, and here on this slide you see an example: the protein has a very complicated three-dimensional structure with little pockets, and you need to find a molecule that can bind exactly inside such a pocket. Again, this would be very expensive to try in a lab: you have thousands of different molecules that might work, but you don't want to run a lab experiment for all of them. So you might want to pre-screen the different molecules, and to do that you use machine learning. Again, you have a certain description of these pockets: how large is the pocket, what are maybe the molecules that sit at the side of the pocket, what are the binding energies of all these molecules. Based on this description, you want to predict whether a certain molecule is going to fit into this pocket or not. This is again a classification problem that you might be able to solve with machine learning. Just for the people who work in bioinformatics, I have one slide that shows all the different fields in which machine learning is used in bioinformatics; if you want, you can look at it at home.
Going from bioinformatics more towards medical applications: one of the very big fields, also very prominent currently in machine learning, is applications in medicine. For example, in personalized medicine you want to tailor different therapies to the genetic disposition of particular people. Or here I have an example of skin cancer detection, which is again some kind of object detection problem. The idea is: you as a person think you have an odd patch of color on your skin, and you wonder whether it has to do with skin cancer or not. So you take your smartphone, you take a photograph of the skin, and then you use an automatic classifier that might say: this is very harmful, or this might be harmful, or this is not harmful. Depending on the outcome, you can start consulting a doctor. The impressive thing is that these systems by now are at least at the accuracy of medical experts who have been trained for years to detect different types of skin cancer. So in this particular application, machine learning is really a very powerful tool that can support doctors, who can then focus on other things.
There are many, many more applications in science, and here I just want to outline one which is a bit funny: it's in archaeology. You would think that archaeology is maybe the last field where machine learning could be of an advantage, but here is a nice paper, published in 2019, so last year, in Nature Communications, where people have been analyzing the human genome from ancient findings. They tried to reconstruct the development tree by which different kinds of humans have been developing, and they found evidence that there must be an additional branch in this tree that has not been discovered yet. So we have not found any bone of this particular branch of human development, but it must be there, because otherwise you couldn't explain the data that you currently have if you don't assume that such a branch exists. I think this is a cool application, because it shows that machine learning not only can solve very specific classification problems, but it can really discover things that you didn't know before.
One of the fields where machine learning is very powerful nowadays is language processing. A first breakthrough was in 2011, when the computer Watson won a quiz show in the US called Jeopardy; it is a bit like the German "Who Wants to Be a Millionaire". Questions are asked, and then the contestants, in this case a computer, are supposed to answer. The interesting thing here is that these questions are more like word games; it's not so much about who won the soccer championship in 1955 or so, it's more a kind of word game, and the surprising thing was that this computer Watson was able to beat the best Jeopardy players at that time. By now, language processing is very, very prominent: you have Siri on your phone, or Alexa, and you can also try automatic translation systems. If you haven't seen it before, DeepL is one of my favorite translation services: you paste in an English sentence and it spits out the perfect German sentence, or the other way around. This is really impressive, and it wouldn't have been possible a couple of years ago.
One last thing I want to mention is AlphaGo; many of you might have heard about it. Chess is an old game which was mastered by computers already in 1996: at that time there was a computer which was able to beat the world champion in chess at that time, Garry Kasparov. However, at that time they didn't use any machine learning for this; essentially what they did was a very clever search procedure combined with a very, very powerful computer. So essentially, at that time, 1996, for chess, they managed to look ahead a couple of steps and evaluate all the different possibilities and directions the opponent might take, and in this fashion managed to beat the best chess player, who does not have such huge computational power to look ahead for, say, five steps. Now it's a very different story with AlphaGo. With AlphaGo, in 2016, DeepMind managed to program a Go-playing machine purely using machine learning, and that was really a big breakthrough at the time; sorry, I don't have a slide for this. What happened is essentially that they used neural networks to represent the situation on the board. They first fed the neural network with games that had been played by experts, to train it to make the same kind of moves that experts have made, and then, in the next step, they let two different versions of AlphaGo play against each other in order to improve and improve and improve. In the end they managed to beat the world champion at the time.
So now we've seen many examples where machine learning plays an important role. But what is machine learning? How can you define it? Is there a definition at all, and how could you explain what happens in the background? Of course, we're going to spend a whole semester trying to discover this, but let's try to start with a couple of definitions. If you look at what Wikipedia or many online blogs say when they try to explain machine learning, you will find something along these lines: machine learning is the development of algorithms which allow a computer to learn specific tasks from training examples. There are a couple of words that are really important here. The first one is "specific tasks". Machine learning is not, or at least in my opinion is not, about building general artificial intelligence. You don't want to build an agent, like a robot, that is as intelligent as a human. What we try to do in machine learning is build algorithms that can solve very specific tasks: it could be skin cancer detection, it could be language translation, or it could be playing Go. But we are not, at least currently, trying to build an agent that can do all these tasks at the same time. Whenever you want your algorithm to do a new task, you need to train it, and for this training you typically need training examples: examples of the task that the computer is supposed to learn. For example, in skin cancer detection you need images of different pieces of skin, and then you need the label which says "this is skin cancer" and "this is normal skin". The next point is "learning": it means that the computer does not only memorize the seen examples but can generalize to previously unseen instances. Of course, there would be no point in skin cancer detection if you could only show the computer the pieces of skin that you already know. What you want is to use these training examples to train the computer, and then later on a new patient comes in, and you want to say for this new person whether they have skin cancer or not. This is what we call generalization: we train on a couple of instances, but the rule that we find is supposed to generalize to new instances of the same problem. Ideally, the computer should use the examples to extract a general rule for how the specific task has to be performed correctly. So what happens in the background, or what is supposed to happen, is this: the computer takes its training examples, it has some mechanism by which it can generate a general rule (we are going to talk about many of these mechanisms in the lecture), and then hopefully a function comes out that is able to solve this task in a very general way.
On a high level, this is what machine learning is about. Of course, this doesn't help you very much at the moment, but we are going to see many examples in the lecture. However, I still want to show you yet another explanation, one I like a bit more. To be able to explain what I mean, we first need to figure out what deduction and induction are. As you're going to see, from time to time I have questions on my slides; if this were a normal lecture with people sitting in the audience, I would now ask you this question. The questions are always in bold font or in capital letters. As you are watching this video at home, I suggest that whenever such a question comes up, you take a bit of time: you stop the video, you think about the question, and then you proceed, because this is also the way we would do it in a lecture. These questions often help you to recap certain things or to think about certain aspects of what we are currently talking about. So at this point I would like to ask you whether you know what deduction and induction are; maybe you want to think about it for a minute before you continue.
So here's the answer. Deduction, or deductive inference, is the process of reasoning from one or more general statements (premises) to reach a logically certain conclusion. Essentially this is what happens in math: you say here is statement one and here is statement two, and if these statements are true, then I can draw a certain conclusion from them. Here is an example. Premise one: every person in this room is a student. Premise two: every student is older than 10 years. The conclusion is: every person in this room is older than 10 years. The important point is: if the premises are correct, then the conclusions are correct. You arrive at the conclusions by the rules of logic, and you can always be certain that if the premises are correct, then your conclusion is correct as well. This is a very, very nice framework, of course; all of logic is built on this, all of mathematics is built on this. However, the big problem with this kind of thinking, for machine learning, is the clause "if the premises are correct". Typically you can never be certain about many things; there's always an uncertainty attached to them, and whenever a statement is not completely sure, this kind of reasoning doesn't apply anymore. This is why deduction is not very well suited to machine learning tasks; we use different mechanisms.
The other principle, which is sort of the opposite of deduction, is induction. Inductive inference is a kind of reasoning that constructs or evaluates general propositions derived from specific examples. Induction is what we often do in science: we observe many things, we see some kind of pattern, and then we make a hypothesis and think "this pattern is what is always going to happen", and then we keep on testing this hypothesis, whether it's true or false. This process is induction. Here's an example. If you are a kid, or you have a kid (maybe that's closer to what is going to happen to you soon): say you have a kid, and what you're going to see when the kid is one or two years old is that it keeps on dropping stuff. It takes something, it drops it, it takes another thing, it drops it, and it keeps busy with this process for half a year or a year, and the kid is always astonished that the thing ends up on the floor. Eventually the kid learns that whenever it drops things, these things are going to fall to the floor, and not to the ceiling. This is a process of inductive inference: you have this experiment, you keep on dropping stuff, you observe that it always falls down, and then your conclusion is that whenever you drop stuff, it is going to fall down. That is inductive inference. The important thing is that you can never really be sure that your conclusion is correct, and this applies to all of science; there is a lot of interesting philosophy of science that tries to explain what it means to learn something at all, how we can explain something, and so on, because we cannot really be certain about it. Humans do inductive reasoning all the time: essentially all our life consists of coming up with good models of the world and performing induction. Here's one more example.
Say you come to every lecture 10 minutes late: I start the lecture, and after 10 minutes you enter the room. For the first couple of lectures I don't really complain, so you conclude: well, maybe she doesn't really care whether I'm late or not. But you cannot be sure; maybe at lecture 10 I really get annoyed, and then something happens that you didn't expect. So here is a situation of uncertainty in your reasoning: we cannot be sure about the conclusions that we make. Now, why am I telling you all of this? Here is the second characterization of what machine learning is: machine learning tries to automate the process of inductive inference, and I find this a very powerful explanation of what machine learning is. Inductive inference means we look at training data (for example, because we always drop things, we have training data) and we build up some hypothesis. This is exactly what machine learning is supposed to do: we give some training examples to the computer, and the computer is then supposed to learn a general rule, to come up with a hypothesis of how it could explain future events, future examples of the same process. The idea is that machine learning automates this process: we maybe give some basic framework, but then the algorithm is supposed to come up with this rule in an automatic fashion. This is an explanation of machine learning that's very general, of course, but I think it really captures what is going on.
Now I would like to discuss a bit why people think this can work at all, or whether it can work. I mean, you see examples that it works, so probably it can work, but there might be some assumptions that we need to make. To do this, I want to consider a particular regression example. What we are given are pairs of input points and output values (x_i, y_i), so x_i is always the input point and y_i the output value. You see a plot of some data; take this very intuitively for now, we are going to make it much more formal later on, but for now it's really about intuition. Look at the data at the bottom of the slide: we have four data points, marked by crosses, so you always see the x value on the x-axis and the y value on the y-axis. What we want to do is learn a general function that can predict the y values from the x values: a function f that goes from the space curly-X, the space of all input points, to the space curly-Y, the space of all output points. Now, if I would ask you in a lecture: what value would you predict if the input value were 0.4? You might want to look at this plot and think about it a bit yourself, but I'm sure the answer most of you will come up with is the following. On the x-axis you have this scale that goes from 0 to 1, and here we have 0.4, the point I'm interested in. What would probably be the output at this point? Well, it's going to be roughly here: if we assume the data follows a straight line, the output at this point might also be 0.4. This is the straightforward kind of conclusion that you could draw from these data points. But you could also come up with other conclusions, and here are two examples. The first guess, the one I've just explained to you, is that these data points have been generated by a linear function, this red line; the red line is a good fit to your existing data, and you can use it to predict a value at the point you're interested in, 0.4. But it could also be the case, as you see here on the right-hand side, that for some reason you don't think it's a linear function: you come up with this wavy function which goes up and down, and, like the red line, it also fits your existing data very well. But if you would use this function to predict, you would get a different prediction: the prediction for 0.4 would now be maybe 0.8 as your output value. The question is: which of these two predictions is better, or which of these two red curves is more plausible? This is one of those points where I would like you to stop the video for a moment and come up with arguments for why the first one might be better, or maybe why the second one might be better. What are the differences, and what might be criteria along which we could decide?
OK, I hope you have come up with a couple of ideas for why each of these functions could be better. Typically, the answers I get in these lectures, if there is a real audience in front of me, are these. Many people say: well, I guess one is better because it's a simpler function, and there's no reason, if you just see the data, to fit it with such a complicated function as on the right-hand side; so we would prefer the drawing on the left-hand side. Some people would also use the words "Occam's razor", because they have heard about this before, and would say Occam's razor tells you to always prefer the simpler solution that can explain your data, and that this is a reason to prefer the left-hand side. All these things are correct up to a certain point, but we will see later in this lecture that there are more twists to this explanation. Then there are also people who tend to argue for the right-hand side. They say: well, maybe we have some background knowledge; we know it's a physical phenomenon, and this phenomenon is not a linear phenomenon but something that goes up and down. Maybe it's the temperature at different times of day, and these points have been recorded at night, with the daytime measurements missing; 0.4 is at daytime, and typically temperature goes up and down between day and night. If you have this background knowledge, maybe guess two would be better and might lead to a better prediction than guess one. So the bottom line I want to make here is: if you don't have any extra knowledge about your data, there is no way to decide which of these guesses is really better. You need to have extra knowledge, or you need to make assumptions. One such assumption could be that the function should be simple; then you would go for the left-hand side. Or the assumption could be that you're trying to fit a periodic function; then you go for the right-hand side. However, you cannot make a prediction if you do not make any assumption, any kind of bias, in one direction or the other.
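(A small sketch of the two guesses, under invented data: the four points below lie exactly on the line y = x, and we fit both a linear function and a fixed-frequency sinusoid; both fit the training points reasonably well, yet they disagree at x = 0.4. The frequency w is an assumption standing in for background knowledge; none of these numbers come from the slide.)

```python
import numpy as np

# Invented training data, lying on the line y = x
x = np.array([0.1, 0.3, 0.7, 0.9])
y = np.array([0.1, 0.3, 0.7, 0.9])

# Guess 1: linear inductive bias -- fit y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)
print("linear prediction at 0.4:  ", a * 0.4 + b)

# Guess 2: periodic inductive bias -- fit y = c1*sin(w*x) + c2*cos(w*x) + c3
# for an assumed frequency w ("background knowledge"); this curve can also
# fit the four points approximately, but extrapolates very differently.
w = 12.0
def basis(t):
    return np.column_stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)])

coef, *_ = np.linalg.lstsq(basis(x), y, rcond=None)
print("periodic prediction at 0.4:", basis(np.array([0.4])) @ coef)
```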
Here's one more aspect. Now assume that I tell you the function values have been generated randomly: I keep on generating random data, simply random points in the unit square, uniformly distributed, and we observe all these data points. You now see that we've drawn many more points, and you can't see any pattern anymore. If I now ask you what your prediction at the point 0.4 is, you would probably say: well, in fact I don't know; it could be anything between zero and one, and I have no particular reason that it should be 0.4; it could be anything else as well. So here's the insight: if there is no pattern that connects the input to the output value, you won't be able to predict anything. I would like to summarize this discussion now. The first consequence to take away is: we will only be able to learn if there is something to learn in our data. This "there is something in the data" sounds very trivial, but in practice it is often not so obvious. Say you have certain input data in, say, medicine, and you want to predict a certain output, say a particular type of disease, and your input data is the temperature of the person, what the person has been eating during the last days, the age of the person, and the shoe size. Maybe this data is not enough to predict this particular disease; in that case there is no connection between the input data and the output data, and you can try whatever you want, your machine learning algorithm is never going to succeed. This is something really important to keep in mind when doing machine learning: it sounds trivial, but in practice you might stumble into this problem very often. So the first thing: the output needs to have something to do with the input.
Often, a kind of bias or assumption we make is that similar input points should lead to similar output values. Again, if you have certain patients and you want to predict a disease, and you have two very similar patients, the intuitive view would be: these patients behave so similarly, so probably they have the same disease. This is the kind of inherent rule that governs many, many machine learning algorithms. Of course this is very abstract, but still, this is what is in the background of machine learning in many applications. The next thing is that there needs to be a simple relationship, a simple rule, that can predict the output from the input. If the function is extremely complicated, if your function is a fractal and you are supposed to learn this fractal from ten data points, it is very unlikely that you're going to succeed. So the function needs to be reasonably simple in order for you to succeed. The more training data you have, the more complicated a function you will be able to afford, but there needs to be some simple relationship; you can't learn the most complicated function from just three data points, unless you make very, very strong assumptions. With the last point, that we tend to look for a function that is simple in some respect, we need to be a bit careful: we are going to see later, in statistical learning theory, what "simple" really means. It is sort of Occam's razor, but not just Occam's razor; there are more aspects to this, and we are going to discuss it towards the end of the lecture, once we've seen statistical learning theory.
What is now important is that these assumptions, which we have on this slide, are rarely made explicit. People run machine learning algorithms, they press many buttons, they try things out, they look at training and test errors, and so on. However, you need to be aware that these assumptions are always made in machine learning, even though it's often a bit unclear what exactly the specific assumptions are that a certain algorithm makes. So always keep in the back of your mind that whenever machine learning is applied, assumptions are being brought in, and if these assumptions are wrong, it's very likely that the function you learn is also wrong. You might want to be aware of what the assumptions are that really go into your particular application.
The second consequence: we said we are going to look for a simple function and so on, but the more important thing is that we need to have an idea of what we are looking for, and this idea of what we're looking for is called the inductive bias of a machine learning system. As in the previous example, we need to say in advance whether we are looking for a linear function or for a periodic function. This is our inherent knowledge about the data, about the phenomenon that we are trying to model, and it is called the inductive bias. I want to give you a bit of intuition for what this really means, so I now want to show you a simple example of what this inductive bias means and why we really need it. For this, let me simply draw an example. What we're going to look at is a space that is just one-dimensional: we have points between 0 and 1, and the space consists of a grid, say the points 0, 0.01, 0.02, and so on. These are our input points, and the output space has two labels. Our training data could look as follows; maybe I draw the training data in red. We could have one training point here whose output, if I make a y-axis here, is 1, and maybe we have another point here whose output is minus 1, say. (On the slide the two classes are written as 0 and 1; don't worry, we just have two different classes, say plus 1 and minus 1.) Now assume we have seen a couple of training points: these two red points, and then a couple more points here. This is our training data, and the idea is that we want to learn a function that predicts, for the remaining data points, what the output value is going to be: is it minus 1 or plus 1? So we want to learn a function f-hat that goes from the input space X to the output space Y. Now we consider different situations.
In the first case, we assume we do not have any inductive bias: any of the functions that go from the space curly-X to curly-Y could be the correct function. This sounds great, because you want to say: oh, I don't want to restrict my system; I don't really know what the process is that generates, say, this particular disease based on the genetic information; I don't have any clue, so I don't want to make any assumption that pushes my algorithm into a certain region. I simply want to learn without prejudices. So you don't impose any inductive bias. What are you going to do? Our function space, maybe I put this here: the space of all functions is always going to be denoted by curly-F; this is the space of all functions f that go from the input space to the output space, all functions. Now, how many different functions do we have? Our data space X contains about 100 points, and each point can be mapped to either minus 1 or plus 1, so we have 2 to the 100 many functions in this space. If we write it with the absolute-value notation: the number of functions in this function class is 2 to the 100. These are really a lot of functions. But it sounds good: we have a powerful function space that can model all possible things, and we want to learn without prejudices. We now record our first couple of data points; say we have the five red data points which are here in the plot, and assume we don't have any noise. We assume we are in a perfect situation where the training points we get always give us exactly the correct answer; it's not like in a medical setting where you have some uncertainty. So we live in a world without any noise, which is also a simplifying assumption. Now we have seen five training points, so we know, for example, that this particular point here is going to be a plus one. What does that help us? We can now say: all those functions in the space that would assign minus one to this point, we can rule out; we can simply throw them out, because we know they are not the correct function, since we're in a noise-free situation. Similarly for all the other data points: for each data point that we have, we can rule out all the functions that do not fit this particular data point. What this means is that after seeing these five data points, our function space, the one that still contains the functions that might be correct, is a bit smaller. This function space, maybe I call it F_5 after we have seen five points, is now smaller: it only contains 2 to the 95 many different functions.
OK, so now we've seen these five training points, and we want to predict at a particular test point; maybe I put it here, this is the point x' where we want to predict. What are the possibilities? For this point, we have 2 to the 94 many functions that predict that the label of this point is minus 1, and we have another 2 to the 94 functions that predict that the label of this point is plus 1. So in our function space we have those functions that say f(x') is plus 1 and those that say f(x') is minus 1, and each of these two sets contains 2 to the 94 many functions. Now what are we supposed to predict for this new data point? We don't have a clue: there are as many functions voting for plus 1 as for minus 1, and nothing tells us which of these functions is more plausible for this particular data point. This is where the inductive bias kicks in: if we do not have any bias, there is no way to decide what the correct function is; we need this inductive bias, otherwise we're doomed here. And the trick is, this continues: you might say, well, maybe five data points are not enough, maybe I need 10 more data points. OK, you take 10 more data points, but again, for a new data point you still don't know. No matter how much data you are going to record, for a point that you haven't seen before you will not be able to predict anything. This is really the important point about machine learning: if we do not make any assumption, if we say we do not want to have any prejudice, we do not want to make any assumptions, we do not want to restrict the space of functions in which we are looking for a solution, then it is not going to work.
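(The counting argument can be written out compactly; this is just the lecture's numbers in formulas.)

```latex
% Grid of 100 input points, binary labels:
\[
  |\mathcal{F}| = |\mathcal{Y}|^{|\mathcal{X}|} = 2^{100}
\]
% Each noise-free training example (x_i, y_i) rules out exactly half of the
% remaining functions, so after k examples:
\[
  |\mathcal{F}_k| = 2^{100-k}, \qquad \text{e.g.}\ |\mathcal{F}_5| = 2^{95}
\]
% At any unseen test point x', the surviving functions split evenly:
\[
  \bigl|\{ f \in \mathcal{F}_k : f(x') = +1 \}\bigr|
  = \bigl|\{ f \in \mathcal{F}_k : f(x') = -1 \}\bigr|
  = 2^{99-k}
\]
% so the observed data carry no information at all about f(x').
```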
What I've shown you here is a bit informal; there exists a more formal way of stating the same result, and it is called the no free lunch theorem. The no free lunch theorem essentially says that there is no machine learning algorithm that can always succeed: you always need to make assumptions. I don't want to get more formal about the theorem here; there are some slides later in the lecture if you want to look at that, but I think for our purposes this informal reasoning is enough.
We now consider a function space that consists of exactly two functions: the function that is constantly zero, and the function that is constantly one. So the function always predicts the same label: no matter what the input point is, it's going to predict zero, or it's going to predict one. This is our inductive bias: maybe for some reason we know that the input data is completely meaningless, and the output is always going to be zero or always going to be one; we just don't know which. Again we assume that there is no noise in our system. Now we observe one particular data point, and once we've seen this one data point, we know which is the correct function, because we observe whether its output value is zero or one. After observing it, because we don't have any noise, we know by our inductive bias that the true function is either the constant-0 function or the constant-1 function, and then we can predict for all other values in our space, and we are always going to get the right answer. Of course, this is a bit simplistic. So I've shown you two extreme cases: one extreme case where we say we don't have any inductive bias, and we've seen that then we cannot learn anything at all; and the other case where we have a very, very strong inductive bias, namely we say it's just one out of two functions and we don't have any noise, so with one training example we can explain the world.
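(A tiny sketch, my own illustration: under this extreme bias the function space has two elements, and a single noise-free observation identifies the true function, after which we can predict everywhere.)

```python
# The entire function space under this very strong inductive bias:
F = [lambda x: 0, lambda x: 1]        # constant-0 and constant-1

def learn(x_seen, y_seen):
    """With no noise, one labeled example leaves exactly one consistent function."""
    consistent = [f for f in F if f(x_seen) == y_seen]
    return consistent[0]

f_hat = learn(0.37, 1)                # observe a single point with label 1
print(f_hat(0.9), f_hat(0.05))        # predicts 1 everywhere
```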
Now, the truth for many machine learning applications is obviously somewhere in the middle, and a major part of machine learning consists in finding a good function space for your particular application. This problem of finding a good function space is called model selection. Of course, the examples we had here are really simplistic, also in the sense that we didn't consider any noise, and we didn't consider what happens if the function space F does not even contain the correct function, which in practice might happen very often: say for some kind of medical example you make certain assumptions about what the function might look like, but maybe in truth it looks a bit different, and ideally your machine learning algorithm should still work, at least up to a certain accuracy. Figuring out all these details, how these things fit together, like the amount of training data, the model selection problem (which kind of functions to use), the amount of noise, and so on, is really tricky, and all of machine learning is essentially the science of solving these problems. It is one of the big success stories of machine learning, and in particular of the theoretical part of machine learning, that at least for some standard algorithms we have worked out exactly how these things play together; it is really well understood, and at the end of this lecture course you will at least understand the rough picture.
There are two important terms when it comes to model selection, to selecting a good function class, and these terms are overfitting and underfitting. Again, this is supposed to demonstrate why finding a good function class is so crucial. Consider the following example, which we see on the slide. The true function is a quadratic function, so you have this parabola; in the plot it's the black function. Your training points have been generated from this function with a bit of noise: you see the green crosses in the plot, which are the training points that roughly follow the parabola, but with a bit of noise. Now you want to learn this function, and there are different choices of function class that you could take. You could say: I fit a really simple model, a linear function, and you might come up with this red line. Or you say: I want to fit a polynomial function of degree 20, and you come up with this blue line. What you can see now is that there are different reasons why these things go wrong. The red line somehow seems too simple; it doesn't even fit the training data, and this is what is going to be called underfitting. The blue line is so extreme, it tries to fit each little aspect of your training data, and this is going to be called overfitting; this is also wrong. Here are the explanations for what these are. Overfitting: we can always find a function that explains all the training points very well, or even exactly, but those functions tend to be complicated, and they tend to fit the noise as well. This is what we saw on the previous plot: the blue line goes nearly through all the data points, even though there is noise on the data points, and maybe the true function is not supposed to go through all the data points, because there is noise and we don't want to model the noise as well. Predictions on new data points tend to be poor for these kinds of overfitting functions, because the function is somewhat too complicated. Later we are going to see that overfitting is characterized by the fact that we have a low approximation error and a high estimation error; don't worry about this now, we are going to see it in the next lecture.
The opposite effect is called underfitting: here your model is too simplistic. You want to use a linear function even though your data cannot be described by a linear function. The advantage is that the estimated function tends to be very stable with respect to noise: if you add a couple of data points or wiggle them a bit, this linear function is not going to move a lot. But for unseen points, again, the predictions are going to be poor. The regime of underfitting is characterized by the fact that there is a large approximation error and a low estimation error; again, we are going to talk about this in the next lecture.
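(A minimal sketch of the two failure modes; the slide uses a degree-20 polynomial, here I use degree 14 on 15 points so the overfit interpolates the data exactly. All data and numbers are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 15)
y = x**2 + 0.05 * rng.standard_normal(x.size)   # noisy parabola

line   = np.polyfit(x, y, deg=1)    # underfitting: a straight line
wiggly = np.polyfit(x, y, deg=14)   # overfitting: threads through the noise
                                    # (numpy may warn that this fit is
                                    # poorly conditioned, which is the point)

x_new = 0.5                          # an unseen input
print("underfit prediction:", np.polyval(line,   x_new))
print("overfit prediction: ", np.polyval(wiggly, x_new))
print("true value:         ", x_new**2)
```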
At the end of this lecture, I want to show you that this notion of an inductive bias is not only used in technical systems; it is used in all systems that are supposed to learn, in particular also in animals and humans. I want to stress again that there cannot exist a learning system that does not have an inductive bias, and we as humans also need to have inductive biases, otherwise we would not be able to learn. Now, we don't want to run experiments on humans to figure that out, but there have been experiments with animals, in the 1960s, which tried to show this, and I want to explain what they are about. This is an experiment with rats. Consider a rat that has a choice between two types of water: there are two water bowls, one with normal water, and the other with water that makes the rat sick. The rat is supposed to learn to avoid the water that makes it sick and only drink from the normal water. Now, if there were nothing by which the rat could tell the two apart, no feature that shows the difference, then the rat wouldn't be able to learn that. But there is a feature: in the first experiment, the two types of water taste differently. One type of water tastes neutral, and the other one has been sweetened with sugar, so it tastes sweet. Now the rat drinks, and if it drinks from one type of water it gets sick, and if it drinks from the other type it doesn't. As has been observed in these experiments, the rats learn very, very fast to avoid the water that makes them sick: even if you put the bowls in different spots and so on, the rat will try a tiny bit of the water, and if the sweet water is the one that makes it sick, it won't drink water that tastes sweet. The rat learns this very fast. OK, so far so good, nothing really surprising. Now there was a second experiment. It's again the same setup: you have two bowls of water, they can be in different places in the cage, one of the waters makes the rat sick and the other one doesn't. But now the feature by which the rat can distinguish the two types of water is not the taste.
Both waters taste the same, but one type of water is accompanied by certain sound and light effects. Say one of the waters is in a room that has a red light, or there's a certain sound that you can hear when you're close to that water; in the paper they write about audiovisual stimuli. So you have certain sound and lighting conditions, and one type of water comes with these conditions, this particular sound, and the other type of water does not. Now again the rat is supposed to learn which type of water makes it sick and which one doesn't, and the surprising thing is: the rat cannot learn it. The rat does not learn the connection between the fact that this water makes it sick and the lighting conditions in the room. So apparently the rat does not have an inductive bias that could help it make the connection between lighting conditions in the room and water that makes it sick. If you think about whether this is plausible or not, you can of course come up with a plausible explanation in hindsight. If a rat out there in the wild needs to taste food, and the food tastes funny, then maybe the food is rotten and the rat doesn't want to eat it anymore. So connecting the taste of food with whether this food makes it sick is something very natural, and the rat has this connection sort of wired into its brain. But lighting conditions in a room typically don't have anything to do, at least in nature, with whether some food is rotten or not: you can look at food at night or during the day, in one situation it's bright and in the other it's not, but that doesn't have any influence on whether this food makes you sick or not. So, apparently, the brain of the rat is not able to make this connection. The rat has an inductive bias; it simply cannot learn this function, and there's no way it can overcome this: the bias of its brain is that it cannot learn it. This effect has been investigated a lot in psychology; it's called the Garcia effect, because it was published by a researcher called John Garcia and his co-workers in the 1960s. You can see one of the references on the slide, but there are many more references out there.
So what is now the bottom line about the inductive bias? Any successful learning algorithm has an inductive bias: we tend to select hypotheses from some restricted, smaller function spaces, because it helps us to focus on the functions that are important. Whether the function that has been learned by the algorithm is then close to the truth really depends on whether this function class was well selected for the problem at hand. We haven't really been talking about this at all, but it is obvious: if your function class contains linear functions, but the phenomenon that you're trying to model is a periodic function, then no matter what you do, it's not going to work out. The important message that I want to give you now is: for some algorithms it's sort of obvious what the inductive bias is going to be, and we are going to discuss that; for other algorithms it's not obvious, but there has to be an inductive bias. Machine learning is impossible without an inductive bias, and it is important to keep that in mind, in particular if you get funny results: maybe your inductive bias is wrong. Even if you have results and they look really good, you might want to ask yourself at some point whether the inductive bias is really the correct one or the wrong one. All these points are going to be made more precise in the rest of this machine learning lecture, so I'm hoping that you're going to stay with us.