This lecture introduces statistical machine learning by presenting motivating examples and defining machine learning as the process of automating inductive inference, emphasizing the necessity of inductive bias for any learning system to function.
good morning everybody today is the
first lecture in statistical machine
learning and we would like to start this
lecture with a couple of motivating
examples the first example I want to
talk about is handwritten digit
recognition this is one of the founding problems of machine learning and in the 1990s it was studied quite a lot under the name of pattern recognition the problem is as follows assume
you're at the postal service and you
want to deliver a letter so the letter
comes on some moving belt you have a
camera that takes a picture of your
letter and now you want to automatically
recognize the address and the zip code
of the city where the letter is supposed
to go to so the problem is you have this photograph but now it's not so easy
to give a handwritten rule that says oh look these letters are an F and this digit is a 7 and so on you
need to have some systems that are more
flexible and you don't want to hand-design this rule yourself but you want
the system to find out a rule that can
recognize these digits looking at it a
bit closer assume that we have a
photograph of a digit and like here on
the slide we have the digit 3 in this
case it's a 16 by 16 grayscale image so
each pixel in this image is one of these little squares and it
has a grayscale value it is a number
between 0 and 1 where 0 means white 1 means black and every number in between is some shade of gray and now what is it that the computer sees about this digit it doesn't see it as an image it sees it as a vector of in this case 256 numbers between 0
& 1 so here in the bottom of the slide
what we are supposed to do is now we
need to learn a function it takes as an input such a vector which has 256 entries between 0 and 1 and the output of this function is supposed to be the digit that is represented by this particular vector and this is one of the founding problems in machine learning and if you look at this particular slide you can already see that it's not
so simple we have different versions of
the digit 5 the digit 9 the digit 7 and 1 and you can already see that it's quite easy to mix up the digits and for example this 9 here looks a bit
atypical and this one might even be a 5
and here the difference between some of
these sevens and some of these ones is
also not so easy and here the idea is we
want to use machine learning to solve
this problem of handwritten digit
recognition and in one of the first
exercises in this class you're going to
solve this problem yourself by a very
simple algorithm in fact another problem
that's also quite old spam filtering in
the 1990s when emails came up and spam
maybe was not so much of a problem but
it soon became a problem everybody gets
all the spam emails and you want to
design a filter that automatically can
tell apart normal emails from spam
emails and again it might be easy to
give a couple of keywords that might
hint that a particular email is a spam email but in
general this is not so easy and
handwritten rules often don't work very
well so what all the email programs have
internally they have a so-called spam
filter the idea is you get your incoming
emails and whenever you encounter a spam
email in your inbox you press the button
spam and in the background there's a
machine learning classifier that tries
to classify emails into spam and
non-spam and whenever you press this
button this classifier is being updated
in an online fashion and hopefully in
this way the spam filter is always
up-to-date and can detect emails which
are spam emails and can separate them
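The spam-button loop just described can be sketched as a simple online update. The lecture does not name the algorithm, so the perceptron-style rule, the keyword features, and the example emails below are all illustrative assumptions:

```python
# Minimal online-learning sketch of a spam filter (illustrative only):
# each time the user flags an email, the classifier does one update.

def predict(weights, features):
    """Return +1 (spam) if the weighted keyword sum is positive, else -1."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1 if score > 0 else -1

def online_update(weights, features, label, lr=1.0):
    """Perceptron-style update: change weights only on a mistake."""
    if predict(weights, features) != label:
        for f in features:
            weights[f] = weights.get(f, 0.0) + lr * label

weights = {}
stream = [                             # invented email stream
    (["win", "money", "now"], +1),     # +1: user pressed the spam button
    (["meeting", "tomorrow"], -1),     # -1: normal email
    (["win", "prize"], +1),
]
for features, label in stream:
    online_update(weights, features, label)

print(predict(weights, ["win", "money"]))   # 1, i.e. classified as spam
```

Because the model is updated one flagged email at a time, it can keep tracking a spam sender who changes tactics, which is exactly the online aspect described here.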
this is a typical online learning
problem as opposed to the handwritten
digit problem where we train the machine
learning classifier once and forever and
then hopefully it can classify all the digits here in digit recognition nothing really changes digits are there people have their handwriting but there is
no evolution over time whereas in spam
filtering of course it's a game between
yourself and your opponent who is the
person who wants to send the spam emails
so whenever you have updated your spam
filter the person is trying to invent some new spam email and this keeps going on over time so in an online fashion
you want to solve the machine learning
problem always with the most up-to-date
tools and this is called an online
machine learning problem a very
important machine learning application
is object detection imagine self-driving
cars here you have a scene from some
from some road and traffic and you want
to recognize like the self-driving car
is supposed to recognize that there are
pedestrians there are other cars there might be a traffic light or some other traffic like a cyclist and so on
and so the problem of object detection
is given a complex image like a scene
like this one you want to recognize what
is on the scene it is a more general
version of handwritten digit recognition
but with much more complicated scenes and many many more types of
objects this problem was one of the
important problems that needed to be
solved for self-driving cars and
self-driving cars are one of the big one
of the big applications of machine
learning out there
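Returning to the handwritten-digit setup from the start of the lecture: a minimal nearest-neighbour sketch over grey-value vectors. The lecture doesn't specify which "very simple algorithm" the exercise uses, and the tiny prototype "images" here are invented:

```python
# Nearest-neighbour digit recognition sketch: an image is a vector of
# 256 grey values in [0, 1] (0 = white, 1 = black); a new image gets the
# label of the closest training image. Training data is invented.
import math

def distance(a, b):
    """Euclidean distance between two grey-value vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour(train, image):
    """train is a list of (vector, digit) pairs; return the nearest digit."""
    return min(train, key=lambda pair: distance(pair[0], image))[1]

# Toy 256-dimensional "scans" standing in for 16x16 images.
three = [1.0] * 128 + [0.0] * 128
seven = [0.0] * 128 + [1.0] * 128
train = [(three, 3), (seven, 7)]

# A noisy new image that resembles the "3" prototype.
query = [0.9] * 128 + [0.1] * 128
print(nearest_neighbour(train, query))   # 3
```

This also illustrates the assumption, discussed later in the lecture, that similar inputs should lead to similar outputs.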
people have been trying to build self-driving cars for many many years and the first very
important breakthrough happened in 2005
when there was a race where cars were
supposed to drive autonomously through a
desert for a hundred kilometers and they
all started at the same point and just got the GPS coordinates of the point where they were supposed to go and
for the first time in 2005 one single or
a couple of cars managed to solve this
challenge and from then on self-driving
cars really became much
more prominent and have been developed
and are now about to be rolled out for example in some cities in the US you can already try self-driving cars in Germany they are not yet out there there are some technical problems but there are
also some problems that come from the
law and from from responsibility one big
application area of machine learning
already since quite some years is the
field of bioinformatics there are many
many machine learning algorithms that
are applied to a wide variety of
problems for example one of the starting problems was to detect different types of diseases from microarray data so I'm not a biologist but my understanding of microarray data is you have a certain cell and its proteins might be active or not and you have some
kind of lab experiment that can measure
whether a certain protein is active or
not and in this little image that you
can see on this slide each of the green or red dots stands for whether this particular protein is active or
is not active now you want to classify
different types of cancer cells for
example based on this pattern so it's a
bit like in handwritten digit
recognition you have a matrix consisting
of zeros and ones say green and red dots
and you want to say which of these
patterns belongs to a particular disease
because a cell behaves in a certain way
another application is drug discovery
where you want to say you have a disease
and you want to design a drug to be able to do that you have this protein maybe of the virus and you want to knock it out so what you
might want to do is you want to find a
small molecule that can bind to this
protein and then do certain things to
the protein so you need to first find a
molecule that can bind to the protein
and here on this slide you see an
example a protein has a very
complicated three-dimensional structure
and it has these little pockets in a
three-dimensional structure and now you
need to find a molecule that exactly can
bind inside such a pocket and again this
would be very expensive to try in a lab
you have like thousands of different molecules that might be working but you don't want to run a lab experiment for all of them so you might want to pre-screen the different molecules and in order to do that
you use machine learning again you have a certain description of these pockets it says how large is the pocket what are maybe the molecules that sit at the side of the pockets what are the binding
energies of all these molecules and
based on this description you want to
predict whether a certain molecule is
now going to fit into this pocket or not
this is again a classification problem
that you might be able to solve with
machine learning here just for the people who work in bioinformatics I have one slide that shows all the different fields in which machine learning is used in bioinformatics if you want you can look at it at home going from bioinformatics
more towards medical applications one of
the very big fields also very prominent
currently in machine learning is
applications in medicine for example in
personalized medicine you want to tailor different therapies to the genetic disposition of particular people or here I have an example of skin cancer detection it's again some kind of object detection problem the idea is you as a
person you think you have a funny piece
of color at your skin and you wonder
whether it has to do with skin cancer or not so what you do is you take your smartphone you take a
photograph of the skin and then you use
an automatic classifier that might say
oh this is very harmful or this might be harmful or this is not harmful
and then depending on the outcome you
can start consulting a doctor and the
impressive thing is that these systems
by now are at least at the accuracy of
medical experts who have really been
trained for years to detect different
types of skin cancer so machine learning
in this particular application
is really a very powerful tool that can
support doctors who then can focus more
there are many many more applications in science and here I just want to outline one which is a bit funny it's in archaeology and you would think that archaeology is maybe the last field where machine learning could be of an advantage but here is a nice paper that was published in 2019 so last year in Nature
analyzing the human genome from ancient
findings and they tried to reconstruct the development tree in which different kinds of humans have been developing and they found that there must be an additional branch in this tree that has not been discovered yet so we have not found any bone of this particular branch in the human development but it must be there because otherwise you couldn't explain the data that you currently have if you don't assume that such a branch exists and this is I think a cool
application because it shows that
machine learning not only can solve very
specific classification problems but it can really discover things that you didn't know before
one of the fields where machine learning
is very powerful nowadays is language
processing one first breakthrough was in 2011 when the computer Watson won Jeopardy a quiz show in the US it is a bit like the German Wer wird Millionär so there are questions that are asked and then the contestants in this case a computer are supposed to answer and the
interesting thing here is that these
questions are more like word games it's
not so much about who won the soccer championship in 1955 or so it's more like kind of word games and the surprising thing was that this computer Watson was able to beat the best Jeopardy players at that time by now
language processing is very very
prominent you have Siri on your phone or Alexa or you can also try automatic translation systems if you haven't seen that before DeepL is one of my favorite translation services you paste in an English sentence and it spits out a perfect German sentence or the other way around and this is really impressive and this
wouldn't have been possible a couple of
years ago one last thing I want to mention is AlphaGo many of you might have heard about it so chess is an old game which has been solved by computers already in 1996 at that time there was a computer which was able to beat the world champion in chess at that time Garry Kasparov
however at that time they didn't use any
machine learning for this essentially
what they did is a very clever search
procedure combined with a very very
powerful computer so essentially at this time in 1996 for chess they managed to look ahead a couple of steps and evaluate all the different possibilities and different directions the opponent might have and in this fashion managed to beat the best chess player who might not have such a huge computational power to look ahead for say five steps now it's a very
different story with AlphaGo that was in 2016 when DeepMind managed to program a Go playing machine purely using machine learning and that was really a big breakthrough at the time sorry I don't
have a slide for this so what happened is essentially they used neural networks to sort of represent the situation on the board and then they first fed the neural network with games that have been played by experts to try to train it to do the same kind of moves that experts have done and then in the next step they let different systems of AlphaGo play against each other in order to improve and improve and improve and in the end they managed to beat the world champion at
the time so now we've seen many examples
where machine learning plays an
important role but now what is machine
learning how can you define it is there
a definition at all or how could you
explain what happens in the background
of course I mean we're going to spend a
whole semester trying to discover it but
let's try to start with a couple of
if you look at what is in Wikipedia or in many of these online blogs that try to explain machine learning you will
find something along those lines machine
learning is the development of
algorithms which allow a computer to
learn specific tasks from training
examples and there are a couple of words
that are really important here the first
one is specific tasks machine learning
is not or at least in my opinion it's
not about building general artificial
intelligence so you don't want to build
an agent that is like a robot that is
really intelligent as a human what we
try to do in machine learning is to
build algorithms that can solve very
specific tasks it could be skin
detection skin cancer detection or it
could be language translation what could
be to play alphago but you're not or at
least currently we are not trying to
build an agent that can do all these
tasks at the same time but whenever you want your algorithm to do a new task you need to train it and for this training typically you need training examples so you need examples of the task that the computer is supposed to learn for example in skin cancer detection you need images of different pieces of skin and then you need to have the label which says this is skin cancer and this
is normal skin now the next point is learning means that the computer can not only memorize the seen examples but can generalize to previously unseen instances of course there would be no point in skin cancer detection if you could only show the computer the piece of skin that you already know what you want
to do is you want to have these training
examples to train the computer and then
later on you want to have a new patient
and this new person is going to come in
and you want to say for this new person
whether the person has skin cancer or
not and this is what we call
generalization so we train on a couple
of instances but then this rule that we
are going to find is supposed to
generalize to new instances of the same problem
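The contrast between memorizing and generalizing can be made concrete on a toy task (the task and examples are invented, not from the lecture):

```python
# Memorizing vs. generalizing on a toy task: is a number even or odd?
train = {2: "even", 4: "even", 7: "odd", 9: "odd"}

def memorizer(x):
    """Only stores the training examples; useless on unseen inputs."""
    return train.get(x, "unknown")

def general_rule(x):
    """A rule extracted from the examples; works on unseen inputs too."""
    return "even" if x % 2 == 0 else "odd"

print(memorizer(10))      # unknown -- no generalization
print(general_rule(10))   # even
```

Only the extracted rule can say anything about the new input 10, which is what generalization means here.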
ideally the computer should use the
examples to extract a general rule how
the specific task has to be performed
correctly so what happens in the
background or what is supposed to happen
is the computer takes its training examples it has some mechanism by which it can generate a general rule and
we are going to talk about many of these
mechanisms in the lecture and then
hopefully there is a new function that
comes out that is able to solve this
task in a very general way so now on a
high level this is what machine learning
is about of course this doesn't help you
very much at the current time but we are
going to see many examples in the
lecture however I still want to show you yet another explanation and this is one I like a bit more to be able to explain what I want to talk about we first need to figure out what deduction and induction are now what you're going to
see from time to time I have questions
on my slides and if this would be a
normal lecture where people would sit in the audience I would now ask you this question the questions are always in bold font or in capital letters now as you are watching this video at home I suggest whenever such a
question comes up you take a bit of time
you stop the video you think about the
question and then you proceed because
this would also be the same way we would
do it in a lecture and these questions
often help you to recap certain things
or to think about certain aspects of
what we are currently talking about so
at this point I would like to ask you
whether you know what deduction and
induction is and maybe you might want to
think about it for a minute and then
so here's the answer deduction or
deductive inference is the process of
reasoning from one or more general
statements premises to reach a logically
certain conclusion essentially this is
what is happening in math you say here is statement one and here is statement two and if these statements are true then I can make a certain conclusion from these statements and
here I have an example premise one every
person in this room is a student premise
two every student is older than 10 years
the conclusion is now every person in
this room is older than 10 years so the
important point is if the premises are correct then the conclusions are correct you come to the conclusions by the rules of logic and you can always be certain if the premises are correct then your conclusion is correct as well this is a
very very nice framework of course and
all of logic is built on this all of
mathematics is built on this however the
big problem in this kind of thinking for machine learning is this term if the premises are correct so typically you can never be certain about many things
there's always an uncertainty attached
to it and whenever a statement is not completely sure then this kind of reasoning doesn't apply anymore and so this is why deduction is not very well suited to machine learning tasks we use different mechanisms the other principle that is sort of the opposite of
deduction is induction inductive
inference is some kind of reasoning that
constructs or evaluates general
propositions that are derived from
specific examples so induction is what
we often do in science we observe many
things and we see some kind of pattern
and then we make a hypothesis and think
oh this pattern this is what is always
going to happen and then we have a hypothesis and then we keep on testing this hypothesis whether it's true or false this process is induction
and here's an example if you are a kid
or you have a kid maybe that's more
closer to what is going to happen soon
so say you have a kid and what you're
going to see is when the kid is one or
two years old it keeps on dropping stuff so it takes something it drops it it takes another thing it drops it and it gets busy with this process for half a year or a year and the kid is always astonished that the thing in the end is on the floor and eventually the kid is going to learn
that whenever it drops things these
things are going to fall to the floor and not to the ceiling and this is a process of inductive inference
you have this experiment you keep on
dropping stuff you observe that it
always falls down and then your
conclusion is that whenever you drop
stuff it is going to fall down and this
is inductive inference the important
thing is you can never really be sure
that your conclusion is correct and this
applies to all of science and there is a
lot of interesting philosophy of science
that tries to explain what does it mean
how can we learn something at all how
can we explain something and so on
because we cannot really be certain
about it humans do inductive reasoning all the time essentially all our life is coming up with good models of the world and performing induction here's one more example say you come late to every lecture by 10 minutes so I start the lecture and after 10 minutes you enter the room
the first couple of lectures I don't
really complain so you conclude well maybe she doesn't really care whether I'm late or not but you cannot be sure maybe at lecture 10 I really get annoyed and then there is something happening that you didn't expect so here is a situation of uncertainty in your reasoning so we
cannot be sure about the conclusions
that we make now why am I telling you
all of this here is now the second
motivation for what machine learning is
machine learning
tries to automate the
process of inductive inference and I
find this a very powerful explanation of
what machine learning is inductive
inference means we look at training data
for example because we always drop
things we have training data and we
build up some hypotheses and this is exactly what machine learning is supposed to do we give some training examples to the computer and the computer is then supposed to learn a general rule to come
up with a hypothesis how it could
explain future events or future examples
of the same process and the idea is that
machine learning is supposed to automate
this process so we want maybe to give
some basic framework but then the
algorithm is supposed to come up with
this rule in an automatic fashion and
this is an explanation of machine
learning that's very general of course but I think it really explains what is going on
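One way to picture "automating inductive inference" is as a search through a hypothesis class for a rule consistent with the training examples. The threshold-rule class below is an invented minimal illustration, not an algorithm from the lecture:

```python
# Inductive inference as hypothesis search (illustrative sketch):
# the hypothesis class is "x <= t maps to -1, x > t maps to +1",
# and learning means finding a t consistent with the examples.

def consistent_threshold(examples):
    """Return a threshold t explaining all (x, y) examples, or None."""
    for t in sorted(x for x, _ in examples):
        if all((1 if x > t else -1) == y for x, y in examples):
            return t
    return None

train = [(0.1, -1), (0.3, -1), (0.7, +1), (0.9, +1)]   # invented data
t = consistent_threshold(train)
print(t)                        # 0.3: splits the two classes
print(1 if 0.8 > t else -1)     # generalizes to the unseen input 0.8
```

The computer, not the programmer, picks the rule; the programmer only fixes the class of rules to search, which foreshadows the inductive bias discussed below.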
now I would like to discuss a bit why this can work at all or whether it can work I mean you
see examples that it works so probably
it can work but there might be some
assumptions that we need to make and to
do this I want to consider a particular
example so here we have a particular
regression example so what we are given is pairs of input points and output values (x_i, y_i) so x is always the input point and y is the output value that we're supposed to predict you see a plot of some data take this very intuitively now we are going to make this
much more formal later on but for now
it's really about intuition look at the
data that is at the bottom of the slide
so we have four data points marked by
the crosses so you always see the x
value on the x axis and the y value on
the y axis and what we want to do is we
want to learn a general function that
can predict the Y values from the x
values so in fact we want to learn a function f that goes from the space curly X which is the space of all input points to the space curly Y which is the space of all output points and now if I would ask you
in a lecture what do you think is the
value that you would predict if the
input value would be 0.4 so you might
want to look at this plot and think
about it a bit for yourself but I'm sure
the answer that most of you will come up with is the following so here on the x-axis you have this scale that goes from 0 to 1 here we have zero point four this is the point I'm interested in now what would probably be the output at this point well it's going to be roughly here and if we now assume this is sort of a straight line the output at this point might also be zero point four
this is a straightforward kind of
conclusion that you could draw from
these data points now you could also
come up with other conclusions and here
are two examples so the first guess this is the one I've just explained to you is that these data points have been generated by a linear function this kind of red line the red line is sort of a good fit to your existing data then you can use this red line to predict a new value for the point you're interested in 0.4
it could also be the case what you see here on the right hand side maybe for some reason you don't think it's a linear function you come up with this kind of wavy function which sort of goes up and down and up and down and like the first red function it also fits your existing data very well but now if you would use this function to predict you would get a different prediction so the prediction for 0.4 would maybe be 0.8 as your output value and
the question is now which of these two
predictions is better or which of these
two red curves is more plausible and
this is now one of these points I would
like you to stop the video for a moment
and come up with it with arguments for
why the first one might be better or
maybe also why the second one might be
better what are the differences and what
might be criteria along which we could
okay I hope that you have come up with a couple of ideas why each of these functions could be better typically the answers that I get in these lectures if there is a real audience in front of me are that many people would say well guess one is better
because it's a simpler function and
there's no reason if you just see the
data that we would need to fit it by
such a complicated function as on the
right hand side and so we would prefer
the drawing on the left hand side some
people would also use the word Occam's
razor because people have heard about
this before and would say Occam's razor
says you should always prefer the
simpler solution that can explain your
data and they would say this is a reason
why we should prefer the left hand side
all these things are correct up to a certain point but we will see later in this lecture that it's not always so simple there are more twists to this explanation and then there are also people who tend to argue for the right hand side they say well maybe
we have some background knowledge and we
know it's a physical phenomenon and this
phenomenon is not a linear phenomena but
it has is something that goes up and down
down
maybe it's the temperature at different times of day and these points have been recorded at night but the daytime values are missing and zero point four is at daytime and typically temperature goes up and down during day and night so with this background knowledge maybe guess two would be better and might lead to a better prediction than
guess one so the bottom line here is if you don't have any extra
knowledge about your data there is no
way in which you can decide about which
of these things is really better you
need to have extra knowledge or you need
to make assumptions one such assumption
could be that the function should be
simple and then you would go to the left
hand side or the assumption could be
you're trying to fit a periodic function
and then you go to the right hand side
however you cannot make a prediction if you do not make any assumption or bring in some kind of bias about which direction to look in
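The two guesses can be made concrete. Below, a linear and a periodic function (both invented to mirror the slide, with the data placed on the line y = x) fit the same four training points exactly yet disagree at the query x = 0.4, which is the whole point: without an assumption, neither prediction can be preferred:

```python
# Two hypotheses that agree on the training data but disagree at x = 0.4.
import math

xs = [0.0, 0.25, 0.5, 0.75]    # invented training inputs
ys = xs[:]                      # outputs lie on the line y = x

def linear(x):                  # guess 1: a straight line
    return x

def periodic(x):                # guess 2: wavy, but identical on the data
    return x - 0.4 * math.sin(4 * math.pi * x)

for x, y in zip(xs, ys):        # both fit the four points exactly
    assert abs(linear(x) - y) < 1e-9
    assert abs(periodic(x) - y) < 1e-9

print(round(linear(0.4), 2), round(periodic(0.4), 2))   # 0.4 0.78
```

The data alone cannot distinguish the two; only an assumption (simplicity, or periodicity from background knowledge) picks one.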
so here's one more aspect now assume that I tell you that the function values have been generated randomly and I keep on generating random data these are simply random points in the unit square so uniformly distributed and you observe all these data points so you now see we've been drawing many more red points and now you can't see any pattern anymore and if I now would ask you what is your prediction at point zero point four you would probably say well in fact I don't know it could
be anything between zero and one and I
have no particular reason that it should
be zero point four for example it could
be anything else as well so here's the insight if you don't have any pattern that sort of connects the input to the output value you won't be able to predict anything and I would
like to summarize this discussion now
the first consequence that we need to
take away from this discussion is we
will only be able to learn if there is
something we can learn in our data and this there is something in the data sounds very trivial but in practice
this is often not so obvious so if you have certain input data in say medicine and you want to predict a certain output say a particular type of disease and your input data is the temperature of the person and what the person has been eating during the last days and the age of the person and the shoe size
maybe this data is not enough to predict
this particular disease in this case
there is no connection between the input
data and the output data and you can try
whatever you want your machine learning
algorithm is never going to succeed and
this is something really important to
keep in mind when doing machine learning
it sounds trivial but in practice you might stumble into this problem very often so the first thing is the output needs to have something to do with the input and often a kind of bias or
assumption we make is that similar input
points would lead to similar output
values so again if you have certain
patients and you want to predict the disease and you have very similar patients the intuition would say well these patients behave so similarly so probably they have the same disease and this is the kind of inherent rule that governs many many of the machine learning algorithms of course this is very abstract but still this is what is in the background of machine learning in many applications then the next thing is
there needs to be a simple relationship
or a simple rule that can predict the
output from the input if this function is so complicated that say your function is a fractal and you are supposed to learn this fractal from ten data points it is very unlikely that you're going to succeed so the function needs to be reasonably simple in order
for you to succeed the more training data you have the more complicated functions you will be able to afford but there needs to be some relationship you can't learn the most complicated function from just three data points unless you make very very strong assumptions the last point is where we tend to look for a function that is
simple in some aspects we need to be a
bit careful with this notion and we are
going to see later in statistical learning theory what simple really means this is sort of Occam's razor but it's not just Occam's razor there are more aspects to this but we are going to discuss it towards the end of the lecture when we've seen statistical learning theory what is now important is these assumptions that
we have on this slide they are rarely
made explicit so people run machine
learning algorithms and they press many buttons and they try it out and they look at training and test error and so
on however you need to be aware that
these assumptions are always made in
machine learning but often it's a bit
unclear what really are the specific
assumptions that a certain algorithm
makes so always keep that in the back of your mind when machine learning is being applied there are assumptions that are going to be brought into machine learning if these assumptions are wrong it's very likely that the function that you learn is also wrong and you might want to be aware of what the assumptions are that really go into your particular application
the second consequence we said we are
going to look for a simple function and
so on but the more important thing is we
need to have an idea what we are looking
for and this idea of what we're looking
for is called the inductive bias of a
machine learning system as in the
previous example we need to say in
advance are we looking for a linear
function or are we looking for a
periodic function and this is sort of
our inherent knowledge on the data on
the phenomenon that we are trying to
model and this is called the inductive
bias and I want to give you a bit of
intuition what this really means I now
want to show you a simple example for
what this inductive bias means and why
we really need it and for this let me
simply draw an example so what we're
going to look at is a space that is just
one-dimensional so we have points
between 0 & 1 and the space consists of
a grid with points 0.01 0.02 and so on so these are our input points and now the output space is either plus 1 or minus 1 so our training data could be maybe I draw the training data now in red so we could have one training point here this is 1 and so here at this particular input point our output if you make a y-axis here is 1 and maybe we have another point here and here the output is minus 1 say okay so I now name the outputs 1 and minus 1 even if the slide writes them differently so don't worry it's just we have two different classes here say plus 1 and minus 1 and now assume we have seen a
couple of training points so we have
seen these two red points and then we
have a couple of more points here and
this is our training data and the idea
is now we want to learn a function that
is going to predict for the remaining
data points what is going to be the
output value is it minus 1 or plus 1 so what we want to do is we want to learn a function that goes from the input space X to the output space Y and now we start with different