Statistical Machine Learning Part 1 - Machine learning and inductive bias | Tübingen Machine Learning
Summary
Core Theme
This lecture introduces statistical machine learning by presenting motivating examples and defining machine learning as the process of automating inductive inference, emphasizing the necessity of inductive bias for any learning system to function.
Transcript
Good morning everybody. Today is the first lecture in statistical machine learning, and we would like to start with a couple of motivating examples. The first example I want to talk about is handwritten digit recognition. This is one of the founding problems of machine learning, and in the 1990s it was studied quite a lot under the name of pattern recognition. The problem is as follows: assume you're at the postal service and you want to deliver a letter. The letter comes on some moving belt, you have a camera that takes a picture of the letter, and now you want to automatically recognize the address and the zip code of the city where the letter is supposed to go. The problem is that, given this photograph, it's not so easy to write down a handcrafted rule that says: look, these letters are an F, and this digit is a 7, and so on. You need systems that are more flexible: you don't want to design the rule by hand yourself, you want the system to find a rule that can recognize these digits.
Looking at it a bit closer, assume that we have a photograph of a digit; here on the slide we have the digit 3. In this case it's a 16 by 16 grayscale image, so each pixel in this image, each of these little squares, has a grayscale value: a number between 0 and 1, where 0 means white, 1 means black, and every number in between is some shade of gray. Now, what is it that the computer sees about this digit? It doesn't see it as an image; it sees it as a vector of, in this case, 256 numbers between 0 and 1. As shown at the bottom of the slide, what we are supposed to do is learn a function: it takes as input such a vector with 256 entries between 0 and 1, and the output of this function is supposed to be the digit represented by this particular vector. This is one of the founding problems of machine learning, and if you look at this particular slide you can already see that it's not so simple. We have different versions of the digits 5, 9, 7, and 1, and you can already see that it's quite easy to mix up the digits: this 9 here looks a bit atypical, this one might even be a 5, and the difference between some of these sevens and some of these ones is also not so easy. The idea is to use machine learning to solve this problem of handwritten digit recognition, and in one of the first exercises in this class you're going to solve this problem yourself with a very simple algorithm.
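(To make this concrete, here is a minimal sketch of one such simple algorithm: a nearest-neighbor classifier on the raw 256-dimensional pixel vectors. The lecture does not say which algorithm the exercise uses, so the approach and all names below are illustrative assumptions.)

```python
import numpy as np

def nearest_neighbor_predict(train_X, train_y, test_X):
    """Classify each test digit by the label of its closest training digit.

    train_X: (n, 256) array of grayscale vectors with entries in [0, 1]
    train_y: (n,) array of digit labels 0..9
    test_X:  (m, 256) array of test vectors
    """
    preds = np.empty(len(test_X), dtype=train_y.dtype)
    for i, x in enumerate(test_X):
        # Euclidean distance from x to every training image
        dists = np.linalg.norm(train_X - x, axis=1)
        preds[i] = train_y[np.argmin(dists)]
    return preds

# Toy usage with random "images"; real data would be 16x16 scans flattened to 256 values.
rng = np.random.default_rng(0)
train_X = rng.random((100, 256))
train_y = rng.integers(0, 10, size=100)
test_X = rng.random((5, 256))
print(nearest_neighbor_predict(train_X, train_y, test_X))
```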
Another problem that's also quite old is spam filtering. In the 1990s, when email came up, spam maybe was not so much of a problem, but it soon became one: everybody gets spam emails, and you want to design a filter that can automatically tell normal emails apart from spam emails. Again, it might be easy to give a couple of keywords that hint that a particular email is spam, but in general this is not so easy, and handwritten rules often don't work very well. So what all the email programs have internally is a so-called spam filter. The idea is: you get your incoming emails, and whenever you encounter a spam email in your inbox you press the button "spam". In the background there's a machine learning classifier that tries to classify emails into spam and non-spam, and whenever you press this button the classifier is updated in an online fashion. Hopefully, in this way the spam filter is always up to date and can detect and separate the emails which are spam. This is a typical online learning problem, as opposed to the handwritten digit problem, where we train the machine learning classifier once and forever and then hopefully it can classify all the digits. In digit recognition nothing really changes: the digits are there, people have their handwriting, but there is no evolution over time. In spam filtering, by contrast, it's a game between yourself and your opponent, the person who wants to send the spam emails: whenever you have updated your spam filter, that person is trying to invent some new spam emails, and this keeps going on over time. So you want to solve the machine learning problem in an online fashion, always with the most up-to-date tools, and this is called an online machine learning problem.
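(As an illustration of such an online update, here is a minimal sketch of a perceptron-style filter whose weights change each time the user labels an email; the feature encoding and all names are my assumptions, not something specified in the lecture.)

```python
import numpy as np

class OnlineSpamFilter:
    """Tiny linear classifier updated one email at a time (perceptron rule)."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        # +1 = spam, -1 = not spam
        return 1 if x @ self.w + self.b > 0 else -1

    def update(self, x, y):
        # Called whenever the user labels an email (e.g., presses "spam").
        if self.predict(x) != y:
            self.w += self.lr * y * x
            self.b += self.lr * y

# Usage with hypothetical bag-of-words features:
filt = OnlineSpamFilter(n_features=3)
email = np.array([1.0, 0.0, 2.0])  # e.g., counts of three keywords
filt.update(email, y=1)            # user marked this email as spam
print(filt.predict(email))
```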
A very important machine learning application is object detection. Imagine self-driving cars: here you have a scene from some road with traffic, and the self-driving car is supposed to recognize that there are pedestrians, there are other cars, there might be a traffic light, other traffic, a cyclist, and so on. So the problem of object detection is: given a complex image, a scene like this one, you want to recognize what is in the scene. It is a more general version of handwritten digit recognition, but with much more complicated scenes and many, many more types of objects. This problem was one of the important problems that needed to be solved for self-driving cars, and self-driving cars are one of the big applications of machine learning out there. People have been trying to build self-driving cars for many, many years, and the first very important breakthrough happened in 2005, when there was a race in which cars were supposed to drive autonomously through a desert for a hundred kilometers. They all started at the same point and just got the GPS coordinates of the point where they were supposed to go, and for the first time, in 2005, a couple of cars managed to solve this challenge. From then on, self-driving cars became much more prominent and have been developed further, and they are now about to be rolled out: in some cities in the US you can already try self-driving cars. In Germany they are not yet out there; there are some technical problems, but there are also some problems that come from the law and from responsibility.
One big application area of machine learning, already for quite a few years now, is the field of bioinformatics. There are many, many machine learning algorithms that are applied to a wide variety of problems. For example, one of the starting problems was to detect different types of diseases from microarray data. I'm not a biologist, but my understanding of microarray data is: you have a certain cell, there are proteins that might be active or not, and there is some kind of lab experiment that can measure whether a certain protein is active or not. In the little image on this slide, each of the green or red dots stands for a particular protein being active or not active. Now you want to classify different types of cancer cells, for example, based on this pattern. So it's a bit like handwritten digit recognition: you have a matrix consisting of zeros and ones, say green and red dots, and you want to say which of these patterns belongs to a particular disease, because the cell behaves in a certain way. Another application is drug discovery. Say you have a disease and you want to design a drug. To be able to do that, you have some protein, maybe of the virus, and you want to knock it out. So you want to find a small molecule that can bind to this protein and then do certain things to it. You need to first find a molecule that can bind to the protein, and here on this slide you see an example: the protein has a very complicated three-dimensional structure with little pockets, and you need to find a molecule that can bind exactly inside such a pocket. Again, this would be very expensive to try in a lab: you have thousands of different molecules that might work, but you don't want to run a lab experiment for all of them. So you might want to pre-screen the different molecules, and to do that you use machine learning. Again, you have a certain description of these pockets: how large is the pocket, what are maybe the molecules that sit at the side of the pocket, what are the binding energies of all these molecules. Based on this description, you want to predict whether a certain molecule is going to fit into this pocket or not. This is again a classification problem that you might be able to solve with machine learning. Just for the people who work in bioinformatics, I have one slide that shows all the different fields in which machine learning is used in bioinformatics; if you want, you can look at it at home.
Going from bioinformatics more towards medical applications: one of the very big fields, also very prominent currently in machine learning, is applications in medicine. For example, in personalized medicine you want to tailor different therapies to the genetic disposition of particular people. Or here I have an example of skin cancer detection, which is again some kind of object detection problem. The idea is: you as a person think you have an odd patch of color on your skin, and you wonder whether it has to do with skin cancer or not. So you take your smartphone, you take a photograph of the skin, and then you use an automatic classifier that might say: this is very harmful, or this might be harmful, or this is not harmful. Depending on the outcome, you can start consulting a doctor. The impressive thing is that these systems by now are at least at the accuracy of medical experts who have been trained for years to detect different types of skin cancer. So in this particular application, machine learning is really a very powerful tool that can support doctors, who can then focus on other things.
There are many, many more applications in science, and here I just want to outline one which is a bit funny: it's in archaeology. You would think that archaeology is maybe the last field where machine learning could be of an advantage, but here is a nice paper, published in 2019, so last year, in Nature Communications, where people have been analyzing the human genome from ancient findings. They tried to reconstruct the development tree by which different kinds of humans have been developing, and they found evidence that there must be an additional branch in this tree that has not been discovered yet. So we have not found any bone of this particular branch of human development, but it must be there, because otherwise you couldn't explain the data that you currently have if you don't assume that such a branch exists. I think this is a cool application, because it shows that machine learning not only can solve very specific classification problems, but it can really discover things that you didn't know before.
One of the fields where machine learning is very powerful nowadays is language processing. A first breakthrough was in 2011, when the computer Watson won a quiz show in the US called Jeopardy; it is a bit like the German "Who Wants to Be a Millionaire". Questions are asked, and then the contestants, in this case a computer, are supposed to answer. The interesting thing here is that these questions are more like word games; it's not so much about who won the soccer championship in 1955 or so, it's more a kind of word game, and the surprising thing was that this computer Watson was able to beat the best Jeopardy players at that time. By now, language processing is very, very prominent: you have Siri on your phone, or Alexa, and you can also try automatic translation systems. If you haven't seen it before, DeepL is one of my favorite translation services: you paste in an English sentence and it spits out the perfect German sentence, or the other way around. This is really impressive, and it wouldn't have been possible a couple of years ago.
One last thing I want to mention is AlphaGo; many of you might have heard about it. Chess is an old game which was mastered by computers already in 1996: at that time there was a computer which was able to beat the world champion in chess at that time, Garry Kasparov. However, at that time they didn't use any machine learning for this; essentially what they did was a very clever search procedure combined with a very, very powerful computer. So essentially, at that time, 1996, for chess, they managed to look ahead a couple of steps and evaluate all the different possibilities and directions the opponent might take, and in this fashion managed to beat the best chess player, who does not have such huge computational power to look ahead for, say, five steps. Now it's a very different story with AlphaGo. With AlphaGo, in 2016, DeepMind managed to program a Go-playing machine purely using machine learning, and that was really a big breakthrough at the time; sorry, I don't have a slide for this. What happened is essentially that they used neural networks to represent the situation on the board. They first fed the neural network with games that had been played by experts, to train it to make the same kind of moves that experts have made, and then, in the next step, they let two different versions of AlphaGo play against each other in order to improve and improve and improve. In the end they managed to beat the world champion at the time.
So now we've seen many examples where machine learning plays an important role. But what is machine learning? How can you define it? Is there a definition at all, and how could you explain what happens in the background? Of course, we're going to spend a whole semester trying to discover this, but let's try to start with a couple of definitions. If you look at what Wikipedia or many online blogs say when they try to explain machine learning, you will find something along these lines: machine learning is the development of algorithms which allow a computer to learn specific tasks from training examples. There are a couple of words that are really important here. The first one is "specific tasks". Machine learning is not, or at least in my opinion is not, about building general artificial intelligence. You don't want to build an agent, like a robot, that is as intelligent as a human. What we try to do in machine learning is build algorithms that can solve very specific tasks: it could be skin cancer detection, it could be language translation, or it could be playing Go. But we are not, at least currently, trying to build an agent that can do all these tasks at the same time. Whenever you want your algorithm to do a new task, you need to train it, and for this training you typically need training examples: examples of the task that the computer is supposed to learn. For example, in skin cancer detection you need images of different pieces of skin, and then you need the label which says "this is skin cancer" and "this is normal skin". The next point is "learning": it means that the computer does not only memorize the seen examples but can generalize to previously unseen instances. Of course, there would be no point in skin cancer detection if you could only show the computer the pieces of skin that you already know. What you want is to use these training examples to train the computer, and then later on a new patient comes in, and you want to say for this new person whether they have skin cancer or not. This is what we call generalization: we train on a couple of instances, but the rule that we find is supposed to generalize to new instances of the same problem. Ideally, the computer should use the examples to extract a general rule for how the specific task has to be performed correctly. So what happens in the background, or what is supposed to happen, is this: the computer takes its training examples, it has some mechanism by which it can generate a general rule (we are going to talk about many of these mechanisms in the lecture), and then hopefully a function comes out that is able to solve this task in a very general way.
On a high level, this is what machine learning is about. Of course, this doesn't help you very much at the moment, but we are going to see many examples in the lecture. However, I still want to show you yet another explanation, one I like a bit more. To be able to explain what I mean, we first need to figure out what deduction and induction are. As you're going to see, from time to time I have questions on my slides; if this were a normal lecture with people sitting in the audience, I would now ask you this question. The questions are always in bold font or in capital letters. As you are watching this video at home, I suggest that whenever such a question comes up, you take a bit of time: you stop the video, you think about the question, and then you proceed, because this is also the way we would do it in a lecture. These questions often help you to recap certain things or to think about certain aspects of what we are currently talking about. So at this point I would like to ask you whether you know what deduction and induction are; maybe you want to think about it for a minute before you continue.
So here's the answer. Deduction, or deductive inference, is the process of reasoning from one or more general statements (premises) to reach a logically certain conclusion. Essentially this is what happens in math: you say here is statement one and here is statement two, and if these statements are true, then I can draw a certain conclusion from them. Here is an example. Premise one: every person in this room is a student. Premise two: every student is older than 10 years. The conclusion is: every person in this room is older than 10 years. The important point is: if the premises are correct, then the conclusions are correct. You arrive at the conclusions by the rules of logic, and you can always be certain that if the premises are correct, then your conclusion is correct as well. This is a very, very nice framework, of course; all of logic is built on this, all of mathematics is built on this. However, the big problem with this kind of thinking, for machine learning, is the clause "if the premises are correct". Typically you can never be certain about many things; there's always an uncertainty attached to them, and whenever a statement is not completely sure, this kind of reasoning doesn't apply anymore. This is why deduction is not very well suited to machine learning tasks; we use different mechanisms.
The other principle, which is sort of the opposite of deduction, is induction. Inductive inference is a kind of reasoning that constructs or evaluates general propositions derived from specific examples. Induction is what we often do in science: we observe many things, we see some kind of pattern, and then we make a hypothesis and think "this pattern is what is always going to happen", and then we keep on testing this hypothesis, whether it's true or false. This process is induction. Here's an example. If you are a kid, or you have a kid (maybe that's closer to what is going to happen to you soon): say you have a kid, and what you're going to see when the kid is one or two years old is that it keeps on dropping stuff. It takes something, it drops it, it takes another thing, it drops it, and it keeps busy with this process for half a year or a year, and the kid is always astonished that the thing ends up on the floor. Eventually the kid learns that whenever it drops things, these things are going to fall to the floor, and not to the ceiling. This is a process of inductive inference: you have this experiment, you keep on dropping stuff, you observe that it always falls down, and then your conclusion is that whenever you drop stuff, it is going to fall down. That is inductive inference. The important thing is that you can never really be sure that your conclusion is correct, and this applies to all of science; there is a lot of interesting philosophy of science that tries to explain what it means to learn something at all, how we can explain something, and so on, because we cannot really be certain about it. Humans do inductive reasoning all the time: essentially all our life consists of coming up with good models of the world and performing induction. Here's one more example.
Say you come to every lecture 10 minutes late: I start the lecture, and after 10 minutes you enter the room. For the first couple of lectures I don't really complain, so you conclude: well, maybe she doesn't really care whether I'm late or not. But you cannot be sure; maybe at lecture 10 I really get annoyed, and then something happens that you didn't expect. So here is a situation of uncertainty in your reasoning: we cannot be sure about the conclusions that we make. Now, why am I telling you all of this? Here is the second characterization of what machine learning is: machine learning tries to automate the process of inductive inference, and I find this a very powerful explanation of what machine learning is. Inductive inference means we look at training data (for example, because we always drop things, we have training data) and we build up some hypothesis. This is exactly what machine learning is supposed to do: we give some training examples to the computer, and the computer is then supposed to learn a general rule, to come up with a hypothesis of how it could explain future events, future examples of the same process. The idea is that machine learning automates this process: we maybe give some basic framework, but then the algorithm is supposed to come up with this rule in an automatic fashion. This is an explanation of machine learning that's very general, of course, but I think it really captures what is going on.
Now I would like to discuss a bit why people think this can work at all, or whether it can work. I mean, you see examples that it works, so probably it can work, but there might be some assumptions that we need to make. To do this, I want to consider a particular regression example. What we are given are pairs of input points and output values (x_i, y_i), so x_i is always the input point and y_i the output value. You see a plot of some data; take this very intuitively for now, we are going to make it much more formal later on, but for now it's really about intuition. Look at the data at the bottom of the slide: we have four data points, marked by crosses, so you always see the x value on the x-axis and the y value on the y-axis. What we want to do is learn a general function that can predict the y values from the x values: a function f that goes from the space curly-X, the space of all input points, to the space curly-Y, the space of all output points. Now, if I would ask you in a lecture: what value would you predict if the input value were 0.4? You might want to look at this plot and think about it a bit yourself, but I'm sure the answer most of you will come up with is the following. On the x-axis you have this scale that goes from 0 to 1, and here we have 0.4, the point I'm interested in. What would probably be the output at this point? Well, it's going to be roughly here: if we assume the data follows a straight line, the output at this point might also be 0.4. This is the straightforward kind of conclusion that you could draw from these data points. But you could also come up with other conclusions, and here are two examples. The first guess, the one I've just explained to you, is that these data points have been generated by a linear function, this red line; the red line is a good fit to your existing data, and you can use it to predict a value at the point you're interested in, 0.4. But it could also be the case, as you see here on the right-hand side, that for some reason you don't think it's a linear function: you come up with this wavy function which goes up and down, and, like the red line, it also fits your existing data very well. But if you would use this function to predict, you would get a different prediction: the prediction for 0.4 would now be maybe 0.8 as your output value. The question is: which of these two predictions is better, or which of these two red curves is more plausible? This is one of those points where I would like you to stop the video for a moment and come up with arguments for why the first one might be better, or maybe why the second one might be better. What are the differences, and what might be criteria along which we could decide?
OK, I hope you have come up with a couple of ideas for why each of these functions could be better. Typically, the answers I get in these lectures, if there is a real audience in front of me, are these. Many people say: well, I guess one is better because it's a simpler function, and there's no reason, if you just see the data, to fit it with such a complicated function as on the right-hand side; so we would prefer the drawing on the left-hand side. Some people would also use the words "Occam's razor", because they have heard about this before, and would say Occam's razor tells you to always prefer the simpler solution that can explain your data, and that this is a reason to prefer the left-hand side. All these things are correct up to a certain point, but we will see later in this lecture that there are more twists to this explanation. Then there are also people who tend to argue for the right-hand side. They say: well, maybe we have some background knowledge; we know it's a physical phenomenon, and this phenomenon is not a linear phenomenon but something that goes up and down. Maybe it's the temperature at different times of day, and these points have been recorded at night, with the daytime measurements missing; 0.4 is at daytime, and typically temperature goes up and down between day and night. If you have this background knowledge, maybe guess two would be better and might lead to a better prediction than guess one. So the bottom line I want to make here is: if you don't have any extra knowledge about your data, there is no way to decide which of these guesses is really better. You need to have extra knowledge, or you need to make assumptions. One such assumption could be that the function should be simple; then you would go for the left-hand side. Or the assumption could be that you're trying to fit a periodic function; then you go for the right-hand side. However, you cannot make a prediction if you do not make any assumption, any kind of bias, in one direction or the other.
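(A small sketch of the two guesses, under invented data: the four points below lie exactly on the line y = x, and we fit both a linear function and a fixed-frequency sinusoid; both fit the training points reasonably well, yet they disagree at x = 0.4. The frequency w is an assumption standing in for background knowledge; none of these numbers come from the slide.)

```python
import numpy as np

# Invented training data, lying on the line y = x
x = np.array([0.1, 0.3, 0.7, 0.9])
y = np.array([0.1, 0.3, 0.7, 0.9])

# Guess 1: linear inductive bias -- fit y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)
print("linear prediction at 0.4:  ", a * 0.4 + b)

# Guess 2: periodic inductive bias -- fit y = c1*sin(w*x) + c2*cos(w*x) + c3
# for an assumed frequency w ("background knowledge"); this curve can also
# fit the four points approximately, but extrapolates very differently.
w = 12.0
def basis(t):
    return np.column_stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)])

coef, *_ = np.linalg.lstsq(basis(x), y, rcond=None)
print("periodic prediction at 0.4:", basis(np.array([0.4])) @ coef)
```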
Here's one more aspect. Now assume that I tell you the function values have been generated randomly: I keep on generating random data, simply random points in the unit square, uniformly distributed, and we observe all these data points. You now see that we've drawn many more points, and you can't see any pattern anymore. If I now ask you what your prediction at the point 0.4 is, you would probably say: well, in fact I don't know; it could be anything between zero and one, and I have no particular reason that it should be 0.4; it could be anything else as well. So here's the insight: if there is no pattern that connects the input to the output value, you won't be able to predict anything. I would like to summarize this discussion now. The first consequence to take away is: we will only be able to learn if there is something to learn in our data. This "there is something in the data" sounds very trivial, but in practice it is often not so obvious. Say you have certain input data in, say, medicine, and you want to predict a certain output, say a particular type of disease, and your input data is the temperature of the person, what the person has been eating during the last days, the age of the person, and the shoe size. Maybe this data is not enough to predict this particular disease; in that case there is no connection between the input data and the output data, and you can try whatever you want, your machine learning algorithm is never going to succeed. This is something really important to keep in mind when doing machine learning: it sounds trivial, but in practice you might stumble into this problem very often. So the first thing: the output needs to have something to do with the input.
Often, a kind of bias or assumption we make is that similar input points should lead to similar output values. Again, if you have certain patients and you want to predict a disease, and you have two very similar patients, the intuitive view would be: these patients behave so similarly, so probably they have the same disease. This is the kind of inherent rule that governs many, many machine learning algorithms. Of course this is very abstract, but still, this is what is in the background of machine learning in many applications. The next thing is that there needs to be a simple relationship, a simple rule, that can predict the output from the input. If the function is extremely complicated, if your function is a fractal and you are supposed to learn this fractal from ten data points, it is very unlikely that you're going to succeed. So the function needs to be reasonably simple in order for you to succeed. The more training data you have, the more complicated a function you will be able to afford, but there needs to be some simple relationship; you can't learn the most complicated function from just three data points, unless you make very, very strong assumptions. With the last point, that we tend to look for a function that is simple in some respect, we need to be a bit careful: we are going to see later, in statistical learning theory, what "simple" really means. It is sort of Occam's razor, but not just Occam's razor; there are more aspects to this, and we are going to discuss it towards the end of the lecture, once we've seen statistical learning theory.
What is now important is that these assumptions, which we have on this slide, are rarely made explicit. People run machine learning algorithms, they press many buttons, they try things out, they look at training and test errors, and so on. However, you need to be aware that these assumptions are always made in machine learning, even though it's often a bit unclear what exactly the specific assumptions are that a certain algorithm makes. So always keep in the back of your mind that whenever machine learning is applied, assumptions are being brought in, and if these assumptions are wrong, it's very likely that the function you learn is also wrong. You might want to be aware of what the assumptions are that really go into your particular application.
The second consequence: we said we are going to look for a simple function and so on, but the more important thing is that we need to have an idea of what we are looking for, and this idea of what we're looking for is called the inductive bias of a machine learning system. As in the previous example, we need to say in advance whether we are looking for a linear function or for a periodic function. This is our inherent knowledge about the data, about the phenomenon that we are trying to model, and it is called the inductive bias. I want to give you a bit of intuition for what this really means, so I now want to show you a simple example of what this inductive bias means and why we really need it. For this, let me simply draw an example. What we're going to look at is a space that is just one-dimensional: we have points between 0 and 1, and the space consists of a grid, say the points 0, 0.01, 0.02, and so on. These are our input points, and the output space has two labels. Our training data could look as follows; maybe I draw the training data in red. We could have one training point here whose output, if I make a y-axis here, is 1, and maybe we have another point here whose output is minus 1, say. (On the slide the two classes are written as 0 and 1; don't worry, we just have two different classes, say plus 1 and minus 1.) Now assume we have seen a couple of training points: these two red points, and then a couple more points here. This is our training data, and the idea is that we want to learn a function that predicts, for the remaining data points, what the output value is going to be: is it minus 1 or plus 1? So we want to learn a function f-hat that goes from the input space X to the output space Y. Now we consider different situations.
In the first case, we assume we do not have any inductive bias: any of the functions that go from the space curly-X to curly-Y could be the correct function. This sounds great, because you want to say: oh, I don't want to restrict my system; I don't really know what the process is that generates, say, this particular disease based on the genetic information; I don't have any clue, so I don't want to make any assumption that pushes my algorithm into a certain region. I simply want to learn without prejudices. So you don't impose any inductive bias. What are you going to do? Our function space, maybe I put this here: the space of all functions is always going to be denoted by curly-F; this is the space of all functions f that go from the input space to the output space, all functions. Now, how many different functions do we have? Our data space X contains about 100 points, and each point can be mapped to either minus 1 or plus 1, so we have 2 to the 100 many functions in this space. If we write it with the absolute-value notation: the number of functions in this function class is 2 to the 100. These are really a lot of functions. But it sounds good: we have a powerful function space that can model all possible things, and we want to learn without prejudices. We now record our first couple of data points; say we have the five red data points which are here in the plot, and assume we don't have any noise. We assume we are in a perfect situation where the training points we get always give us exactly the correct answer; it's not like in a medical setting where you have some uncertainty. So we live in a world without any noise, which is also a simplifying assumption. Now we have seen five training points, so we know, for example, that this particular point here is going to be a plus one. What does that help us? We can now say: all those functions in the space that would assign minus one to this point, we can rule out; we can simply throw them out, because we know they are not the correct function, since we're in a noise-free situation. Similarly for all the other data points: for each data point that we have, we can rule out all the functions that do not fit this particular data point. What this means is that after seeing these five data points, our function space, the one that still contains the functions that might be correct, is a bit smaller. This function space, maybe I call it F_5 after we have seen five points, is now smaller: it only contains 2 to the 95 many different functions.
OK, so now we've seen these five training points, and we want to predict at a particular test point; maybe I put it here, this is the point x' where we want to predict. What are the possibilities? For this point, we have 2 to the 94 many functions that predict that the label of this point is minus 1, and we have another 2 to the 94 functions that predict that the label of this point is plus 1. So in our function space we have those functions that say f(x') is plus 1 and those that say f(x') is minus 1, and each of these two sets contains 2 to the 94 many functions. Now what are we supposed to predict for this new data point? We don't have a clue: there are as many functions voting for plus 1 as for minus 1, and nothing tells us which of these functions is more plausible for this particular data point. This is where the inductive bias kicks in: if we do not have any bias, there is no way to decide what the correct function is; we need this inductive bias, otherwise we're doomed here. And the trick is, this continues: you might say, well, maybe five data points are not enough, maybe I need 10 more data points. OK, you take 10 more data points, but again, for a new data point you still don't know. No matter how much data you are going to record, for a point that you haven't seen before you will not be able to predict anything. This is really the important point about machine learning: if we do not make any assumption, if we say we do not want to have any prejudice, we do not want to make any assumptions, we do not want to restrict the space of functions in which we are looking for a solution, then it is not going to work.
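(The counting argument can be written out compactly; this is just the lecture's numbers in formulas.)

```latex
% Grid of 100 input points, binary labels:
\[
  |\mathcal{F}| = |\mathcal{Y}|^{|\mathcal{X}|} = 2^{100}
\]
% Each noise-free training example (x_i, y_i) rules out exactly half of the
% remaining functions, so after k examples:
\[
  |\mathcal{F}_k| = 2^{100-k}, \qquad \text{e.g.}\ |\mathcal{F}_5| = 2^{95}
\]
% At any unseen test point x', the surviving functions split evenly:
\[
  \bigl|\{ f \in \mathcal{F}_k : f(x') = +1 \}\bigr|
  = \bigl|\{ f \in \mathcal{F}_k : f(x') = -1 \}\bigr|
  = 2^{99-k}
\]
% so the observed data carry no information at all about f(x').
```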
What I've shown you here is a bit informal; there exists a more formal way of stating the same result, and it is called the no free lunch theorem. The no free lunch theorem essentially says that there is no machine learning algorithm that can always succeed: you always need to make assumptions. I don't want to get more formal about the theorem here; there are some slides later in the lecture if you want to look at that, but I think for our purposes this informal reasoning is enough.
We now consider a function space that consists of exactly two functions: the function that is constantly zero, and the function that is constantly one. So the function always predicts the same label: no matter what the input point is, it's going to predict zero, or it's going to predict one. This is our inductive bias: maybe for some reason we know that the input data is completely meaningless, and the output is always going to be zero or always going to be one; we just don't know which. Again we assume that there is no noise in our system. Now we observe one particular data point, and once we've seen this one data point, we know which is the correct function, because we observe whether its output value is zero or one. After observing it, because we don't have any noise, we know by our inductive bias that the true function is either the constant-0 function or the constant-1 function, and then we can predict for all other values in our space, and we are always going to get the right answer. Of course, this is a bit simplistic. So I've shown you two extreme cases: one extreme case where we say we don't have any inductive bias, and we've seen that then we cannot learn anything at all; and the other case where we have a very, very strong inductive bias, namely we say it's just one out of two functions and we don't have any noise, so with one training example we can explain the world.
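(A tiny sketch, my own illustration: under this extreme bias the function space has two elements, and a single noise-free observation identifies the true function, after which we can predict everywhere.)

```python
# The entire function space under this very strong inductive bias:
F = [lambda x: 0, lambda x: 1]        # constant-0 and constant-1

def learn(x_seen, y_seen):
    """With no noise, one labeled example leaves exactly one consistent function."""
    consistent = [f for f in F if f(x_seen) == y_seen]
    return consistent[0]

f_hat = learn(0.37, 1)                # observe a single point with label 1
print(f_hat(0.9), f_hat(0.05))        # predicts 1 everywhere
```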
Now, the truth for many machine learning applications is obviously somewhere in the middle, and a major part of machine learning consists in finding a good function space for your particular application. This problem of finding a good function space is called model selection. Of course, the examples we had here are really simplistic, also in the sense that we didn't consider any noise, and we didn't consider what happens if the function space F does not even contain the correct function, which in practice might happen very often: say for some kind of medical example you make certain assumptions about what the function might look like, but maybe in truth it looks a bit different, and ideally your machine learning algorithm should still work, at least up to a certain accuracy. Figuring out all these details, how these things fit together, like the amount of training data, the model selection problem (which kind of functions to use), the amount of noise, and so on, is really tricky, and all of machine learning is essentially the science of solving these problems. It is one of the big success stories of machine learning, and in particular of the theoretical part of machine learning, that at least for some standard algorithms we have worked out exactly how these things play together; it is really well understood, and at the end of this lecture course you will at least understand the rough picture.
There are two important terms when it comes to model selection, to selecting a good function class, and these terms are overfitting and underfitting. Again, this is supposed to demonstrate why finding a good function class is so crucial. Consider the following example, which we see on the slide. The true function is a quadratic function, so you have this parabola; in the plot it's the black function. Your training points have been generated from this function with a bit of noise: you see the green crosses in the plot, which are the training points that roughly follow the parabola, but with a bit of noise. Now you want to learn this function, and there are different choices of function class that you could take. You could say: I fit a really simple model, a linear function, and you might come up with this red line. Or you say: I want to fit a polynomial function of degree 20, and you come up with this blue line. What you can see now is that there are different reasons why these things go wrong. The red line somehow seems too simple; it doesn't even fit the training data, and this is what is going to be called underfitting. The blue line is so extreme, it tries to fit each little aspect of your training data, and this is going to be called overfitting; this is also wrong. Here are the explanations for what these are. Overfitting: we can always find a function that explains all the training points very well, or even exactly, but those functions tend to be complicated, and they tend to fit the noise as well. This is what we saw on the previous plot: the blue line goes nearly through all the data points, even though there is noise on the data points, and maybe the true function is not supposed to go through all the data points, because there is noise and we don't want to model the noise as well. Predictions on new data points tend to be poor for these kinds of overfitting functions, because the function is somewhat too complicated. Later we are going to see that overfitting is characterized by the fact that we have a low approximation error and a high estimation error; don't worry about this now, we are going to see it in the next lecture.
The opposite effect is called underfitting: here your model is too simplistic. You want to use a linear function even though your data cannot be described by a linear function. The advantage is that the estimated function tends to be very stable with respect to noise: if you add a couple of data points or wiggle them a bit, this linear function is not going to move a lot. But for unseen points, again, the predictions are going to be poor. The regime of underfitting is characterized by the fact that there is a large approximation error and a low estimation error; again, we are going to talk about this in the next lecture.
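(A minimal sketch of the two failure modes; the slide uses a degree-20 polynomial, here I use degree 14 on 15 points so the overfit interpolates the data exactly. All data and numbers are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 15)
y = x**2 + 0.05 * rng.standard_normal(x.size)   # noisy parabola

line   = np.polyfit(x, y, deg=1)    # underfitting: a straight line
wiggly = np.polyfit(x, y, deg=14)   # overfitting: threads through the noise
                                    # (numpy may warn that this fit is
                                    # poorly conditioned, which is the point)

x_new = 0.5                          # an unseen input
print("underfit prediction:", np.polyval(line,   x_new))
print("overfit prediction: ", np.polyval(wiggly, x_new))
print("true value:         ", x_new**2)
```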
At the end of this lecture, I want to show you that this notion of an inductive bias is not only used in technical systems; it is used in all systems that are supposed to learn, in particular also in animals and humans. I want to stress again that there cannot exist a learning system that does not have an inductive bias, and we as humans also need to have inductive biases, otherwise we would not be able to learn. Now, we don't want to run experiments on humans to figure that out, but there have been experiments with animals, in the 1960s, which tried to show this, and I want to explain what they are about. This is an experiment with rats. Consider a rat that has a choice between two types of water: there are two water bowls, one with normal water, and the other with water that makes the rat sick. The rat is supposed to learn to avoid the water that makes it sick and only drink from the normal water. Now, if there were nothing by which the rat could tell the two apart, no feature that shows the difference, then the rat wouldn't be able to learn that. But there is a feature: in the first experiment, the two types of water taste differently. One type of water tastes neutral, and the other one has been sweetened with sugar, so it tastes sweet. Now the rat drinks, and if it drinks from one type of water it gets sick, and if it drinks from the other type it doesn't. As has been observed in these experiments, the rats learn very, very fast to avoid the water that makes them sick: even if you put the bowls in different spots and so on, the rat will try a tiny bit of the water, and if the sweet water is the one that makes it sick, it won't drink water that tastes sweet. The rat learns this very fast. OK, so far so good, nothing really surprising. Now there was a second experiment. It's again the same setup: you have two bowls of water, they can be in different places in the cage, one of the waters makes the rat sick and the other one doesn't. But now the feature by which the rat can distinguish the two types of water is not the taste.
Both waters taste the same, but one type of water is accompanied by certain sound and light effects. Say one of the waters is in a room that has a red light, or there's a certain sound that you can hear when you're close to that water; in the paper they write about audiovisual stimuli. So you have certain sound and lighting conditions, and one type of water comes with these conditions, this particular sound, and the other type of water does not. Now again the rat is supposed to learn which type of water makes it sick and which one doesn't, and the surprising thing is: the rat cannot learn it. The rat does not learn the connection between the fact that this water makes it sick and the lighting conditions in the room. So apparently the rat does not have an inductive bias that could help it make the connection between lighting conditions in the room and water that makes it sick. If you think about whether this is plausible or not, you can of course come up with a plausible explanation in hindsight. If a rat out there in the wild needs to taste food, and the food tastes funny, then maybe the food is rotten and the rat doesn't want to eat it anymore. So connecting the taste of food with whether this food makes it sick is something very natural, and the rat has this connection sort of wired into its brain. But lighting conditions in a room typically don't have anything to do, at least in nature, with whether some food is rotten or not: you can look at food at night or during the day, in one situation it's bright and in the other it's not, but that doesn't have any influence on whether this food makes you sick or not. So, apparently, the brain of the rat is not able to make this connection. The rat has an inductive bias; it simply cannot learn this function, and there's no way it can overcome this: the bias of its brain is that it cannot learn it. This effect has been investigated a lot in psychology; it's called the Garcia effect, because it was published by a researcher called John Garcia and his co-workers in the 1960s. You can see one of the references on the slide, but there are many more references out there.
So what is now the bottom line about the inductive bias? Any successful learning algorithm has an inductive bias: we tend to select hypotheses from some restricted, smaller function spaces, because it helps us to focus on the functions that are important. Whether the function that has been learned by the algorithm is then close to the truth really depends on whether this function class was well selected for the problem at hand. We haven't really been talking about this at all, but it is obvious: if your function class contains linear functions, but the phenomenon that you're trying to model is a periodic function, then no matter what you do, it's not going to work out. The important message that I want to give you now is: for some algorithms it's sort of obvious what the inductive bias is going to be, and we are going to discuss that; for other algorithms it's not obvious, but there has to be an inductive bias. Machine learning is impossible without an inductive bias, and it is important to keep that in mind, in particular if you get funny results: maybe your inductive bias is wrong. Even if you have results and they look really good, you might want to ask yourself at some point whether the inductive bias is really the correct one or the wrong one. All these points are going to be made more precise in the rest of this machine learning lecture, so I'm hoping that you're going to stay with us.