YouTube Transcript: Probabilistic ML - 25 - Revision
Summary
Core Theme
This content is a comprehensive recap of a probabilistic machine learning course, emphasizing the foundational role of probability theory, Gaussian distributions, and various approximation techniques for building and understanding complex models.
Transcript
So what I have prepared now for the next, well, let's see how long it takes, is no new content obviously, but a quick summary of everything we've gone through, at least the highlights, the main ideas, not the highlights from a novel-ideas perspective but the foundational stuff that I've covered over the last 24 lectures. And I'll try and go through that. But you can also stop me at any point in time with questions, right? So I don't even mind if we don't make it through all of the slides, something like twenty-odd. But that's okay.
So let me begin. We started in lecture one with the laws of probability, which all of you have encountered in previous lectures before. I've already written them down here on this slide, copied over from lecture number two, where I introduced probability densities as representations of probabilities over continuous spaces, as the natural object that takes on the same structure we also have in discrete probabilities, namely the two main rules of probability theory, the sum rule and the product rule. The sum rule is the rule for getting rid of something you don't know, and the product rule is the rule for conditioning on something you do know and then talking about the stuff you don't know. The two of them taken together amount to Bayes' theorem, which is the fundamental rule for inference, everywhere, not just in science but
literally everywhere. By the way, when I walked in here I saw a sheet from my colleague Martin Butz, who works on cognitive architectures. It's the cheat sheet he hands out to his students, the Formelsammlung, and it actually contains Bayes' theorem as well, and variations of it. So cognitive science is clearly also an application field for continuous probability distributions. We have to be a bit careful, though. There's this minor thing that somewhat annoyingly complicates life: if you rescale the domain over which you assign a probability, then you have to make sure the density is correctly adjusted by multiplying with the determinant of the Jacobian of the transformation.
Okay, so that's the fundamental rule. That's just a philosophical statement about how to reason in a situation where your information sources, your data, do not fully provide the knowledge about the variable x that you'd like to know.
But unfortunately these abstract statements, which are essentially manipulations of functions or functionals, are effectively intractable in the general case. If every term in here is a general function, some general probability distribution that just sums to one over z and over x, then this object here, well, you can multiply it pointwise, but even the integral in the normalization constant is difficult to evaluate. One way to think about this is that if you have a discrete probability distribution, then the number of possible states you have to consider here is all the possible combinations of these discrete variables. Say they are binary random variables; then the number of configurations to consider grows as two to the number of variables. So in this case, if x and z are both binary, we already have two to the two, four, and if you had three, four, five, six variables, then the complexity of this inference process would grow exponentially with the dimensionality of the problem. And so the entire course from lecture three onwards was all about making things tractable, reducing computational complexity so that we don't have to consider the combinatorial set of possible hypotheses, which is obviously not possible for continuous spaces in any case.
And the key tool for this was Gaussian distributions. If all variables in a generative model are jointly Gaussian distributed and linearly related to each other, then inference maps to linear algebra. And I wrote down these initially cryptic equations, which by now I hope you have developed an eye for. So if you have a prior distribution over a latent quantity x and observe data y with a Gaussian likelihood that links the data linearly to x, then both the evidence and the posterior are closed-form expressions.
Gaussians are parameterized by a mean and a covariance, a vector and a matrix. And this vector and matrix, this point estimate and uncertainty estimate, have an interesting structure that, though it's a long formula, amounts to linear algebra. You have to build some matrices and then solve a linear system of equations, which we tend to write as an inverse of a matrix multiplied with a vector. But as we learned, you should really think of this joint object, inverse matrix times vector, as another vector that solves a linear system of equations.
And we learned how to interpret these quantities. This object here, this Gram matrix, is the expected variance of y under the prior, or rather of y minus the mean prediction. This is what you thought you would see, your best guess for what you're going to see; this is what you did see; and this is the residual between the two. This is the scale on which you expect that residual to vary. So this times this gives us some kind of innovation, how surprised we are about the data we got to see, and then we project, push this towards the variable that we actually care about, because this quantity is the covariance between this object and x. Similarly, the posterior covariance also has this update form that looks a bit like a Schur complement. There is an initial covariance, the prior covariance, and then a correction which involves the same matrix inverse that's in here as well. And this means that adding uncertainty in Gaussian models doesn't increase computational cost, because you can reuse the computational effort used to find the mean estimate. That's a first insight: sometimes being probabilistic, having uncertainty, doesn't actually have to come with a cost overhead.
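A minimal sketch of that linear-Gaussian update in code (my own toy example, not the course library): prior x ~ N(m, P), likelihood y = A x + noise with covariance R. The same Cholesky factorization of the Gram matrix A P Aᵀ + R is reused for both the posterior mean and the posterior covariance, which is the "uncertainty is almost free" point.

```python
import numpy as np

def gaussian_condition(m, P, A, R, y):
    """Posterior over x ~ N(m, P) after observing y = A x + eps, eps ~ N(0, R)."""
    gram = A @ P @ A.T + R                      # expected covariance of y under the prior
    L = np.linalg.cholesky(gram)                # one decomposition, reused below
    residual = y - A @ m                        # innovation: what we saw minus what we expected
    # Solve gram @ z = residual and gram @ W = A P, both with the same factor L.
    z = np.linalg.solve(L.T, np.linalg.solve(L, residual))
    W = np.linalg.solve(L.T, np.linalg.solve(L, A @ P))
    mean = m + P @ A.T @ z                      # posterior mean
    cov = P - P @ A.T @ W                       # posterior covariance (Schur-complement-like update)
    return mean, cov

# Tiny example: 2-d latent variable, one noisy scalar observation.
m, P = np.zeros(2), np.eye(2)
A, R = np.array([[1.0, 0.5]]), np.array([[0.1]])
print(gaussian_condition(m, P, A, R, y=np.array([1.2])))
```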
We also noticed, well, I'm going to talk about that in a second. In the next lecture I then talked at quite some length, I made you sit through a long lecture about linear algebra, the tools we use to solve problems of this form, and we discovered that there's actually nothing new hidden underneath this linear problem, because it's just more instances of Gaussian inference.
The path to that was to discover, or mentally do the exercise, that the underlying structure is that a model like this is of an exchangeable nature, in the sense that I can start with the first entry in y, the very first datum, condition on it, and then, because of this equation, I get back a new Gaussian that I can condition on the next datum, one after the other. So Gaussian inference is a form of closure: you start with a Gaussian, you get a Gaussian data point, you get back a Gaussian. That means we can split up our data set into individual terms, scalar observations, one after the other. And if you do that, then these inverses here are just scalar inverses, and we never need to call some fancy numerical linear algebra code, at least in principle. It's still probably a good idea to do so if you have a decent-sized data set. But in the corner cases of very large data sets, and sometimes also very small data sets, it can be useful to realize that the algorithms we tend to use to solve such linear systems of equations, something like this, so an algorithm that consists of a matrix decomposition first and then returns a solver-like object, a function that we can call on right-hand sides, algorithms like this are just methods that do Gaussian inference in a sequential fashion.
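Here's a small sketch of the scalar, one-datum-at-a-time view described just above (my own illustration with an assumed linear-Gaussian setup): processing one scalar observation at a time, the only "inverse" that ever appears is a division by a scalar.

```python
import numpy as np

def sequential_gaussian_updates(m, P, A, sigma2, y):
    """Condition x ~ N(m, P) on scalar observations y[i] = A[i] @ x + noise, one at a time."""
    for a_i, y_i in zip(A, y):                   # a_i is one row of A, y_i one scalar datum
        s = a_i @ P @ a_i + sigma2               # scalar predictive variance of this datum
        gain = P @ a_i / s                       # the only "inverse" is a scalar division
        m = m + gain * (y_i - a_i @ m)           # update the mean with the scalar innovation
        P = P - np.outer(gain, a_i @ P)          # update the covariance
    return m, P

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))                     # 50 scalar observations of a 3-d latent x
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.1 * rng.normal(size=50)
m, P = sequential_gaussian_updates(np.zeros(3), np.eye(3), A, 0.1**2, y)
print(m)                                         # should be close to x_true
```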
For example, this is true for the QR decomposition, for the LU decomposition, for the Cholesky decomposition, for conjugate gradients, for preconditioned conjugate gradients, for Lanczos algorithms, for the decomposition of matrices into SVDs, and so on and so on. All of these algorithms iteratively project, that's this policy up here, the rows or columns of our matrix A, or A transpose, onto the current posterior of a Gaussian distribution to construct a set of conjugate directions, using the Gram-Schmidt process or modifications thereof, to basically build a data structure in memory. That data structure is either numerically stable, in the case of methods like the QR decomposition, LU and Cholesky, where the policy is chosen such that everything aligns neatly and there is very little overlap between the directions, or the directions are chosen such that they provide maximal information about some right-hand side b, or about the solution for some right-hand side b, as in the case of conjugate gradients or Lanczos iterations, at the cost of numerical stability but with an increased gain in convergence. Why was this useful? Well,
one direct, tangible outcome of this is that we can actually think of linear algebra methods as a form of data loader in the Gaussian case. They allow us to pick informative views onto the data set. And that might mean that we don't have to load the entire data set. You can stop the computation after some reasonably short computational effort compared to solving the full problem, and provide an uncertainty as an output of the method that actually takes into account the fact that we've only loaded parts of the data, rather than guessing what the uncertainty should be.
So over the course of these lectures, one, two, three, four, actually no, three, four, and then also five, we kept building a software library on the side, which I don't have on the slides, that implements this Gaussian mechanism in an abstract data class, effectively, with a bunch of methods. And I did this on purpose, to highlight that Gaussians come with all this functionality: as soon as you know something is Gaussian, you can apply all of these mechanisms directly. Now, there's your question.
>> Is that directly related to the fact that you can stop early, or are those two separate benefits?
>> Huh?
>> Okay, so that's a good question. The question is: I said something about being able to stop early and about thinking of these methods as a data loader, and how are the two related to each other. I would say, I mean, the reason to stop early is that you think you've learned enough, right? So maybe the formal statement is that any of these solvers of this form can be stopped at any point in time, including after zero iterations. This is, for example, also true for the QR decomposition or the Cholesky decomposition,
and this is maybe useful to point out on its own, because in numerical math lectures you will typically hear about methods like the Cholesky decomposition and the QR decomposition as so-called direct algorithms. So people will say that these are methods you have to call, and then you have to stand back and wait for a cubic amount of time, and then they'll give back the true solution. But in fact you can stop these methods early. It's just that they tend to have, in the average case, very slow convergence. There isn't really a reason to expect them to converge quickly, and that's probably why people tend to wait for them to
complete. In fact, there's a theorem, if you really want to read up on this (it's unrelated to the exam, though), in the appendix of the Hestenes and Stiefel paper from 1952 which introduces the conjugate gradient method. The inventors of the conjugate gradient method pointed out that conjugate gradients were necessary because these methods converge slowly, and they have a worst-case construction showing that if you run one of these direct methods, like QR or LU or Cholesky decompositions, then you can pick the linear system of equations such that the error of the mean that comes out of the method increases in every single step, gets worse and worse all the time until the very end, and then in the very last step everything converges and you get a perfectly correct answer. That's not true for the more iterative methods like conjugate gradients and Lanczos-based SVDs and so on.
And so people tend to think about these as different things, or actually maybe as connected things: if you have methods that tend not to be good for a long time, then why run them only for a short amount of time? But if you think of these methods as quantifying uncertainty, then it becomes more natural to think of them as stoppable at any time, because what's going to happen in a situation where the algorithm just doesn't learn anything from all the iterations is that the uncertainty won't collapse either. You'll just have a big error bar that stays pretty much constant for a very long time and then suddenly collapses to zero.
Yeah. So if you have a data loader that is useless, that always loads the least informative data, then maybe it's not a good policy to stop early. But if you happen to have a data loader that is very efficient at loading the right data, then you might want to stop early, and conjugate gradients for linear Gaussian problems are an example of a very efficient data loader.
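A tiny sketch of that stoppable-solver idea (my own code, not the course library): a plain conjugate gradient loop on a symmetric positive definite system that you can cut off after any number of iterations; the residual that is left over measures how much of the right-hand side the "loader" has not explained yet.

```python
import numpy as np

def conjugate_gradients(A, b, max_iters):
    """Plain CG for A x = b with A symmetric positive definite; stoppable at any iteration."""
    x = np.zeros_like(b)
    r = b - A @ x                        # residual: the part of b not yet explained
    d = r.copy()                         # first search direction
    for _ in range(max_iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # step length along the current conjugate direction
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # next conjugate direction (Gram-Schmidt-like update)
        r = r_new
        if np.linalg.norm(r) < 1e-10:
            break
    return x, np.linalg.norm(r)

rng = np.random.default_rng(1)
M = rng.normal(size=(100, 100))
A = M @ M.T + 100 * np.eye(100)          # well-conditioned SPD test matrix
b = rng.normal(size=100)
for k in (1, 5, 20):
    x_k, res = conjugate_gradients(A, b, k)
    print(k, res)                        # residual shrinks as more informative "data" is loaded
```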
So what do we do with these tools? Together with the software tools that we developed, they became the foundational framework for maybe the most important task in machine learning even to this day, at least as an abstract task, namely learning functions: regression. We initially looked at real-valued functions,
where we began with a traditional, classic perspective on parametric regression. So the assumption is we have a supervised machine learning problem: there will be pairs of inputs x and outputs y. We make the assumption that the function that links x to y has a parametric form; it is given by some features of x, a function phi evaluated on the data x, weighted with some unknown weight vector w. That's a generic formalism. This phi of x we could also think of as a deep neural network, or we could just think of it as a bunch of features you pick by hand.
Then, in such a situation, the key observation is that the weights enter linearly in this problem. This may well be a very nonlinear function of x, but it's a linear function of w, and that means we can directly, without changes, apply this Gaussian formalism. We put a Gaussian prior over the weights, which, by the properties of Gaussians, also implies a Gaussian prior over the function values. Then we use a Gaussian likelihood for the observations. So those are the two key assumptions: Gaussian and linear relationship. Then we can look up everything that was on the previous slides and find that the posterior over the weights given the data is a Gaussian, which involves these linear algebra expressions for which we can use these...
>> Yes?
>> That's actually a very good observation. It's a standard assumption in regression to say that the noise on each datum is always the same and that the data are independent of each other. Technically, for what's here on the slide, this is not necessary. You could have a joint, complicated covariance matrix here: say there are n data points, it could be a full n by n symmetric positive definite matrix. Such a model has the fancy name of a heteroscedastic problem, one that has different noise on every datum, where in fact the data points are related to each other.
In textbooks you tend to see this assumption with scalar noise. The reason for this, I think, is, well, there are certainly applications of regression where you actually have to model different noise on every datum, but there are very few reasons why you might computationally have to use a structure like this. It's still very common, though. Maybe one first observation is that if we had data with a joint, full-rank, non-diagonal covariance, then we could use a singular value decomposition of the covariance matrix to project our data set y onto a new data set with independent observations.
Those would then still have a diagonal covariance, but we could rescale the observations so that they have a scalar covariance. So in some sense this is not a strict restriction, because you can take any data set and remap it such that it has this form, but then it might lose a lot of the physical interpretation, right? It's not necessarily a measurement anymore. The other thing is that in statistics, not in probability theory but in statistics, the i.i.d. assumption is very fundamental. The assumption that you get a lot of data of the same type, i.i.d. from some source, is important for all sorts of theoretical analysis, to show convergence rates, the stuff you heard about across the corridor. And so this amounts to making the assumption that all the data are distributed independently of each other, that's what the first "i" stands for, and that they have the same, identical distribution.
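Going back to the regression model itself, here's a compact sketch of that weight-space view (my own toy features, scalar i.i.d. noise as on the slide): a Gaussian prior on w, a Gaussian likelihood, and the posterior obtained by linear algebra.

```python
import numpy as np

def features(x):
    """Hand-picked feature map phi(x); any functions of x would do here."""
    return np.stack([np.ones_like(x), x, x**2, np.sin(x)], axis=-1)

def bayes_linear_regression(x, y, sigma2=0.1, prior_var=10.0):
    """Posterior N(mu, S) over weights w for y = phi(x) @ w + Gaussian noise."""
    Phi = features(x)                                     # (n, F) feature matrix
    F = Phi.shape[1]
    S_inv = np.eye(F) / prior_var + Phi.T @ Phi / sigma2  # posterior precision
    S = np.linalg.inv(S_inv)                              # posterior covariance (F x F, constant in n)
    mu = S @ (Phi.T @ y) / sigma2                         # posterior mean
    return mu, S

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=40)
y = 0.5 * x**2 + np.sin(x) + 0.3 * rng.normal(size=40)
mu, S = bayes_linear_regression(x, y)
print(mu)                                                 # weight estimates, with uncertainty in S
```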
Okay. So if we do this, then we get a framework with which we can learn functions. And we spoke about these models for quite a while. I highlighted that you're very free to pick the features here. This feature set phi can be literally anything you want, including functions that are discontinuous, even including functions that are technically unbounded. You could even have a framework where some of these features return NaNs or infinities, and then you have to be a bit careful about how you implement the inference, but it's still possible to do this correctly.
And that's simultaneously maybe a blessing and a curse. It means that if you really know something about your problem, you can build very powerful, very well-designed algorithms. For example, I know of applications, some of them published, in major companies where engineers invested a large amount of time finding good features. For a while, a common example that people talked about in the community, a decade or so ago, was when Facebook became big. They hired several people out of another company, for example Joaquin Quiñonero Candela, who was originally in Tübingen as a postdoc and then became a head of AI at Facebook, and he built a massive tool for ad placement on Facebook, which some of you may have suffered through because you were the subject of it, right, if you were ever on Facebook. It's the algorithm that decided which ads to show, and to get that to work he had a big team of engineers, and they built a software platform that allowed the engineers to trial features. So they
could provide a new example feature phi, something like, I don't know, "I'm looking for customers who recently announced their wedding on Facebook", some way of posting that, because that probably means you're going to buy some very special stuff, or something about babies, or about changing your relationship status, and so on. And they had a tool for people to trial these features: they could get something like 1% of the population in an A/B test setup, and then they would continuously compute the evidence, where is it, it's going to be on the next slide or the one after, but it's basically this expression up here, where the A's now include the features, to check how well this works as a predictive performance tool for whether people would click on some ad, and if a feature worked well the system would automatically load it and add it in. That was pre-deep-learning, but it was an approach that scaled to very large user bases.
You could do that with a few hundred million people, because you get to manually design this kind of process. But of course it's also sort of disappointing as a designer, if you want to build a learning machine, something that automatically decides how best to extract information from a data set, to then have to sit down and manually decide how the machine should do that. And so one of the main trends in machine learning, basically including deep learning, is the attempt to automate this process in various ways. And over several lectures we discussed how to automate that process.
A first very powerful, and historically for a while very successful, approach was to ask what happens if we increase the number of features towards some asymptotic limit of infinitely many features. In some sense we have to be very careful, of course, about how we do this limit process of infinitely many features. We did this and discovered that there is this interesting structure, it's on the previous slide, that to compute these predictive distributions, at least if we think about them as function values, there will never be a lonely phi in the expression. If you make predictions not about the w's but about the function values, then you get a phi Sigma phi expression here, and up here as well phi Sigma phi. They are always inner products, and inner products are sums, and some sums even work if they have infinitely many terms, because they have analytic expressions that don't require you to actually go through the sum. You already know this from introductory math lectures about series.
And that led to the idea of kernel machines, algorithms which replace these inner products with functions that can directly be thought of as evaluations of inner products in, in general, infinite-dimensional feature spaces. And the associated notion in probability theory is that of a Gaussian process, a nonparametric model that in some sense has an infinite number of degrees of freedom that it can fit to a data set.
So Gaussian processes, I removed the code here, from a computational perspective can be thought of as a lazy evaluation of a Gaussian model. You write a piece of code that basically waits until it is given a data set, and once it's given this finite data set, it constructs a finite matrix, a kernel Gram matrix, that's this matrix here, or actually this whole thing, solves this finite linear system of equations, and then stores the resulting values, which are a vector of finite length, and the matrix decomposition of this covariance, some matrix decomposition of finite size, which might be for example a Cholesky or something else, updated using the algorithms that we've encountered before. It then provides access to a function, a kernel and a mean function over here, which can basically act as a closure for a new Gaussian process.
We studied these algorithms for quite a while and discovered that they have very interesting properties. For example, we did an example of constructing Gaussian process predictions on a relatively simple but structured one-dimensional data set, the Keeling curve of atmospheric CO2. We discovered that you can actually use different kernels. There is some kind of standard set of kernels from which people tend to choose: things like periodic kernels, the Matérn family of kernels, which includes as a limit case the squared exponential, a very common kernel that is in some sense quite deeply flawed, but also the Wiener process, the integrated Wiener process, and so on. And actually there is more than just this finite collection of kernels, because you can construct new kernels from old ones through four properties of kernels: scaling a kernel gives back a kernel; transforming the inputs of a kernel in an arbitrary fashion, as long as you apply the same transformation to both inputs, gives back a kernel; adding kernels gives a kernel; and multiplying kernels gives a kernel. So they form, if you like, a semiring of bivariate functions that are all positive definite, and these properties can be used to build maybe more flexible models than we had in the parametric form, but again a relatively expressive language in which designers and engineers can encode knowledge about the specific problems they are trying to address. I find it very important to point this property out, because people who try out Gaussian process models for the very first time on a particular data set tend to think of them as some kind of black box. The standard setting is that someone gives you some data set in a professional environment, so you open up some SciPy, or scikit-learn sorry, implementation of Gaussian process regression that uses some standard kernel, don't think about it, and then it doesn't work particularly well, because the standard kernel, which is usually a Gaussian kernel, is not a good choice. And then people say, oh, so I tried a Gaussian process, it didn't quite work so well, so I did something else, and then they spend a lot more time trying to fit some large deep learning model, and because they get invested in it they then make it work. So methods like this still work well on many data sets if you are careful to design them right. A part of this process of designing them right is to pick smart kernels that describe what you are trying to do. If the data set is sufficiently low-dimensional, then this can be checked by plotting prior means and variances and samples from the prior.
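A minimal sketch of that kernel-algebra point (my own toy kernels, nothing from the course code): build a new kernel by combining a smooth component and a periodic component, then check it by drawing samples from the GP prior.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def periodic(x1, x2, period=1.0, lengthscale=0.5):
    """Standard periodic kernel."""
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return np.exp(-2.0 * np.sin(d)**2 / lengthscale**2)

def combined(x1, x2):
    # Scaled sums (and products) of kernels are kernels again, so this is a valid covariance.
    return 2.0 * rbf(x1, x2, lengthscale=3.0) + 0.5 * periodic(x1, x2, period=1.0)

x = np.linspace(0.0, 10.0, 200)
K = combined(x, x) + 1e-8 * np.eye(len(x))        # jitter for numerical positive definiteness
rng = np.random.default_rng(3)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                              # three prior draws; plot these to sanity-check the model
```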
But then you're typically also left with a bunch of parameters in these models. Parameters here could be length scales of kernels, or the individual strengths of individual components in sums of kernels. And I pointed out, in the lecture on how to build Gaussian process models, that such parameters can be learned in Gaussian models, and that's actually a first instance of a quite generic framework which sits maybe at the boundary between probabilistic/Bayesian inference and more statistical analysis, namely marginal likelihood estimation, or type-II maximum likelihood estimation. This uses the
insight that in Bayes' theorem the normalization constant, the evidence, this thing down here, can be thought of, and that's actually what the term evidence is supposed to highlight, as a likelihood, not for the latent quantity that we're trying to compute, which is up here, but a likelihood of higher order, for the model, the generative model over the latent quantity and the data, so in this case for theta. And just like we can use the first-order likelihood and its prior to compute a posterior over f, we can use this likelihood in a mechanism to estimate theta.
Typically this likelihood will not be of Gaussian form, and it might not even be tractable in general models, but for Gaussians it is. We can then try to find maximum likelihood estimates, or maximum a posteriori estimates by multiplying with a prior for theta; for Gaussians this happens to have analytic form. One of the properties of Gaussian distributions is that the evidence, this distribution over y given only the model, when we marginalize out f, the unknown function or the unknown weights, is itself of analytic form. Analytic form here means it's the expression of a Gaussian probability density function for y, evaluated at some mean and some covariance that are functions of theta. So as a function of theta, if you think of this as a likelihood, it's not a Gaussian function, in contrast to the weights in the linear model, but it's a function that we can analytically evaluate for any value of theta, and that means we can optimize for it.
So we can write down, in particular, the logarithm of this probability distribution, which is this expression; it has a very specific form. It involves a weighted square error and then a complexity penalty, something that rises with the determinant of the covariance matrix, which is a measure of entropy, so roughly how much volume is in your space of hypotheses, how many hypotheses you're entertaining. And this expression, as a function of theta, we can implement, today at least, in automatic differentiation frameworks, feed into an optimizer, and then optimize. If our model for the function in here and in here contains features that we can parameterize in the form of a differentiable program, like a deep neural network, then this is actually one way of learning a deep neural network, and you can tell that it's slightly different from the standard way of learning deep neural networks, which only involves this term but not this term. For that we just need to be able to get this log determinant.
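As a hedged sketch of that type-II maximum likelihood step (my own minimal version, with a single RBF length scale and a noise level as hyperparameters): the negative log evidence is exactly a weighted square error plus the log-determinant complexity term, and we can hand it to a generic optimizer. In practice you would use an autodiff framework for the gradients; here the optimizer falls back on numerical ones.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    lengthscale, noise = np.exp(log_params)                    # optimize in log space for positivity
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / lengthscale**2)
    Ky = K + noise * np.eye(len(x))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ alpha                                 # weighted square error term
    complexity = np.sum(np.log(np.diag(L)))                    # 0.5 * log det(Ky) from the Cholesky factor
    return data_fit + complexity + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 30))
y = np.sin(x) + 0.1 * rng.normal(size=30)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.1]), args=(x, y))
print(np.exp(result.x))                                        # learned length scale and noise variance
```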
So that was the generic framework for learning functions. And then one of the things we discovered is that this framework, at least if you implement it using numerical linear algebra rather than stochastic gradient descent, tends to be expensive, because the size of this matrix in here grows linearly with the number of data: if y is of length n, then this matrix is of size n by n. And that means, at least in the worst case, if you want to build a matrix decomposition of the direct kind I mentioned before, the cost of constructing this solution will grow cubically with the number of data points. That's in contrast to the parametric form, this one, where there is at least one way of writing the problem that constructs a matrix of constant shape, whose size is given by the number of features we are considering. Each entry of that matrix is a sum over the data points, so building it grows with the number of data, but the inversion problem itself stays constant in size, cubic in the number of features. So this is a reason why people used to argue that Gaussian process models don't quite work for very large data sets. This has changed a little bit with the advent of better numerical linear algebra. There are now tools like GPyTorch that allow Gaussian process inference for up to a million or so data points. Maybe not for internet-sized data, but already quite large.
Nevertheless, there are some data sets that are just even larger. In fact, there are some data sets that are effectively infinitely large. That's the setting of online learning, where you're sitting in front of a data source that continuously spits data points at the algorithm. For such settings, which are also known as time series, we need algorithms whose computational complexity is constant per datum, which means it's linear in the size of the data set. And it turns out that the Gaussian linear case actually has a special variant designed for such settings, which also builds the foundation for inference in time series. Such models are called Markov chains, or Markovian models. They make the assumption that the dynamics of the data set can be described in terms of some state space that contains variables x
with the following property: being a state space means that everything that happens in the future is conditionally independent of the past if you know the state. That's what this Markov property says up here. And then, if you additionally make the assumption that the observations at any point in time depend only on the local state of the system, that's this additional observation model, then we found, through some quite tedious derivation, that in such settings there is a generic algorithmic structure, for which we don't have to make further assumptions about the shape of the probability distributions, that provides a local predictive update. So you can locally predict the future from the past. That's called the Chapman-Kolmogorov equation. It has this particular form; it's effectively a recursion.
Then there's a local update using Bayes' theorem, and those two together are the algorithm for continuous prediction in time series. And there's a special setting, which is sometimes confusing because, like I also did, it tends to be taught alongside the prediction-update loop: if you happen to have a finite data set of some long length, then once you're at the end of the data set and there's no more future to come, you might want to start considering what happened in the past and reconsider your old predictions in light of what happened later. That involves a process of passing backwards through time, which is called smoothing. Smoothing really only makes sense if the data set is finite, because otherwise we end up with a quadratic, a superlinear, cost in time, right? Because as we get more data points, we have to send messages backwards through time to ever more data points, and that doesn't scale.
>> But the forward prediction is always...
>> Yes. So the question is, why don't I do this in a windowed fashion? Yes, you can absolutely do that. If you have a tail of the last B observations that you drag behind you, you can always do a smoothing pass limited to those B. So you say, I'm dragging the last B observations behind me, for which I want to have smoothed estimates. And then, if you want to be a mathematician about it, you could say that's like making the state space larger, and then you're again just filtering and not smoothing; you're increasing the size of the state space by B, basically. The point of smoothing is that afterwards you get a consistent posterior that amounts to a joint distribution over all the states, and that you don't get in a windowed fashion, right? That sort of breaks this idea.
So, yeah, here's another slide that doesn't help much; it just shows that these forward-backward loops exist, and that's again an abstract statement. We actually also discovered that this abstract structure has a very interesting, efficient structure to it. This update here, actually the composition of these two, first this, then this, if you write it as one equation, happens to have the property of an associative scan, that's the technical term for it. And that means that on parallel hardware, if you have a finitely long data set, the cost of this can actually be dropped to logarithmic complexity in the length of the data set, the time series, assuming you have an unbounded number of compute threads available. But that was just an interesting side observation, because this is a relatively new observation that is only beginning to be used in implementations.
The final thing I want to say about these Markov models is that, of course, these are again abstract equations, so in some sense not particularly concrete. But if we again make Gaussian linear assumptions, so if we assume that the three distributions that we need to define the Markov chain, the initial distribution, the predictive distribution, and the observation model, are all linear Gaussian, so they're all of Gaussian form and the relationships between variables involve linear maps, and the names of these variables have their canonical meaning, then
we arrive at a tractable algorithm that is the foundation of signal processing, called the Kalman filter, and the associated smoother, called the Rauch-Tung-Striebel smoother, and they have very classic forms. They can be written like this for the smoother and like this for the filter. This is not the only way to write them; there are other ways, for example in terms of the inverse covariance, which is called information form, or the update here can be written in a slightly different way, but they all end up being the same algorithm. And that class of algorithms can be used in a very general set of applications involving things changing across time, which includes the modeling of dynamical systems.
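Here's a minimal Kalman filter sketch in code (my own toy example with assumed A, Q, H, R; the smoother is omitted): the predict step is the Chapman-Kolmogorov equation specialized to the linear-Gaussian case, and the update step is Bayes' theorem with the Gaussian observation model.

```python
import numpy as np

def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Filtered means for x_t = A x_{t-1} + q, y_t = H x_t + r (q, r Gaussian)."""
    m, P, means = m0, P0, []
    for y in ys:
        # Predict (Chapman-Kolmogorov for the linear-Gaussian case).
        m, P = A @ m, A @ P @ A.T + Q
        # Update (Bayes' theorem).
        S = H @ P @ H.T + R                      # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        m = m + K @ (y - H @ m)
        P = P - K @ S @ K.T
        means.append(m)
    return np.array(means)

# Toy constant-velocity model: state (position, velocity), noisy position observations.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
rng = np.random.default_rng(5)
truth = np.cumsum(0.5 * dt * np.ones(100))       # a slowly drifting position
ys = truth[:, None] + 0.5 * rng.normal(size=(100, 1))
print(kalman_filter(ys, A, Q, H, R, m0=np.zeros(2), P0=np.eye(2))[-1])
```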
If the relationships between such variables across time are not Gaussian and not linear, then what people in the signal processing community, the original foundations of control theory and rocket science and so on, which are very well studied, largely tend to do is to linearize in some form or another, so to project back from the nonlinear shape onto an instance of the Kalman filter and smoother, where we now basically create maps from the nonlinear model to these particular parameters, to A and Q and H and R. This, as I pointed out, interestingly includes a large class of solvers, simulation methods for dynamical systems driven by ordinary and partial differential equations.
So solvers for ODEs can actually be thought of as Kalman filters. The connection is particularly precise for one class of ODE solvers, the so-called Nordsieck methods, but actually the majority of ODE solvers are, in some form, sometimes more, sometimes less directly, instances of Kalman filters, including Runge-Kutta methods and other methods that you may have encountered in numerical analysis classes. That's useful because it allows us to jointly combine information from empirical data, observations y, with information from mechanisms, knowing that the system follows a differential equation.
Now we're halfway through the term and I'm also halfway through the lecture, so now I could take a break, unless... No, there's no question. So this was the Gaussian part of the course, and after the break we're going to swap to the approximate, non-Gaussian part of the course. So this was, roughly, the first half of the term and of the course. I remember that back then it felt to me like you were getting a bit antsy about all this Gaussian stuff, like, when are we going to move away from Gaussians? I understand this. I also hope that now, in hindsight, now that we are at the end of term, you may understand why I hammered on so much about Gaussians. The reason is that, as an underlying structure mapping inference to linear algebra, they stayed with us for the rest of the time, and whenever we did something non-Gaussian, we ended up somehow mapping it back to Gaussians,
because then things become tractable again. And we started this journey very explicitly trying to expand away from Gaussians, asking: knowing that we can't allow completely general probability distributions over continuous spaces, because they give intractable posteriors, is there something larger, a more powerful, more general class than just the Gaussians, the parameterized family of distributions that have a mean and a covariance, that still allows us to do something tractable? And we arrived at the concept of an exponential family.
An exponential family is a family of probability distributions that can be parameterized by a set of parameters w. It has this form. This is, at first, again a bit confusing, something you have to stare at for a while, but actually every expression in here has a very specific role. This is a probability distribution that is the exponential, which gives us the name, of an inner product between something that only depends on x and the parameters w, so this part is clearly linear in w. The things that are nonlinear in x and only evaluate x are called the sufficient statistics. And then there's a function that ensures that this expression, when you integrate over x, is equal to one: a normalization constant, a log normalization constant, called the log partition function, because it's not actually a constant in terms of w, it's a function of w, but it is constant in terms of x. Plus a base measure, something that doesn't involve w but just gets added on, literally. So: sufficient statistics, natural parameters, log partition function, and carrier measure or base measure are the names for these parts. So the first thing, for which I don't have a slide, that we discovered is that many of the classic distributions you will have encountered in a statistics class, the Bernoulli, the beta, the Dirichlet, the gamma, the exponential, the Laplace distribution, the Gumbel distribution, whatever, all of these distributions that you have encountered in stats classes are of this form, including the Gaussian actually, and they just have different sufficient statistics and associated natural parameters. So that's maybe interesting.
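As a small concrete instance (my own example, not a slide): the Bernoulli distribution written in this exponential-family form, with sufficient statistic phi(x) = x, natural parameter w = log(p/(1-p)), and log partition function A(w) = log(1 + e^w).

```python
import numpy as np

def bernoulli_pmf_expfam(x, w):
    """Bernoulli in exponential-family form: exp(phi(x) * w - A(w)), with phi(x) = x."""
    log_partition = np.log1p(np.exp(w))       # A(w) = log(1 + e^w)
    return np.exp(x * w - log_partition)      # base measure contributes nothing here

p = 0.3
w = np.log(p / (1 - p))                       # natural parameter: the log-odds
for x in (0, 1):
    assert np.isclose(bernoulli_pmf_expfam(x, w), p if x == 1 else 1 - p)
```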
But why is this? Well, in a sort of backhanded way we discovered that this particular algebraic structure engenders some really useful functionality. The first one is that all exponential families have a conjugate prior. So they allow the construction of a closure, in the computer science sense: for an exponential family over the variable x given parameters w, there is another distribution over w, with new parameters alpha and mu, which is also an exponential family, it just has different, very specific sufficient statistics and a potentially unknown log partition function, such that when you multiply these two together as a function of w, which is what you do in Bayesian inference, you get another function of w that has the same algebraic form. So it's another member of this exponential family. So making observations, the process of Bayesian inference, is an operation under which this exponential family is closed, when the Bayesian inference uses this likelihood.
And in particular we can also make predictions about future data points, by marginalizing over the data using the structure of exponential families. We discovered that what we have to do is effectively evaluate a difference between two log partition functions of the conjugate prior. So that means every exponential family endows a conjugate prior that spans a space of probability distributions that is closed under observations, with the original exponential family as the likelihood. That's powerful, because it means there is some kind of structured space of distributions in which we can do efficient inference. And what does efficient mean? It means that we just add up sufficient statistics and count how many observations we've seen. That's the operation we need to do, and that's also why the term sufficient statistics arises: that's what we need to do to do inference. The only things we need to store are the number of observations and their sufficient statistics, which is a constant-memory operation.
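Here's that "add sufficient statistics and count" recipe as a sketch (Beta-Bernoulli as an assumed example): the conjugate Beta prior is updated by adding the summed sufficient statistic (the number of ones) and the number of observations.

```python
import numpy as np

def beta_bernoulli_update(alpha, beta, data):
    """Conjugate update: add up sufficient statistics and count observations."""
    suff_stat = np.sum(data)            # sum of phi(x) = x over the data
    n = len(data)
    return alpha + suff_stat, beta + (n - suff_stat)

rng = np.random.default_rng(6)
data = rng.binomial(1, 0.7, size=200)                  # i.i.d. Bernoulli(0.7) observations
alpha, beta = beta_bernoulli_update(1.0, 1.0, data)    # start from a flat Beta(1, 1) prior
print(alpha / (alpha + beta))                          # posterior mean, close to 0.7
```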
So, can't the sufficient statistics be arbitrarily complicated? Well, the sufficient statistics are identified by the original likelihood, right? So they're in here, and they remain as complicated as we choose them in the original likelihood. If this is a Gaussian, for example, with the Gaussian as the exponential family, then this phi of x is x and x x-transpose, so the empirical mean and empirical covariance, if you like, up to normalization, and that's the stuff you need to store. If it's a categorical distribution, then the conjugate prior is a Dirichlet, and phi of x is literally just x. So that is easy, but of course you can think of other exponential families for which this might be harder, and in fact we encountered some in deep learning later on, in the instance of linear models where this object is completely intractable.
So in addition to full Bayesian inference as a closure in exponential families, we get even more. We get all sorts of properties of the distribution that are commonly of interest in information theory, information geometry, and statistics: expected values of the sufficient statistics, the Fisher information matrix, the entropy of the distribution, the cross-entropy between two members of the exponential family, Kullback-Leibler divergences between two members of a family, the score, the expected score, and so on. We get all of these in terms of derivatives, sometimes of higher order, of the log partition function, up to the base measure, and the base measure, the carrier measure, might be zero and then it doesn't matter at all. So these things here on the left tend to be integrals; they tend to involve expected values, but we can evaluate them without doing integrals, using automatic differentiation of the log partition function. So I sort of quipped back then
that if you have an exponential family, you could write a generator for the Wikipedia article that populates all the fields of the Wikipedia article for your probability distribution by using these properties. Basically, symbolic differentiation of the log partition function tells you everything you need to know. And this may historically be the reason these distributions came about: the people who invented them needed to do Bayesian inference, whether they called it that or not, or they needed to do likelihood-based inference, or at least MAP estimation, and then to quantify confidence using all of these properties. And so they ended up with probability distributions in which they could do these computations efficiently, using derivatives of the log partition function.
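A small sketch of that derivative trick (Bernoulli again as the assumed example; finite differences stand in for the automatic or symbolic differentiation mentioned above): the first derivative of the log partition function gives the expected sufficient statistic, and the second gives its variance, which for a one-parameter family is also the Fisher information.

```python
import numpy as np

def log_partition(w):
    return np.log1p(np.exp(w))          # A(w) for the Bernoulli family

def derivative(f, w, order=1, h=1e-4):
    """Central finite differences as a stand-in for autodiff."""
    if order == 1:
        return (f(w + h) - f(w - h)) / (2 * h)
    return (f(w + h) - 2 * f(w) + f(w - h)) / h**2

w = 0.8
p = 1 / (1 + np.exp(-w))                                # mean parameter corresponding to w
print(derivative(log_partition, w, 1), p)               # A'(w)  = E[phi(x)] = p
print(derivative(log_partition, w, 2), p * (1 - p))     # A''(w) = Var[phi(x)] = Fisher information
```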
But that's only true if we have the log partition function. A positive way of reading this is: if someone gives you one integral, the partition function, which is effectively an integral, then you can get everything here for free. But the other way around, you could also say that to get all of this, you need to be able to integrate over the sufficient statistics. So what happens if we encounter models, probability distributions of some form that can be thought of as linear in some parameters but nonlinear in the inputs, such that this object, which is given by, let me go back again, the logarithm of this integral, is not tractable? Such a situation arises,
for example, in supervised machine learning, if we do classification or regression with a generalized linear model. So we have inputs x, of which we evaluate some features that we have either manually chosen or that an optimizer picked for us by parameterizing these features and then following some gradient descent or other optimization routine. So we get a general set of features here, and then we can't compute this normalization constant. For that we needed our first instance of an approximate inference method, and the one that I proposed, which is maybe the most daring but also the fastest and maybe most general approach, is the Laplace approximation.
So that's the very first tool in your toolbox for approximate inference, and it is based on a differential analysis of a probability distribution. By differential I mean that it's a local description of the distribution in terms of a point and its geometry, gradients, Hessians, Jacobians evaluated at that point. And this works as follows. Consider any model where you find an estimate of the parameters through some form of empirical risk minimization, which might or might not involve a regularizer. Here's the equation that I used for the deep learning lecture, which is an instance of empirical risk minimization. But we could think of the very same situation for exponential families: we take the logarithm of an exponential family, there will be a phi of x and then a w in here, so now theta equals w, and this regularizer might amount to the conjugate prior, which is another function of w that is here, and now we need to find a posterior
over theta. We first realize that typically this minimization problem that we write down here is actually a log posterior. For deep learning, for example, the loss functions that people tend to use, the L2 loss, or the log loss, cross-entropy, are all instances of logarithms of likelihoods, categorical likelihoods, Gaussian likelihoods, and the regularizers that people tend to use, like L2, are logarithms of other distributions, for example Gaussians. And so we can think of this expression as an unnormalized log posterior distribution.
And now, to get an approximation to the associated posterior, we find a mode of this loss, a minimizer or a maximizer, depending on whether we put a minus in front or not, to find a theta star. That's our point estimate. And then, at this point, we do a second-order approximation of the loss. I should have moved the slide a second earlier; this is actually what I just said. So consider a loss which you know to be a log posterior, or maybe a negative log posterior. Find a minimizer of this loss; a minimizer is a point where the gradient is zero. At this point, do a Taylor expansion to second order. We assume that we can do that because the loss is sufficiently regular; that's the only constraint we actually have. We have to assume that this thing exists, that we have found a point where the gradient is zero, so there is a mode, and at that point we can evaluate this object, the second derivative of the loss, the Hessian in general.
Then if we look at this expression, we see that it's a quadratic form. There's no linear term, because the gradient is zero, and there's a constant term in front. We have deliberately dropped the higher-order terms; that's a decision we take. And then, since we know that this expression on the left-hand side is the logarithm of a posterior, that means the posterior is the exponential of, sorry, minus this expression.
So if we do the same thing on the right-hand side, if we take the exponential of this, we get a constant times e to the minus one half times the quadratic form, and that we can identify with a Gaussian distribution centered at the mode with, okay, yeah, I copied that over, that's not good, so this is obviously the inverse of this, the inverse of the Hessian, as the covariance. And now there we go: we get a Gaussian distribution purely using differentiation. There are no integrals being solved, no random numbers being drawn, no fanciful partial evaluation of some integral as we're going to have in a second, but instead just automatic differentiation paired with optimization and linear algebra, which are tools that are commonly available across machine learning and differentiable programming, including deep learning.
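A minimal Laplace-approximation sketch (my own one-dimensional toy posterior, with finite differences for the Hessian instead of autodiff): find the mode of a negative log posterior with an optimizer, evaluate the second derivative there, and the Gaussian approximation is N(mode, Hessian^{-1}).

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(theta, data):
    """Toy unnormalized negative log posterior: Bernoulli likelihood on the logit + Gaussian prior."""
    log_lik = np.sum(data * theta - np.log1p(np.exp(theta)))
    log_prior = -0.5 * theta**2 / 4.0
    return -(log_lik + log_prior)

rng = np.random.default_rng(7)
data = rng.binomial(1, 0.8, size=50)

result = minimize(lambda t: neg_log_posterior(t[0], data), x0=np.array([0.0]))
mode = result.x[0]                                  # the MAP point estimate

h = 1e-4                                            # finite-difference Hessian at the mode
hess = (neg_log_posterior(mode + h, data) - 2 * neg_log_posterior(mode, data)
        + neg_log_posterior(mode - h, data)) / h**2

print(mode, 1.0 / hess)                             # Laplace approximation: N(mode, Hessian^{-1})
```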
So Laplace approximations are a nearly universal tool to construct a Gaussian posterior on anything, including on exponential families, but also on entire deep neural networks. And then I did one or two or three lectures on how to apply this mechanism to deep neural networks, as maybe the most direct bridge from the classic theory of probabilistic machine learning to the more modern, or currently more prominent, more hyped, dimension of differentiable programming, deep learning. So you can actually endow any deep neural network with an associated Gaussian process posterior. You take the deep neural network and first train it in any way you like; that's step number one. When you do that, you minimize an empirical risk, typically a regularized empirical risk, but you don't even necessarily need a regularizer; you just find a mode of the loss. Then, at the mode, we make the mental assumption that we can linearize the network: the deep neural network f, which is a bivariate function of inputs and weights, is approximated as an evaluation of that function with a fixed set of parameters, namely the trained weights, which is still a function of the inputs, that's your trained deep neural network, plus its Jacobian with respect to the weights times a residual to the trained weights, which we can think of as a feature map created by the network, computed using automatic differentiation. That's just a conceptual step; we don't actually implement anything yet. But we realize that this first function we already have, it's the trained network, and the second function we can easily get in a deep learning framework, because it's just the thing you need to do gradient descent, the backward pass, basically.
Then you do a Laplace approximation in the way I just described. If you do this, then actually this Hessian simplifies a bit; it gets a more restrictive structure that's nasty to write down and also sometimes nasty to implement, but it's easier to handle than the full Hessian. And the Laplace approximation gives us a Gaussian distribution over the parameters. So now the last thing we need to do is to marginalize out this Gaussian distribution on the parameter space, on the weights, using the linearization assumption for the network, and these four steps together tell us that we actually have a Gaussian process posterior associated with our trained neural network.
neural network. I still find this a very interesting perspective
interesting perspective is like precisely because it is so
is like precisely because it is so simple. So there are other ways of doing
simple. So there are other ways of doing adding uncertainty to deep learning that
adding uncertainty to deep learning that tend to be more costly and less direct.
tend to be more costly and less direct. This is a very maybe it's a perhaps a uh
This is a very maybe it's a perhaps a uh uh somewhat daring thing to do because
uh somewhat daring thing to do because it's a very local approximation, right?
it's a very local approximation, right? might just be a very bad approximation
might just be a very bad approximation to the posterior because it only takes
to the posterior because it only takes the local shape into account. But it
the local shape into account. But it provides us with very tractable and very
provides us with very tractable and very interesting structures that I recommend
interesting structures that I recommend anyone doing deep learning should have
anyone doing deep learning should have at the back of their mind just to
at the back of their mind just to analyze their model. If you train a deep
analyze their model. If you train a deep neural network, you can think of a
neural network, you can think of a gausian process that you've you have
gausian process that you've you have constructed that has a mean function
constructed that has a mean function given by the network you just trained
given by the network you just trained and a co-variance function a kernel that
and a co-variance function a kernel that involves these green objects which are
involves these green objects which are the feature functions of your network
the feature functions of your network the jacobians at the trained point and
the jacobians at the trained point and this red thing this inverse covariance
this red thing this inverse covariance inverse hessen inverse curvature matrix
inverse hessen inverse curvature matrix um of the loss at theta R which is the
um of the loss at theta R which is the the bit of this framework that brings
the bit of this framework that brings the data into the uncertainty
the data into the uncertainty quantification because it contains the
quantification because it contains the loss on the training data. So these are
loss on the training data. So these are data structures that you can think of as
data structures that you can think of as representing the data set. So the
representing the data set. So the knowledge extracted from the data set
knowledge extracted from the data set its relationship to the function we are
its relationship to the function we are operating on that's these features and
operating on that's these features and then centered on the predictions that
then centered on the predictions that the network already does. And I did a
the network already does. And I did a whole lecture on like trying to argue
whole lecture on like trying to argue that this kind of functionality is
that this kind of functionality is practically useful that you can use it
practically useful that you can use it to do things like retrain parameter
to do things like retrain parameter efficient fine-tuning um uh
efficient fine-tuning um uh visualization of uncertainty
visualization of uncertainty uh
uh explainable machine learning and so on
explainable machine learning and so on and so on.
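To make this construction concrete, here is a minimal Python sketch (my illustration, not code from the lecture) using the JAX library for the automatic differentiation. It assumes a tiny one-hidden-layer regression network with a flat parameter vector, a squared loss, and a Gauss-Newton-plus-prior matrix standing in for the curvature of the loss, so it illustrates the formula rather than a production Laplace implementation.

```python
# Minimal sketch of the linearized-Laplace Gaussian-process view of a trained
# network (illustrative assumptions: tiny 1-hidden-layer regression net,
# squared loss, Gauss-Newton + prior precision in place of the full Hessian).
import jax
import jax.numpy as jnp

D_IN, HID = 1, 16
N_PARAMS = D_IN * HID + HID + HID + 1          # W1, b1, w2, b2 flattened


def net(w, x):
    """Scalar-output MLP f(w, x); w is a flat parameter vector."""
    W1 = w[: D_IN * HID].reshape(HID, D_IN)
    b1 = w[D_IN * HID: D_IN * HID + HID]
    w2 = w[D_IN * HID + HID: D_IN * HID + 2 * HID]
    b2 = w[-1]
    h = jnp.tanh(W1 @ x + b1)
    return w2 @ h + b2


# Pretend these are the weights found by ordinary (MAP / empirical-risk) training.
key = jax.random.PRNGKey(0)
w_star = 0.1 * jax.random.normal(key, (N_PARAMS,))

X_train = jnp.linspace(-2.0, 2.0, 20).reshape(-1, 1)

def jac(x):
    """Feature map: Jacobian of the network output w.r.t. the weights at w_star."""
    return jax.grad(net, argnums=0)(w_star, x)            # shape (N_PARAMS,)

J_train = jax.vmap(jac)(X_train)                          # (N, N_PARAMS)

# Curvature of the loss at w_star: Gauss-Newton term J^T J (squared loss, unit
# noise) plus a prior precision (weight decay) to ensure positive definiteness.
prior_precision = 1.0
H_loss = J_train.T @ J_train + prior_precision * jnp.eye(N_PARAMS)
H_inv = jnp.linalg.inv(H_loss)

def gp_mean(x):
    return net(w_star, x)                                  # the trained network

def gp_kernel(x, x2):
    return jac(x) @ H_inv @ jac(x2)                        # J(x) H^{-1} J(x')^T

x_test = jnp.array([0.5])
print("mean:", gp_mean(x_test), "variance:", gp_kernel(x_test, x_test))
```

The GP mean is just the trained network, and the kernel is the green-red-green sandwich described above: Jacobian, inverse curvature, Jacobian.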
Okay, now we're getting towards the more final parts of the lecture, but there's a question first.
Ah, you're asking a very good question. So for this to work, this matrix has to be positive definite, and you're right that in general this Hessian may not be positive definite. But if we do this linearization, then this red matrix actually is positive definite by construction. That's a nice aspect of it: it means at least you know that the thing you get will be a Gaussian process and it won't be badly behaved. Of course it might miss some important bits because of the linearization; we're basically dropping some higher-order terms in f. And in fact it may well be an interesting thing to study in the future, because what does it mean to have a Hessian that's not positive definite? It means you're not actually at a minimum, right? There's a direction along which you can change your network, and either you don't change anything, so the loss is invariant in some direction, which I think is a very commonly known aspect of deep learning, or in fact you're at a saddle point, and then there's a direction you can take along which the loss will go down, and it might be good to know about that, right? So yeah, totally good point, but I think it's not a problem of the probabilistic interpretation; it's just a general aspect of deep learning, that we're posing non-convex optimization problems and then we inherit problems like this.
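The standard way to see why the linearized curvature is positive definite is the generalized Gauss-Newton form; this is my reconstruction of the argument, with my own notation for the per-example loss Hessian.

```latex
% Under the linearization, the curvature of the loss takes a Gauss-Newton form
% (reconstruction; notation mine). If the per-example loss is convex in the
% network output, each \Lambda_n \succeq 0, and adding a prior precision \tau I
% makes the whole matrix strictly positive definite:
H \;\approx\; \sum_{n} J_{w_*}(x_n)^{\top}\, \Lambda_n\, J_{w_*}(x_n) \;+\; \tau I \;\succ\; 0.
```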
Okay, so that was the uncertainty quantification for deep learning part. Now I'm going to quickly do the final few things, which were admittedly a bit of a ragtag collection of things that I felt I have to do in a probabilistic machine learning class.
The first one was sampling methods. Sampling methods are algorithms that use random numbers. I started by pointing out that we often need expected values, in Bayesian inference and in statistical inference more generally. Such expected values can be approximated using what's called the Monte Carlo estimate, which is a sum over finitely many evaluations of the function whose expectation we're trying to compute, using evaluation points that are "iid" samples from the distribution p.
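As a toy illustration of this estimator (my example, not the lecture's), here is a Monte Carlo approximation of an expectation, run with a fixed random seed of the kind mentioned just below:

```python
# Minimal Monte Carlo illustration (my example, not the lecture's):
# approximate E_p[f(x)] for p = N(0, 1) and f(x) = x^2 (true value 1.0)
# by averaging f over "iid" pseudo-random samples, with a fixed seed so the
# run is reproducible.
import numpy as np

rng = np.random.default_rng(0)          # fixed random key -> deterministic output

def f(x):
    return x ** 2

for n in [10, 100, 1_000, 10_000, 100_000]:
    samples = rng.standard_normal(n)    # x_i ~ p(x) = N(0, 1)
    estimate = f(samples).mean()        # (1/N) sum_i f(x_i)
    print(f"N={n:>6d}  estimate={estimate:.4f}  error={abs(estimate - 1.0):.4f}")
```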
And I put "iid samples" in quotation marks because I did a big spiel for a whole 90-minute lecture, which confused some of you and which some of you kind of liked, to point out that this idea of iid randomness is quite flawed as a concept. In fact, it relies on things that fundamentally cannot be computed, which are then approximated with things that aren't actually random, and it's all a bit weird. Nevertheless, the use of these pseudo-random numbers has become very common in machine learning and we do it all the time, and actually we tend to do it in a totally deterministic fashion, using fixed random keys so that we can run the code several times and it produces the same plot every time. So there isn't actually any randomness, but whatever.
And this philosophical problem also carries over into the analysis of these methods. There is a classic argument that Monte Carlo estimates are in some sense optimal because they achieve the optimal convergence rate of an unbiased estimator, the Cramér-Rao bound. However, that bound is fundamentally already using the idea of randomness. It turns out that, at least for some problems, there are estimates, which you could call biased if you like, although that's not really a concept that makes much sense for a deterministic quantity, that converge much, much faster. Nevertheless, I don't want you to think that Monte Carlo methods are completely stupid or shouldn't be used. I'm using sampling methods in some of my own work as well; they're just sometimes very convenient. Maybe in your head you can think of Monte Carlo methods as algorithms that are the first thing you might want to try, but the last thing you should actually do. By which I mean that they can be very easy to implement. Often they just need access to an unnormalized probability density function, which is often easy to evaluate: for example, if you can just multiply some prior with some likelihood, no matter what the likelihood and the prior are, you have these objects, and then you can just call these methods and let them run, and they tend to provide samples that converge reasonably quickly. They're easy to implement in particular for quite unstructured problems where you just don't know what else to do. We encountered several such methods: first of all the basic method that just uses direct samples from a distribution, which is sometimes called simple Monte Carlo, and then algorithms that are more approximate, like rejection sampling, importance sampling, and all sorts of Markov chain Monte Carlo methods, like the basic Metropolis-Hastings scheme, Gibbs sampling, Hamiltonian Monte Carlo and so on, which are all increasingly complicated approximations that only asymptotically actually draw iid from the distribution. Nevertheless, these are easy to implement, and they quickly provide a bunch of numbers that give a rough estimate, with one over square root of the number of samples convergence, for the thing you're trying to estimate. And if that's what you're going for, that's good. If you want a highly performant algorithm that can run on large data sets or on many, many instances of the same problem, something else might be a better idea.
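To show how little such a method needs, here is a bare-bones random-walk Metropolis-Hastings sampler (my sketch, not code from the lecture); it only touches the target through an unnormalized log-density, exactly as described above, and its samples are only asymptotically distributed like the target.

```python
# Bare-bones random-walk Metropolis-Hastings (illustrative sketch, not the
# lecture's code). It needs nothing but an unnormalized log-density, e.g.
# log prior + log likelihood, and returns correlated samples that are only
# asymptotically distributed like the target.
import numpy as np

def log_target(x):
    # Unnormalized log-density: a two-component Gaussian mixture in 1D.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(log_p, x0, n_steps, step_size=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, logp_x = x0, log_p(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal()   # symmetric proposal
        logp_prop = log_p(proposal)
        if np.log(rng.uniform()) < logp_prop - logp_x:      # accept/reject
            x, logp_x = proposal, logp_prop
        samples.append(x)
    return np.array(samples)

chain = metropolis_hastings(log_target, x0=0.0, n_steps=20_000)
print("posterior mean estimate:", chain[5_000:].mean())    # discard burn-in
```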
Talking about something else: because I was talking about random numbers already, I then had a lecture on generative models, on diffusion and flow matching models, which are currently the state of the art for the generation of natural images; probably everyone has used one of them already. So we were interested in how they work, and one way to approach them, as I argued back then, is from the sampling perspective. We may discover that sampling from a complicated distribution over some variable x0 is hard, even for very good sampling methods like advanced Markov chain Monte Carlo methods, where complicated might mean, but doesn't have to mean, that the distribution has islands, that it only has a few points on which it has non-zero mass. It might also mean that it spans some kind of manifold that is very slim within the space parameterized by the x you would like to sample from.
In such situations, one operation that is quite interesting, and that is not part of classic Markov chain Monte Carlo methods, is to take the distribution over x0 and convolve it with a Gaussian. That means you create a new random variable y which is the old random variable x0 plus Gaussian noise. When we do that, the new variable y tends to have a distribution that is easier, because it looks a bit more Gaussian. It's easier to draw from; it's a bit wider, a bit less isolated into islands. And what we can do is let our Monte Carlo method operate in this noisier, more regular space, and then take a step and estimate what the corresponding x0 is by trying to invert the noising process, by denoising the sample. Now, we notice that this is a bit of a zero-sum game, because the more noise you add, the easier you make the distribution to draw from, but the harder you make the estimation process. So the really cool idea is to say: maybe we don't do just one noisy step, but we take a continuum, or at least a large number of small noise steps, that go from the original sample to something that can be considered basically a standard Gaussian distribution, and then, instead of doing a single estimation step backwards, we do a whole long sequential denoising process, because then every individual estimation step might actually work quite well, since the local noise is very small. To do that, we need access to a model that can basically guess the noise, and we discovered that this is related to estimating the score of the noising distribution at any point in time t. That can be learned, for example, with some deep neural network, which I showed a very simple example of, and of course now there are very elaborate such models that give really powerful sampling algorithms, which can sample from basically the manifold of natural images, even conditioned on some word description of what should be in the image. There's a continuous limit of this in which the noise process is a stochastic differential equation, and the backward denoising process follows the score of the probability distribution. It can also be phrased in terms of an ordinary differential equation without noise, where you basically just draw samples once and then the dynamics of the entire set of samples follows an ODE called the probability flow, and models that use this ODE are called flow matching models. These are now maybe the state of the art, at least at the time of speaking, for the generation of samples from complicated distributions.
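Here is the shape of that idea in a toy Python sketch (mine, not from the lecture). The data distribution is deliberately just a Gaussian, so the score of every noised marginal is known in closed form and stands in for the learned denoising network; the reverse loop then takes many small steps that follow that score.

```python
# Toy diffusion sketch (my illustration, not the lecture's code). The data
# distribution is N(3, 0.25), so the score of each noised marginal is available
# in closed form; in a real diffusion model a neural network trained by
# denoising score matching would replace `score` below.
import numpy as np

rng = np.random.default_rng(0)
T = 200
alphas = np.linspace(0.999, 0.95, T)       # per-step signal retention
alpha_bar = np.cumprod(alphas)             # total signal kept after t+1 steps

mu0, var0 = 3.0, 0.25                      # the target we pretend is "hard"

def score(x, t):
    # Score of the noised marginal p_t = N(sqrt(abar_t)*mu0, abar_t*var0 + 1 - abar_t),
    # i.e. the distribution of sqrt(abar_t)*x0 + sqrt(1 - abar_t)*noise.
    mean_t = np.sqrt(alpha_bar[t]) * mu0
    var_t = alpha_bar[t] * var0 + (1.0 - alpha_bar[t])
    return -(x - mean_t) / var_t

# Reverse process: start from (roughly) a standard Gaussian and take many small
# denoising steps, each one a reverse (ancestral) step expressed via the score.
x = rng.standard_normal(5000)
for t in reversed(range(T)):
    beta = 1.0 - alphas[t]
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = (x + beta * score(x, t)) / np.sqrt(alphas[t]) + np.sqrt(beta) * noise

print(f"denoised mean ~ {x.mean():.3f} (target 3.0), std ~ {x.std():.3f} (target 0.5)")
```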
Okay, that was the generative modeling part. Are there questions on that?
>> [inaudible]
Say again?
>> Shouldn't... is the verb wrong? It says it becomes harder as alpha increases, in the second point.
>> Ah, as alpha... yes, decreases, I'm sorry, you're right. So as alpha decreases we get closer to a standard Gaussian, right: this becomes a one and this becomes a zero. Yeah, you're right.
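For reference, the formula this exchange refers to is presumably the standard noising step (my reconstruction; the slide's symbols may differ): as alpha decreases, the noise coefficient "becomes a one" and the signal coefficient "becomes a zero", so y approaches a standard Gaussian.

```latex
% Standard Gaussian noising step (reconstruction; notation mine):
y \;=\; \sqrt{\alpha}\, x_0 \;+\; \sqrt{1-\alpha}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I),
% so as \alpha \to 0, \sqrt{1-\alpha} \to 1 and \sqrt{\alpha} \to 0,
% and y tends to a standard Gaussian.
```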
Okay. So now, in the very final three lectures of the term, I introduced another concept for approximate inference which is historically very important and continues to be used today in many applications, often for analytical purposes, but has become maybe less prominent than it was a while ago. So I'm myself also a bit uncertain about how much I should teach about it; I ended up investing two and a half lectures in it. We started getting to this framework, called variational inference, by first studying an even more classic algorithm called EM. EM is a quite specific algorithm for finding maximum likelihood estimates in cases where the likelihood of the data itself, the marginal likelihood, is hard to evaluate, but can be made easy to evaluate if we add latent variables zed. This object without the integral around it, just the part inside, is called the complete data log likelihood. For example, we saw that this shows up in Gaussian mixture models, or in mixture models in general; in mixture models this zed tends to be the variable that tells us which data point belongs to which mixture component, and then everything becomes easily tractable.
The algorithm consists of two steps that get iterated. In the first one, we set a distribution over zed to the posterior over zed given x at a particular point in the parameter space. We can think of this, as a very important look ahead to variational inference, as constructing an approximation to the posterior at the current value of theta, which happens to have the property that it minimizes the KL divergence to the posterior at that point in the parameter space. Then, in a second step, the so-called maximization step, we change the parameters. So in the E step, the estimation step, we estimate the latent variables zed, and then in the maximization step we change theta, the parameters, to maximize this expression, which is the expected complete data log likelihood, an expectation under that posterior of the logarithm of the complete data likelihood, plus a constant which is the entropy of Q. And we found that this object which we maximize is called the evidence lower bound, because it is a lower bound on the evidence. The evidence is this thing, and the difference between the bound and the evidence is a positive number, the KL divergence between our approximation and the true posterior. So by setting Q, locally at a particular theta, to this posterior, we're making that KL term zero, the bound becomes tight, and when we then maximize the bound over theta we also have to increase the evidence. And so we found an algorithm for maximizing the likelihood without ever actually evaluating that likelihood, which is kind of cool.
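Written out, the decomposition this argument rests on is the usual evidence identity (my reconstruction in standard notation):

```latex
% Evidence decomposition underlying EM (standard identity, reconstructed from
% the description above):
\log p(x \mid \theta)
  \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathrm{ELBO}(q,\,\theta)}
  \;+\; \underbrace{\mathrm{KL}\!\left(q(z)\,\middle\|\,p(z \mid x, \theta)\right)}_{\geq\, 0}.
% E step: set q(z) = p(z | x, \theta), so the KL term vanishes and the bound is tight.
% M step: maximize the ELBO over \theta, which then also increases log p(x | \theta).
```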
So now the next step, the final step towards variational inference, is to say: huh, maybe I can get away without setting this bound to be exactly tight. So without making the KL divergence zero, but just by trying to increase this object, and thereby minimizing the KL term, because the evidence is a constant, at least in terms of zed, right? So when we try to maximize this, we will find a q of zed that gets as close as possible to the posterior, and we can think of that as an approximate inference method for finding distributions that are good approximations to the posterior. These are called variational inference methods. Why? Because they vary an object Q that is a function. They are optimization algorithms that operate in spaces of functions, actually in spaces of probability distributions Q, rather than optimizing in zed to find modes of distributions and then describing them locally in terms of zed, like Laplace approximations did.
But how do you specify the space of distributions over which you want to optimize? Well, an easy thing is that you could decide that Q of zed should be some parametric family, for example a Gaussian. Then you can just optimize the ELBO directly; maybe you have to do some of the integrals in there using Monte Carlo. That's called fixed-form variational inference. But there's actually a less restrictive form that is often surprisingly possible to do, which only imposes a little bit of algebraic structure on Q, and this is called the mean field assumption. We assume that our Q of zed, which we want to be a good approximation to the true posterior on zed, has to factorize along some set of variables. If we do that, then it turns out that the evidence lower bound, the ELBO, as a function of one particular of these variables, zed j, can be written as an expected value of the complete data log likelihood under expectations over the other variables. And because this is a function of zed j, even if we don't know how to do the integral, we tend to be able to see from the functional form of the generative model what the functional form of our approximation should be. And if we do that iteratively across all the variables, we are often able to do the individual expectations, because they become, for example, exponential families for which we know how to do the expectations. That leads to iterative algorithms that can sometimes be very efficient, algorithmically, at constructing distributions of high quality for approximate inference.
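The coordinate update this refers to takes the standard mean-field form (my reconstruction, in the usual notation):

```latex
% Mean-field variational inference with q(z) = \prod_j q_j(z_j). Maximizing the
% ELBO over a single factor, holding the others fixed, gives the standard
% coordinate-ascent update (reconstruction of the result described above):
\log q_j^{*}(z_j) \;=\; \mathbb{E}_{q_{-j}}\!\left[\log p(x, z \mid \theta)\right] \;+\; \mathrm{const},
% where the expectation is over all factors except q_j; iterating over j yields
% the efficient algorithms mentioned in the text.
```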
Doing that requires writing down a lot of equations on a piece of paper, doing some analysis ourselves, which is maybe a last holdout of human labor in the times of AI. And then on Tuesday I told you that this kind of framework, if it's applied to mixture models, can be thought of as having something to do, loosely, with attention, mostly just to motivate you to have a look.
With that I'm through the term, but I'm not completely done; I have about two minutes left, and I'd like to summarize on a very high level what I was trying to get across this time. For many of you this is one of the two or three key lectures in your master's degree in AI, machine learning and computer science. I wanted to endow you with a basic understanding of the relationships between model classes, of the functionality that is available in machine learning, and of how to think about it, not so much just about implementation, although of course we did that as well. I wanted to convince you that probability theory as a mathematical framework provides the foundation for the design of machine learning methods, because it formalizes the process of extracting information from all sorts of, not just data, but all sorts of information: that might be empirical data, but it may also be mechanistic knowledge, like knowing that there's a differential equation that drives a system, or computational information, like knowing that you have evaluated the gradient of a vector field a bunch of times and now you know a little bit about the vector field, but not the entire thing.
And because the mechanism of probabilistic inference is fundamentally intractable, it requires us always to do approximations. Someone asked last week, isn't it a problem that we always have to do these approximations? Everything always ends up Gaussian in your lecture, why can we even trust this? Well, it turns out that everyone in science has been doing this for at least the last two centuries, including Newton and Einstein and other people who probably knew better than most, because you don't get around making approximations if you want to write down something you can actually work with, either on a piece of paper or on a computer. So all of this, including modern machine learning, is a form of approximation; the probabilistic perspective just allows us to reflect on it better.
And of course there are powerful mechanisms, like the ones we discussed today and over the course of the term, that actually work, that you can actually apply, and they allow you either to build real tools, like the Gaussian process models we built, or the exponential family models, or of course the diffusion models. But sometimes they also help us get insights, like this story about deep learning, this differential geometry perspective on deep neural networks that allows us to identify new functionality for existing models. For example, in deep learning there are classes of methods that were invented in a way that had nothing to do with Bayesian inference, but now that we think about them as Bayesian models, we discover that we can use these insights to create new functionality, like continual learning, for example.
That's what I wanted to say at the end. Now I'm one minute over time. If you have more questions, you can also come to the front. I'm grateful that you were here all the time, and I'm looking forward to seeing many of you on Monday at 8:00 a.m. for the exam. Thank you very much.