YouTube Transcript: Probabilistic ML - 25 - Revision
Summary
Core Theme
This content is a comprehensive recap of a probabilistic machine learning course, emphasizing the foundational role of probability theory, Gaussian distributions, and various approximation techniques for building and understanding complex models.
Transcript
So what I have prepared now for the next, well, let's see how long it takes, is no new content obviously, but a quick summary of everything we've gone through, at least the highlights, the main ideas, not the highlights from a novel-ideas perspective but the foundational stuff that I've covered over the last 24 lectures. And I'll try and go through that. But you can also stop me at any point in time with questions, right? So I don't even mind if we don't make it through all of the slides, something like twenty-odd. But that's okay.
So let me begin. We started in lecture one with the laws of probability, which all of you have encountered in previous lectures before. I've already written them down here on this slide, copied over from lecture number two, where I introduced probability densities as representations of probabilities over continuous spaces, as the natural object that takes on the same structure we also have in discrete probabilities, namely the two main rules of probability theory, the sum rule and the product rule. The sum rule is the rule for getting rid of something you don't know, and the product rule is the rule for conditioning on something you do know and then talking about the stuff you don't know. The two of them taken together amount to Bayes' theorem, which is the fundamental rule for inference, everywhere, not just in science but
literally everywhere. By the way, when I walked in here I saw a sheet from my colleague Martin Butz, who works on cognitive architectures. It's the cheat sheet he hands out to his students, the Formelsammlung, and it actually contains Bayes' theorem as well, and variations of it. So cognitive science is clearly also an application field for continuous probability distributions. We have to be a bit careful, though. There's this minor thing that somewhat annoyingly complicates life: if you rescale the domain over which you assign a probability, then you have to make sure the density is correctly adjusted by multiplying with the determinant of the Jacobian of the transformation.
Okay, so that's the fundamental rule. That's just a philosophical statement about how to reason in a situation where your information sources, your data, do not fully provide the knowledge about the variable x that you'd like to know.
But unfortunately these abstract statements, which are essentially manipulations of functions or functionals, are effectively intractable in the general case. If every term in here is a general function, some general probability distribution that just sums to one over z and over x, then this object here, well, you can multiply it pointwise, but even the integral in the normalization constant is difficult to evaluate. One way to think about this is that if you have a discrete probability distribution, then the number of possible states you have to consider here is all the possible combinations of these discrete variables. Say they are binary random variables; then the number of configurations to consider grows as two to the number of variables. So in this case, if x and z are both binary, we already have two to the two, four, and if you had three, four, five, six variables, then the complexity of this inference process would grow exponentially with the dimensionality of the problem. And so the entire course from lecture three onwards was all about making things tractable, reducing computational complexity so that we don't have to consider the combinatorial set of possible hypotheses, which is obviously not possible for continuous spaces in any case.
And the key tool for this was Gaussian distributions. If all variables in a generative model are jointly Gaussian distributed and linearly related to each other, then inference maps to linear algebra. And I wrote down these initially cryptic equations, which by now I hope you have developed an eye for. So if you have a prior distribution over a latent quantity x and observe data y with a Gaussian likelihood that links the data linearly to x, then both the evidence and the posterior are closed-form expressions.
Gaussians are parameterized by a mean and a covariance, a vector and a matrix. And this vector and matrix, this point estimate and uncertainty estimate, have an interesting structure that, though it's a long formula, amounts to linear algebra. You have to build some matrices and then solve a linear system of equations, which we tend to write as an inverse of a matrix multiplied with a vector. But as we learned, you should really think of this joint object, inverse matrix times vector, as another vector that solves a linear system of equations.
And we learned how to interpret these quantities. This object here, this Gram matrix, is the expected variance of y under the prior, or rather of y minus the mean prediction. This is what you thought you would see, your best guess for what you're going to see; this is what you did see; and this is the residual between the two. This is the scale on which you expect that residual to vary. So this times this gives us some kind of innovation, how surprised we are about the data we got to see, and then we project, push this towards the variable that we actually care about, because this quantity is the covariance between this object and x. Similarly, the posterior covariance also has this update form that looks a bit like a Schur complement. There is an initial covariance, the prior covariance, and then a correction which involves the same matrix inverse that's in here as well. And this means that adding uncertainty in Gaussian models doesn't increase computational cost, because you can reuse the computational effort used to find the mean estimate. That's a first insight: sometimes being probabilistic, having uncertainty, doesn't actually have to come with a cost overhead.
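A minimal sketch of that linear-Gaussian update in code (my own toy example, not the course library): prior x ~ N(m, P), likelihood y = A x + noise with covariance R. The same Cholesky factorization of the Gram matrix A P Aᵀ + R is reused for both the posterior mean and the posterior covariance, which is the "uncertainty is almost free" point.

```python
import numpy as np

def gaussian_condition(m, P, A, R, y):
    """Posterior over x ~ N(m, P) after observing y = A x + eps, eps ~ N(0, R)."""
    gram = A @ P @ A.T + R                      # expected covariance of y under the prior
    L = np.linalg.cholesky(gram)                # one decomposition, reused below
    residual = y - A @ m                        # innovation: what we saw minus what we expected
    # Solve gram @ z = residual and gram @ W = A P, both with the same factor L.
    z = np.linalg.solve(L.T, np.linalg.solve(L, residual))
    W = np.linalg.solve(L.T, np.linalg.solve(L, A @ P))
    mean = m + P @ A.T @ z                      # posterior mean
    cov = P - P @ A.T @ W                       # posterior covariance (Schur-complement-like update)
    return mean, cov

# Tiny example: 2-d latent variable, one noisy scalar observation.
m, P = np.zeros(2), np.eye(2)
A, R = np.array([[1.0, 0.5]]), np.array([[0.1]])
print(gaussian_condition(m, P, A, R, y=np.array([1.2])))
```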
We also noticed, well, I'm going to talk about that in a second. In the next lecture I then talked at quite some length, I made you sit through a long lecture about linear algebra, the tools we use to solve problems of this form, and we discovered that there's actually nothing new hidden underneath this linear problem, because it's just more instances of Gaussian inference.
The path to that was to discover, or mentally do the exercise, that the underlying structure is that a model like this is of an exchangeable nature, in the sense that I can start with the first entry in y, the very first datum, condition on it, and then, because of this equation, I get back a new Gaussian that I can condition on the next datum, one after the other. So Gaussian inference is a form of closure: you start with a Gaussian, you get a Gaussian data point, you get back a Gaussian. That means we can split up our data set into individual terms, scalar observations, one after the other. And if you do that, then these inverses here are just scalar inverses, and we never need to call some fancy numerical linear algebra code, at least in principle. It's still probably a good idea to do so if you have a decent-sized data set. But in the corner cases of very large data sets, and sometimes also very small data sets, it can be useful to realize that the algorithms we tend to use to solve such linear systems of equations, something like this, so an algorithm that consists of a matrix decomposition first and then returns a solver-like object, a function that we can call on right-hand sides, algorithms like this are just methods that do Gaussian inference in a sequential fashion.
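Here's a small sketch of the scalar, one-datum-at-a-time view described just above (my own illustration with an assumed linear-Gaussian setup): processing one scalar observation at a time, the only "inverse" that ever appears is a division by a scalar.

```python
import numpy as np

def sequential_gaussian_updates(m, P, A, sigma2, y):
    """Condition x ~ N(m, P) on scalar observations y[i] = A[i] @ x + noise, one at a time."""
    for a_i, y_i in zip(A, y):                   # a_i is one row of A, y_i one scalar datum
        s = a_i @ P @ a_i + sigma2               # scalar predictive variance of this datum
        gain = P @ a_i / s                       # the only "inverse" is a scalar division
        m = m + gain * (y_i - a_i @ m)           # update the mean with the scalar innovation
        P = P - np.outer(gain, a_i @ P)          # update the covariance
    return m, P

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))                     # 50 scalar observations of a 3-d latent x
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.1 * rng.normal(size=50)
m, P = sequential_gaussian_updates(np.zeros(3), np.eye(3), A, 0.1**2, y)
print(m)                                         # should be close to x_true
```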
For example, this is true for the QR decomposition, for the LU decomposition, for the Cholesky decomposition, for conjugate gradients, for preconditioned conjugate gradients, for Lanczos algorithms, for the decomposition of matrices into SVDs, and so on and so on. All of these algorithms iteratively project, that's this policy up here, the rows or columns of our matrix A, or A transpose, onto the current posterior of a Gaussian distribution to construct a set of conjugate directions, using the Gram-Schmidt process or modifications thereof, to basically build a data structure in memory. That data structure is either numerically stable, in the case of methods like the QR decomposition, LU and Cholesky, where the policy is chosen such that everything aligns neatly and there is very little overlap between the directions, or the directions are chosen such that they provide maximal information about some right-hand side b, or about the solution for some right-hand side b, as in the case of conjugate gradients or Lanczos iterations, at the cost of numerical stability but with an increased gain in convergence. Why was this useful? Well,
one direct, tangible outcome of this is that we can actually think of linear algebra methods as a form of data loader in the Gaussian case. They allow us to pick informative views onto the data set. And that might mean that we don't have to load the entire data set. You can stop the computation after some reasonably short computational effort compared to solving the full problem, and provide an uncertainty as an output of the method that actually takes into account the fact that we've only loaded parts of the data, rather than guessing what the uncertainty should be.
So over the course of these lectures, one, two, three, four, actually no, three, four, and then also five, we kept building a software library on the side, which I don't have on the slides, that implements this Gaussian mechanism in an abstract data class, effectively, with a bunch of methods. And I did this on purpose, to highlight that Gaussians come with all this functionality: as soon as you know something is Gaussian, you can apply all of these mechanisms directly. Now, there's your question.
>> Is that directly related to the fact that you can stop early, or are those two separate benefits?
>> Huh?
>> Okay, so that's a good question. The question is: I said something about being able to stop early and about thinking of these methods as a data loader, and how are the two related to each other. I would say, I mean, the reason to stop early is that you think you've learned enough, right? So maybe the formal statement is that any of these solvers of this form can be stopped at any point in time, including after zero iterations. This is, for example, also true for the QR decomposition or the Cholesky decomposition,
and this is maybe useful to point out on its own, because in numerical math lectures you will typically hear about methods like the Cholesky decomposition and the QR decomposition as so-called direct algorithms. So people will say that these are methods you have to call, and then you have to stand back and wait for a cubic amount of time, and then they'll give back the true solution. But in fact you can stop these methods early. It's just that they tend to have, in the average case, very slow convergence. There isn't really a reason to expect them to converge quickly, and that's probably why people tend to wait for them to
complete. In fact, there's a theorem, if you really want to read up on this (it's unrelated to the exam, though), in the appendix of the Hestenes and Stiefel paper from 1952 which introduces the conjugate gradient method. The inventors of the conjugate gradient method pointed out that conjugate gradients were necessary because these methods converge slowly, and they have a worst-case construction showing that if you run one of these direct methods, like QR or LU or Cholesky decompositions, then you can pick the linear system of equations such that the error of the mean that comes out of the method increases in every single step, gets worse and worse all the time until the very end, and then in the very last step everything converges and you get a perfectly correct answer. That's not true for the more iterative methods like conjugate gradients and Lanczos-based SVDs and so on.
And so people tend to think about these as different things, or actually maybe as connected things: if you have methods that tend not to be good for a long time, then why run them only for a short amount of time? But if you think of these methods as quantifying uncertainty, then it becomes more natural to think of them as stoppable at any time, because what's going to happen in a situation where the algorithm just doesn't learn anything from all the iterations is that the uncertainty won't collapse either. You'll just have a big error bar that stays pretty much constant for a very long time and then suddenly collapses to zero.
Yeah. So if you have a data loader that is useless, that always loads the least informative data, then maybe it's not a good policy to stop early. But if you happen to have a data loader that is very efficient at loading the right data, then you might want to stop early, and conjugate gradients for linear Gaussian problems are an example of a very efficient data loader.
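A tiny sketch of that stoppable-solver idea (my own code, not the course library): a plain conjugate gradient loop on a symmetric positive definite system that you can cut off after any number of iterations; the residual that is left over measures how much of the right-hand side the "loader" has not explained yet.

```python
import numpy as np

def conjugate_gradients(A, b, max_iters):
    """Plain CG for A x = b with A symmetric positive definite; stoppable at any iteration."""
    x = np.zeros_like(b)
    r = b - A @ x                        # residual: the part of b not yet explained
    d = r.copy()                         # first search direction
    for _ in range(max_iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # step length along the current conjugate direction
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # next conjugate direction (Gram-Schmidt-like update)
        r = r_new
        if np.linalg.norm(r) < 1e-10:
            break
    return x, np.linalg.norm(r)

rng = np.random.default_rng(1)
M = rng.normal(size=(100, 100))
A = M @ M.T + 100 * np.eye(100)          # well-conditioned SPD test matrix
b = rng.normal(size=100)
for k in (1, 5, 20):
    x_k, res = conjugate_gradients(A, b, k)
    print(k, res)                        # residual shrinks as more informative "data" is loaded
```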
So what do we do with these tools? Together with the software tools that we developed, they became the foundational framework for maybe the most important task in machine learning even to this day, at least as an abstract task, namely learning functions: regression. We initially looked at real-valued functions,
where we began with a traditional, classic perspective on parametric regression. So the assumption is we have a supervised machine learning problem: there will be pairs of inputs x and outputs y. We make the assumption that the function that links x to y has a parametric form; it is given by some features of x, a function phi evaluated on the data x, weighted with some unknown weight vector w. That's a generic formalism. This phi of x we could also think of as a deep neural network, or we could just think of it as a bunch of features you pick by hand.
Then, in such a situation, the key observation is that the weights enter linearly in this problem. This may well be a very nonlinear function of x, but it's a linear function of w, and that means we can directly, without changes, apply this Gaussian formalism. We put a Gaussian prior over the weights, which, by the properties of Gaussians, also implies a Gaussian prior over the function values. Then we use a Gaussian likelihood for the observations. So those are the two key assumptions: Gaussian and linear relationship. Then we can look up everything that was on the previous slides and find that the posterior over the weights given the data is a Gaussian, which involves these linear algebra expressions for which we can use these...
>> Yes?
>> That's actually a very good observation. It's a standard assumption in regression to say that the noise on each datum is always the same and that the data are independent of each other. Technically, for what's here on the slide, this is not necessary. You could have a joint, complicated covariance matrix here: say there are n data points, it could be a full n by n symmetric positive definite matrix. Such a model has the fancy name of a heteroscedastic problem, one that has different noise on every datum, where in fact the data points are related to each other.
In textbooks you tend to see this assumption with scalar noise. The reason for this, I think, is, well, there are certainly applications of regression where you actually have to model different noise on every datum, but there are very few reasons why you might computationally have to use a structure like this. It's still very common, though. Maybe one first observation is that if we had data with a joint, full-rank, non-diagonal covariance, then we could use a singular value decomposition of the covariance matrix to project our data set y onto a new data set with independent observations.
Those would then still have a diagonal covariance, but we could rescale the observations so that they have a scalar covariance. So in some sense this is not a strict restriction, because you can take any data set and remap it such that it has this form, but then it might lose a lot of the physical interpretation, right? It's not necessarily a measurement anymore. The other thing is that in statistics, not in probability theory but in statistics, the i.i.d. assumption is very fundamental. The assumption that you get a lot of data of the same type, i.i.d. from some source, is important for all sorts of theoretical analysis, to show convergence rates, the stuff you heard about across the corridor. And so this amounts to making the assumption that all the data are distributed independently of each other, that's what the first "i" stands for, and that they have the same, identical distribution.
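Going back to the regression model itself, here's a compact sketch of that weight-space view (my own toy features, scalar i.i.d. noise as on the slide): a Gaussian prior on w, a Gaussian likelihood, and the posterior obtained by linear algebra.

```python
import numpy as np

def features(x):
    """Hand-picked feature map phi(x); any functions of x would do here."""
    return np.stack([np.ones_like(x), x, x**2, np.sin(x)], axis=-1)

def bayes_linear_regression(x, y, sigma2=0.1, prior_var=10.0):
    """Posterior N(mu, S) over weights w for y = phi(x) @ w + Gaussian noise."""
    Phi = features(x)                                     # (n, F) feature matrix
    F = Phi.shape[1]
    S_inv = np.eye(F) / prior_var + Phi.T @ Phi / sigma2  # posterior precision
    S = np.linalg.inv(S_inv)                              # posterior covariance (F x F, constant in n)
    mu = S @ (Phi.T @ y) / sigma2                         # posterior mean
    return mu, S

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=40)
y = 0.5 * x**2 + np.sin(x) + 0.3 * rng.normal(size=40)
mu, S = bayes_linear_regression(x, y)
print(mu)                                                 # weight estimates, with uncertainty in S
```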
Okay. So if we do this, then we get a framework with which we can learn functions. And we spoke about these models for quite a while. I highlighted that you're very free to pick the features here. This feature set phi can be literally anything you want, including functions that are discontinuous, even including functions that are technically unbounded. You could even have a framework where some of these features return NaNs or infinities, and then you have to be a bit careful about how you implement the inference, but it's still possible to do this correctly.
And that's simultaneously maybe a blessing and a curse. It means that if you really know something about your problem, you can build very powerful, very well-designed algorithms. For example, I know of applications, some of them published, in major companies where engineers invested a large amount of time finding good features. For a while, a common example that people talked about in the community, a decade or so ago, was when Facebook became big. They hired several people out of another company, for example Joaquin Quiñonero Candela, who was originally in Tübingen as a postdoc and then became a head of AI at Facebook, and he built a massive tool for ad placement on Facebook, which some of you may have suffered through because you were the subject of it, right, if you were ever on Facebook. It's the algorithm that decided which ads to show, and to get that to work he had a big team of engineers, and they built a software platform that allowed the engineers to trial features. So they
could provide a new example feature phi, something like, I don't know, "I'm looking for customers who recently announced their wedding on Facebook", some way of posting that, because that probably means you're going to buy some very special stuff, or something about babies, or about changing your relationship status, and so on. And they had a tool for people to trial these features: they could get something like 1% of the population in an A/B test setup, and then they would continuously compute the evidence, where is it, it's going to be on the next slide or the one after, but it's basically this expression up here, where the A's now include the features, to check how well this works as a predictive performance tool for whether people would click on some ad, and if a feature worked well the system would automatically load it and add it in. That was pre-deep-learning, but it was an approach that scaled to very large user bases.
You could do that with a few hundred million people, because you get to manually design this kind of process. But of course it's also sort of disappointing as a designer, if you want to build a learning machine, something that automatically decides how best to extract information from a data set, to then have to sit down and manually decide how the machine should do that. And so one of the main trends in machine learning, basically including deep learning, is the attempt to automate this process in various ways. And over several lectures we discussed how to automate that process.
A first very powerful, and historically for a while very successful, approach was to ask what happens if we increase the number of features towards some asymptotic limit of infinitely many features. In some sense we have to be very careful, of course, about how we do this limit process of infinitely many features. We did this and discovered that there is this interesting structure, it's on the previous slide, that to compute these predictive distributions, at least if we think about them as function values, there will never be a lonely phi in the expression. If you make predictions not about the w's but about the function values, then you get a phi Sigma phi expression here, and up here as well phi Sigma phi. They are always inner products, and inner products are sums, and some sums even work if they have infinitely many terms, because they have analytic expressions that don't require you to actually go through the sum. You already know this from introductory math lectures about series.
And that led to the idea of kernel machines, algorithms which replace these inner products with functions that can directly be thought of as evaluations of inner products in, in general, infinite-dimensional feature spaces. And the associated notion in probability theory is that of a Gaussian process, a nonparametric model that in some sense has an infinite number of degrees of freedom that it can fit to a data set.
So Gaussian processes, I removed the code here, from a computational perspective can be thought of as a lazy evaluation of a Gaussian model. You write a piece of code that basically waits until it is given a data set, and once it's given this finite data set, it constructs a finite matrix, a kernel Gram matrix, that's this matrix here, or actually this whole thing, solves this finite linear system of equations, and then stores the resulting values, which are a vector of finite length, and the matrix decomposition of this covariance, some matrix decomposition of finite size, which might be for example a Cholesky or something else, updated using the algorithms that we've encountered before. It then provides access to a function, a kernel and a mean function over here, which can basically act as a closure for a new Gaussian process.
We studied these algorithms for quite a while and discovered that they have very interesting properties. For example, we did an example of constructing Gaussian process predictions on a relatively simple but structured one-dimensional data set, the Keeling curve of atmospheric CO2. We discovered that you can actually use different kernels. There is some kind of standard set of kernels from which people tend to choose: things like periodic kernels, the Matérn family of kernels, which includes as a limit case the squared exponential, a very common kernel that is in some sense quite deeply flawed, but also the Wiener process, the integrated Wiener process, and so on. And actually there is more than just this finite collection of kernels, because you can construct new kernels from old ones through four properties of kernels: scaling a kernel gives back a kernel; transforming the inputs of a kernel in an arbitrary fashion, as long as you apply the same transformation to both inputs, gives back a kernel; adding kernels gives a kernel; and multiplying kernels gives a kernel. So they form, if you like, a semiring of bivariate functions that are all positive definite, and these properties can be used to build maybe more flexible models than we had in the parametric form, but again a relatively expressive language in which designers and engineers can encode knowledge about the specific problems they are trying to address. I find it very important to point this property out, because people who try out Gaussian process models for the very first time on a particular data set tend to think of them as some kind of black box. The standard setting is that someone gives you some data set in a professional environment, so you open up some SciPy, or scikit-learn sorry, implementation of Gaussian process regression that uses some standard kernel, don't think about it, and then it doesn't work particularly well, because the standard kernel, which is usually a Gaussian kernel, is not a good choice. And then people say, oh, so I tried a Gaussian process, it didn't quite work so well, so I did something else, and then they spend a lot more time trying to fit some large deep learning model, and because they get invested in it they then make it work. So methods like this still work well on many data sets if you are careful to design them right. A part of this process of designing them right is to pick smart kernels that describe what you are trying to do. If the data set is sufficiently low-dimensional, then this can be checked by plotting prior means and variances and samples from the prior.
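A minimal sketch of that kernel-algebra point (my own toy kernels, nothing from the course code): build a new kernel by combining a smooth component and a periodic component, then check it by drawing samples from the GP prior.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def periodic(x1, x2, period=1.0, lengthscale=0.5):
    """Standard periodic kernel."""
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return np.exp(-2.0 * np.sin(d)**2 / lengthscale**2)

def combined(x1, x2):
    # Scaled sums (and products) of kernels are kernels again, so this is a valid covariance.
    return 2.0 * rbf(x1, x2, lengthscale=3.0) + 0.5 * periodic(x1, x2, period=1.0)

x = np.linspace(0.0, 10.0, 200)
K = combined(x, x) + 1e-8 * np.eye(len(x))        # jitter for numerical positive definiteness
rng = np.random.default_rng(3)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                              # three prior draws; plot these to sanity-check the model
```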
But then you're typically also left with a bunch of parameters in these models. Parameters here could be length scales of kernels, or the individual strengths of individual components in sums of kernels. And I pointed out, in the lecture on how to build Gaussian process models, that such parameters can be learned in Gaussian models, and that's actually a first instance of a quite generic framework which sits maybe at the boundary between probabilistic/Bayesian inference and more statistical analysis, namely marginal likelihood estimation, or type-II maximum likelihood estimation. This uses the
insight that in Bayes' theorem the normalization constant, the evidence, this thing down here, can be thought of, and that's actually what the term evidence is supposed to highlight, as a likelihood, not for the latent quantity that we're trying to compute, which is up here, but a likelihood of higher order, for the model, the generative model over the latent quantity and the data, so in this case for theta. And just like we can use the first-order likelihood and its prior to compute a posterior over f, we can use this likelihood in a mechanism to estimate theta.
Typically this likelihood will not be of Gaussian form, and it might not even be tractable in general models, but for Gaussians it is. We can then try to find maximum likelihood estimates, or maximum a posteriori estimates by multiplying with a prior for theta; for Gaussians this happens to have analytic form. One of the properties of Gaussian distributions is that the evidence, this distribution over y given only the model, when we marginalize out f, the unknown function or the unknown weights, is itself of analytic form. Analytic form here means it's the expression of a Gaussian probability density function for y, evaluated at some mean and some covariance that are functions of theta. So as a function of theta, if you think of this as a likelihood, it's not a Gaussian function, in contrast to the weights in the linear model, but it's a function that we can analytically evaluate for any value of theta, and that means we can optimize for it.
So we can write down, in particular, the logarithm of this probability distribution, which is this expression; it has a very specific form. It involves a weighted square error and then a complexity penalty, something that rises with the determinant of the covariance matrix, which is a measure of entropy, so roughly how much volume is in your space of hypotheses, how many hypotheses you're entertaining. And this expression, as a function of theta, we can implement, today at least, in automatic differentiation frameworks, feed into an optimizer, and then optimize. If our model for the function in here and in here contains features that we can parameterize in the form of a differentiable program, like a deep neural network, then this is actually one way of learning a deep neural network, and you can tell that it's slightly different from the standard way of learning deep neural networks, which only involves this term but not this term. For that we just need to be able to get this log determinant.
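As a hedged sketch of that type-II maximum likelihood step (my own minimal version, with a single RBF length scale and a noise level as hyperparameters): the negative log evidence is exactly a weighted square error plus the log-determinant complexity term, and we can hand it to a generic optimizer. In practice you would use an autodiff framework for the gradients; here the optimizer falls back on numerical ones.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    lengthscale, noise = np.exp(log_params)                    # optimize in log space for positivity
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / lengthscale**2)
    Ky = K + noise * np.eye(len(x))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ alpha                                 # weighted square error term
    complexity = np.sum(np.log(np.diag(L)))                    # 0.5 * log det(Ky) from the Cholesky factor
    return data_fit + complexity + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 30))
y = np.sin(x) + 0.1 * rng.normal(size=30)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.1]), args=(x, y))
print(np.exp(result.x))                                        # learned length scale and noise variance
```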
So that was the generic framework for learning functions. And then one of the things we discovered is that this framework, at least if you implement it using numerical linear algebra rather than stochastic gradient descent, tends to be expensive, because the size of this matrix in here grows linearly with the number of data: if y is of length n, then this matrix is of size n by n. And that means, at least in the worst case, if you want to build a matrix decomposition of the direct kind I mentioned before, the cost of constructing this solution will grow cubically with the number of data points. That's in contrast to the parametric form, this one, where there is at least one way of writing the problem that constructs a matrix of constant shape, whose size is given by the number of features we are considering. Each entry of that matrix is a sum over the data points, so building it grows with the number of data, but the inversion problem itself stays constant in size, cubic in the number of features. So this is a reason why people used to argue that Gaussian process models don't quite work for very large data sets. This has changed a little bit with the advent of better numerical linear algebra. There are now tools like GPyTorch that allow Gaussian process inference for up to a million or so data points. Maybe not for internet-sized data, but already quite large.
Nevertheless, there are some data sets that are just even larger. In fact, there are some data sets that are effectively infinitely large. That's the setting of online learning, where you're sitting in front of a data source that continuously spits data points at the algorithm. For such settings, which are also known as time series, we need algorithms whose computational complexity is constant per datum, which means it's linear in the size of the data set. And it turns out that the Gaussian linear case actually has a special variant designed for such settings, which also builds the foundation for inference in time series. Such models are called Markov chains, or Markovian models. They make the assumption that the dynamics of the data set can be described in terms of some state space that contains variables x
with the following property: being a state space means that everything that happens in the future is conditionally independent of the past if you know the state. That's what this Markov property says up here. And then, if you additionally make the assumption that the observations at any point in time depend only on the local state of the system, that's this additional observation model, then we found, through some quite tedious derivation, that in such settings there is a generic algorithmic structure, for which we don't have to make further assumptions about the shape of the probability distributions, that provides a local predictive update. So you can locally predict the future from the past. That's called the Chapman-Kolmogorov equation. It has this particular form; it's effectively a recursion.
Then there's a local update using Bayes' theorem, and those two together are the algorithm for continuous prediction in time series. And there's a special setting, which is sometimes confusing because, like I also did, it tends to be taught alongside the prediction-update loop: if you happen to have a finite data set of some long length, then once you're at the end of the data set and there's no more future to come, you might want to start considering what happened in the past and reconsider your old predictions in light of what happened later. That involves a process of passing backwards through time, which is called smoothing. Smoothing really only makes sense if the data set is finite, because otherwise we end up with a quadratic, a superlinear, cost in time, right? Because as we get more data points, we have to send messages backwards through time to ever more data points, and that doesn't scale.
>> But the forward prediction is always...
>> Yes. So the question is, why don't I do this in a windowed fashion? Yes, you can absolutely do that. If you have a tail of the last B observations that you drag behind you, you can always do a smoothing pass limited to those B. So you say, I'm dragging the last B observations behind me, for which I want to have smoothed estimates. And then, if you want to be a mathematician about it, you could say that's like making the state space larger, and then you're again just filtering and not smoothing; you're increasing the size of the state space by B, basically. The point of smoothing is that afterwards you get a consistent posterior that amounts to a joint distribution over all the states, and that you don't get in a windowed fashion, right? That sort of breaks this idea.
So, yeah, here's another slide that doesn't help much; it just shows that these forward-backward loops exist, and that's again an abstract statement. We actually also discovered that this abstract structure has a very interesting, efficient structure to it. This update here, actually the composition of these two, first this, then this, if you write it as one equation, happens to have the property of an associative scan, that's the technical term for it. And that means that on parallel hardware, if you have a finitely long data set, the cost of this can actually be dropped to logarithmic complexity in the length of the data set, the time series, assuming you have an unbounded number of compute threads available. But that was just an interesting side observation, because this is a relatively new observation that is only beginning to be used in implementations.
The final thing I want to say about these Markov models is that, of course, these are again abstract equations, so in some sense not particularly concrete. But if we again make Gaussian linear assumptions, so if we assume that the three distributions that we need to define the Markov chain, the initial distribution, the predictive distribution, and the observation model, are all linear Gaussian, so they're all of Gaussian form and the relationships between variables involve linear maps, and the names of these variables have their canonical meaning, then
we arrive at a tractable algorithm that is the foundation of signal processing, called the Kalman filter, and the associated smoother, called the Rauch-Tung-Striebel smoother, and they have very classic forms. They can be written like this for the smoother and like this for the filter. This is not the only way to write them; there are other ways, for example in terms of the inverse covariance, which is called information form, or the update here can be written in a slightly different way, but they all end up being the same algorithm. And that class of algorithms can be used in a very general set of applications involving things changing across time, which includes the modeling of dynamical systems.
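Here's a minimal Kalman filter sketch in code (my own toy example with assumed A, Q, H, R; the smoother is omitted): the predict step is the Chapman-Kolmogorov equation specialized to the linear-Gaussian case, and the update step is Bayes' theorem with the Gaussian observation model.

```python
import numpy as np

def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Filtered means for x_t = A x_{t-1} + q, y_t = H x_t + r (q, r Gaussian)."""
    m, P, means = m0, P0, []
    for y in ys:
        # Predict (Chapman-Kolmogorov for the linear-Gaussian case).
        m, P = A @ m, A @ P @ A.T + Q
        # Update (Bayes' theorem).
        S = H @ P @ H.T + R                      # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        m = m + K @ (y - H @ m)
        P = P - K @ S @ K.T
        means.append(m)
    return np.array(means)

# Toy constant-velocity model: state (position, velocity), noisy position observations.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
rng = np.random.default_rng(5)
truth = np.cumsum(0.5 * dt * np.ones(100))       # a slowly drifting position
ys = truth[:, None] + 0.5 * rng.normal(size=(100, 1))
print(kalman_filter(ys, A, Q, H, R, m0=np.zeros(2), P0=np.eye(2))[-1])
```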
If the relationships between such variables across time are not Gaussian and not linear, then what people in the signal processing community, the original foundations of control theory and rocket science and so on, which are very well studied, largely tend to do is to linearize in some form or another, so to project back from the nonlinear shape onto an instance of the Kalman filter and smoother, where we now basically create maps from the nonlinear model to these particular parameters, to A and Q and H and R. This, as I pointed out, interestingly includes a large class of solvers, simulation methods for dynamical systems driven by ordinary and partial differential equations.
So solvers for ODEs can actually be thought of as Kalman filters. The connection is particularly precise for one class of ODE solvers, the so-called Nordsieck methods, but actually the majority of ODE solvers are, in some form, sometimes more, sometimes less directly, instances of Kalman filters, including Runge-Kutta methods and other methods that you may have encountered in numerical analysis classes. That's useful because it allows us to jointly combine information from empirical data, observations y, with information from mechanisms, knowing that the system follows a differential equation.
Now we're halfway through the term and I'm also halfway through the lecture, so now I could take a break, unless... No, there's no question. So this was the Gaussian part of the course, and after the break we're going to swap to the approximate, non-Gaussian part of the course. So this was, roughly, the first half of the term and of the course. I remember that back then it felt to me like you were getting a bit antsy about all this Gaussian stuff, like, when are we going to move away from Gaussians? I understand this. I also hope that now, in hindsight, now that we are at the end of term, you may understand why I hammered on so much about Gaussians. The reason is that, as an underlying structure mapping inference to linear algebra, they stayed with us for the rest of the time, and whenever we did something non-Gaussian, we ended up somehow mapping it back to Gaussians,
because then things become tractable again. And we started this journey very explicitly trying to expand away from Gaussians, asking: knowing that we can't allow completely general probability distributions over continuous spaces, because they give intractable posteriors, is there something larger, a more powerful, more general class than just the Gaussians, the parameterized family of distributions that have a mean and a covariance, that still allows us to do something tractable? And we arrived at the concept of an exponential family.
An exponential family is a family of probability distributions that can be parameterized by a set of parameters w. It has this form. This is, at first, again a bit confusing, something you have to stare at for a while, but actually every expression in here has a very specific role. This is a probability distribution that is the exponential, which gives us the name, of an inner product between something that only depends on x and the parameters w, so this part is clearly linear in w. The things that are nonlinear in x and only evaluate x are called the sufficient statistics. And then there's a function that ensures that this expression, when you integrate over x, is equal to one: a normalization constant, a log normalization constant, called the log partition function, because it's not actually a constant in terms of w, it's a function of w, but it is constant in terms of x. Plus a base measure, something that doesn't involve w but just gets added on, literally. So: sufficient statistics, natural parameters, log partition function, and carrier measure or base measure are the names for these parts. So the first thing, for which I don't have a slide, that we discovered is that many of the classic distributions you will have encountered in a statistics class, the Bernoulli, the beta, the Dirichlet, the gamma, the exponential, the Laplace distribution, the Gumbel distribution, whatever, all of these distributions that you have encountered in stats classes are of this form, including the Gaussian actually, and they just have different sufficient statistics and associated natural parameters. So that's maybe interesting.
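As a small concrete instance (my own example, not a slide): the Bernoulli distribution written in this exponential-family form, with sufficient statistic phi(x) = x, natural parameter w = log(p/(1-p)), and log partition function A(w) = log(1 + e^w).

```python
import numpy as np

def bernoulli_pmf_expfam(x, w):
    """Bernoulli in exponential-family form: exp(phi(x) * w - A(w)), with phi(x) = x."""
    log_partition = np.log1p(np.exp(w))       # A(w) = log(1 + e^w)
    return np.exp(x * w - log_partition)      # base measure contributes nothing here

p = 0.3
w = np.log(p / (1 - p))                       # natural parameter: the log-odds
for x in (0, 1):
    assert np.isclose(bernoulli_pmf_expfam(x, w), p if x == 1 else 1 - p)
```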
But why is this? Well, in a sort of backhanded way we discovered that this particular algebraic structure engenders some really useful functionality. The first one is that all exponential families have a conjugate prior. So they allow the construction of a closure, in the computer science sense: for an exponential family over the variable x given parameters w, there is another distribution over w, with new parameters alpha and mu, which is also an exponential family, it just has different, very specific sufficient statistics and a potentially unknown log partition function, such that when you multiply these two together as a function of w, which is what you do in Bayesian inference, you get another function of w that has the same algebraic form. So it's another member of this exponential family. So making observations, the process of Bayesian inference, is an operation under which this exponential family is closed, when the Bayesian inference uses this likelihood.
And in particular we can also make predictions about future data points, by marginalizing over the data using the structure of exponential families. We discovered that what we have to do is effectively evaluate a difference between two log partition functions of the conjugate prior. So that means every exponential family endows a conjugate prior that spans a space of probability distributions that is closed under observations, with the original exponential family as the likelihood. That's powerful, because it means there is some kind of structured space of distributions in which we can do efficient inference. And what does efficient mean? It means that we just add up sufficient statistics and count how many observations we've seen. That's the operation we need to do, and that's also why the term sufficient statistics arises: that's what we need to do to do inference. The only things we need to store are the number of observations and their sufficient statistics, which is a constant-memory operation.
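Here's that "add sufficient statistics and count" recipe as a sketch (Beta-Bernoulli as an assumed example): the conjugate Beta prior is updated by adding the summed sufficient statistic (the number of ones) and the number of observations.

```python
import numpy as np

def beta_bernoulli_update(alpha, beta, data):
    """Conjugate update: add up sufficient statistics and count observations."""
    suff_stat = np.sum(data)            # sum of phi(x) = x over the data
    n = len(data)
    return alpha + suff_stat, beta + (n - suff_stat)

rng = np.random.default_rng(6)
data = rng.binomial(1, 0.7, size=200)                  # i.i.d. Bernoulli(0.7) observations
alpha, beta = beta_bernoulli_update(1.0, 1.0, data)    # start from a flat Beta(1, 1) prior
print(alpha / (alpha + beta))                          # posterior mean, close to 0.7
```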
So, can't the sufficient statistics be arbitrarily complicated? Well, the sufficient statistics are identified by the original likelihood, right? So they're in here, and they remain as complicated as we choose them in the original likelihood. If this is a Gaussian, for example, with the Gaussian as the exponential family, then this phi of x is x and x x-transpose, so the empirical mean and empirical covariance, if you like, up to normalization, and that's the stuff you need to store. If it's a categorical distribution, then the conjugate prior is a Dirichlet, and phi of x is literally just x. So that is easy, but of course you can think of other exponential families for which this might be harder, and in fact we encountered some in deep learning later on, in the instance of linear models where this object is completely intractable.
So in addition to full Bayesian inference as a closure in exponential families, we get even more. We get all sorts of properties of the distribution that are commonly of interest in information theory, information geometry, and statistics: expected values of the sufficient statistics, the Fisher information matrix, the entropy of the distribution, the cross-entropy between two members of the exponential family, Kullback-Leibler divergences between two members of a family, the score, the expected score, and so on. We get all of these in terms of derivatives, sometimes of higher order, of the log partition function, up to the base measure, and the base measure, the carrier measure, might be zero and then it doesn't matter at all. So these things here on the left tend to be integrals; they tend to involve expected values, but we can evaluate them without doing integrals, using automatic differentiation of the log partition function. So I sort of quipped back then
that if you have an exponential family, you could write a generator for the Wikipedia article that populates all the fields of the Wikipedia article for your probability distribution by using these properties. Basically, symbolic differentiation of the log partition function tells you everything you need to know. And this may historically be the reason these distributions came about: the people who invented them needed to do Bayesian inference, whether they called it that or not, or they needed to do likelihood-based inference, or at least MAP estimation, and then to quantify confidence using all of these properties. And so they ended up with probability distributions in which they could do these computations efficiently, using derivatives of the log partition function.
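A small sketch of that derivative trick (Bernoulli again as the assumed example; finite differences stand in for the automatic or symbolic differentiation mentioned above): the first derivative of the log partition function gives the expected sufficient statistic, and the second gives its variance, which for a one-parameter family is also the Fisher information.

```python
import numpy as np

def log_partition(w):
    return np.log1p(np.exp(w))          # A(w) for the Bernoulli family

def derivative(f, w, order=1, h=1e-4):
    """Central finite differences as a stand-in for autodiff."""
    if order == 1:
        return (f(w + h) - f(w - h)) / (2 * h)
    return (f(w + h) - 2 * f(w) + f(w - h)) / h**2

w = 0.8
p = 1 / (1 + np.exp(-w))                                # mean parameter corresponding to w
print(derivative(log_partition, w, 1), p)               # A'(w)  = E[phi(x)] = p
print(derivative(log_partition, w, 2), p * (1 - p))     # A''(w) = Var[phi(x)] = Fisher information
```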
But that's only true if we have the log partition function. A positive way of reading this is: if someone gives you one integral, the partition function, which is effectively an integral, then you can get everything here for free. But the other way around, you could also say that to get all of this, you need to be able to integrate over the sufficient statistics. So what happens if we encounter models, probability distributions of some form that can be thought of as linear in some parameters but nonlinear in the inputs, such that this object, which is given by, let me go back again, the logarithm of this integral, is not tractable? Such a situation arises,
for example, in supervised machine learning, if we do classification or regression with a generalized linear model. So we have inputs x, of which we evaluate some features that we have either manually chosen or that an optimizer picked for us by parameterizing these features and then following some gradient descent or other optimization routine. So we get a general set of features here, and then we can't compute this normalization constant. For that we needed our first instance of an approximate inference method, and the one that I proposed, which is maybe the most daring but also the fastest and maybe most general approach, is the Laplace approximation.
So that's the very first tool in your toolbox for approximate inference, and it is based on a differential analysis of a probability distribution. By differential I mean that it's a local description of the distribution in terms of a point and its geometry, gradients, Hessians, Jacobians evaluated at that point. And this works as follows. Consider any model where you find an estimate of the parameters through some form of empirical risk minimization, which might or might not involve a regularizer. Here's the equation that I used for the deep learning lecture, which is an instance of empirical risk minimization. But we could think of the very same situation for exponential families: we take the logarithm of an exponential family, there will be a phi of x and then a w in here, so now theta equals w, and this regularizer might amount to the conjugate prior, which is another function of w that is here, and now we need to find a posterior
over theta. We first realize that typically this minimization problem that we write down here is actually a log posterior. For deep learning, for example, the loss functions that people tend to use, the L2 loss, or the log loss, cross-entropy, are all instances of logarithms of likelihoods, categorical likelihoods, Gaussian likelihoods, and the regularizers that people tend to use, like L2, are logarithms of other distributions, for example Gaussians. And so we can think of this expression as an unnormalized log posterior distribution.
And now, to get an approximation to the associated posterior, we find a mode of this loss, a minimizer or a maximizer, depending on whether we put a minus in front or not, to find a theta star. That's our point estimate. And then, at this point, we do a second-order approximation of the loss. I should have moved the slide a second earlier; this is actually what I just said. So consider a loss which you know to be a log posterior, or maybe a negative log posterior. Find a minimizer of this loss; a minimizer is a point where the gradient is zero. At this point, do a Taylor expansion to second order. We assume that we can do that because the loss is sufficiently regular; that's the only constraint we actually have. We have to assume that this thing exists, that we have found a point where the gradient is zero, so there is a mode, and at that point we can evaluate this object, the second derivative of the loss, the Hessian in general.
Then if we look at this expression, we see that it's a quadratic form. There's no linear term, because the gradient is zero, and there's a constant term in front. We have deliberately dropped the higher-order terms; that's a decision we take. And then, since we know that this expression on the left-hand side is the logarithm of a posterior, that means the posterior is the exponential of, sorry, minus this expression.
So if we do the same thing on the right-hand side, if we take the exponential of this, we get a constant times e to the minus one half times the quadratic form, and that we can identify with a Gaussian distribution centered at the mode with, okay, yeah, I copied that over, that's not good, so this is obviously the inverse of this, the inverse of the Hessian, as the covariance. And now there we go: we get a Gaussian distribution purely using differentiation. There are no integrals being solved, no random numbers being drawn, no fanciful partial evaluation of some integral as we're going to have in a second, but instead just automatic differentiation paired with optimization and linear algebra, which are tools that are commonly available across machine learning and differentiable programming, including deep learning.
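A minimal Laplace-approximation sketch (my own one-dimensional toy posterior, with finite differences for the Hessian instead of autodiff): find the mode of a negative log posterior with an optimizer, evaluate the second derivative there, and the Gaussian approximation is N(mode, Hessian^{-1}).

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(theta, data):
    """Toy unnormalized negative log posterior: Bernoulli likelihood on the logit + Gaussian prior."""
    log_lik = np.sum(data * theta - np.log1p(np.exp(theta)))
    log_prior = -0.5 * theta**2 / 4.0
    return -(log_lik + log_prior)

rng = np.random.default_rng(7)
data = rng.binomial(1, 0.8, size=50)

result = minimize(lambda t: neg_log_posterior(t[0], data), x0=np.array([0.0]))
mode = result.x[0]                                  # the MAP point estimate

h = 1e-4                                            # finite-difference Hessian at the mode
hess = (neg_log_posterior(mode + h, data) - 2 * neg_log_posterior(mode, data)
        + neg_log_posterior(mode - h, data)) / h**2

print(mode, 1.0 / hess)                             # Laplace approximation: N(mode, Hessian^{-1})
```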
So Laplace approximations are a nearly universal tool to construct a Gaussian posterior on anything, including on exponential families, but also on entire deep neural networks. And then I did one or two or three lectures on how to apply this mechanism to deep neural networks, as maybe the most direct bridge from the classic theory of probabilistic machine learning to the more modern, or currently more prominent, more hyped, dimension of differentiable programming, deep learning. So you can actually endow any deep neural network with an associated Gaussian process posterior. You take the deep neural network and first train it in any way you like; that's step number one. When you do that, you minimize an empirical risk, typically a regularized empirical risk, but you don't even necessarily need a regularizer; you just find a mode of the loss. Then, at the mode, we make the mental assumption that we can linearize the network: the deep neural network f, which is a bivariate function of inputs and weights, is approximated as an evaluation of that function with a fixed set of parameters, namely the trained weights, which is still a function of the inputs, that's your trained deep neural network, plus its Jacobian with respect to the weights times a residual to the trained weights, which we can think of as a feature map created by the network, computed using automatic differentiation. That's just a conceptual step; we don't actually implement anything yet. But we realize that this first function we already have, it's the trained network, and the second function we can easily get in a deep learning framework, because it's just the thing you need to do gradient descent, the backward pass, basically.
Then you do a Laplace approximation in the way I just described. If you do this, then actually this Hessian simplifies a bit; it gets a more restrictive structure that's nasty to write down and also sometimes nasty to implement, but it's easier to handle than the full Hessian. And the Laplace approximation gives us a Gaussian distribution over the parameters. So now the last thing we need to do is to marginalize out this Gaussian distribution on the parameter space, on the weights, using the linearization assumption for the network, and these four steps together tell us that we actually have a Gaussian process posterior associated with our trained neural network.
neural network. I still find this a very interesting perspective
interesting perspective is like precisely because it is so
is like precisely because it is so simple. So there are other ways of doing
simple. So there are other ways of doing adding uncertainty to deep learning that
adding uncertainty to deep learning that tend to be more costly and less direct.
tend to be more costly and less direct. This is a very maybe it's a perhaps a uh
This is a very maybe it's a perhaps a uh uh somewhat daring thing to do because
uh somewhat daring thing to do because it's a very local approximation, right?
it's a very local approximation, right? might just be a very bad approximation
might just be a very bad approximation to the posterior because it only takes
to the posterior because it only takes the local shape into account. But it
the local shape into account. But it provides us with very tractable and very
provides us with very tractable and very interesting structures that I recommend
interesting structures that I recommend anyone doing deep learning should have
anyone doing deep learning should have at the back of their mind just to
at the back of their mind just to analyze their model. If you train a deep
analyze their model. If you train a deep neural network, you can think of a
neural network, you can think of a gausian process that you've you have
gausian process that you've you have constructed that has a mean function
constructed that has a mean function given by the network you just trained
given by the network you just trained and a co-variance function a kernel that
and a co-variance function a kernel that involves these green objects which are
involves these green objects which are the feature functions of your network
the feature functions of your network the jacobians at the trained point and
the jacobians at the trained point and this red thing this inverse covariance
this red thing this inverse covariance inverse hessen inverse curvature matrix
inverse hessen inverse curvature matrix um of the loss at theta R which is the
um of the loss at theta R which is the the bit of this framework that brings
the bit of this framework that brings the data into the uncertainty
the data into the uncertainty quantification because it contains the
quantification because it contains the loss on the training data. So these are
loss on the training data. So these are data structures that you can think of as
data structures that you can think of as representing the data set. So the
representing the data set. So the knowledge extracted from the data set
knowledge extracted from the data set its relationship to the function we are
its relationship to the function we are operating on that's these features and
operating on that's these features and then centered on the predictions that
then centered on the predictions that the network already does. And I did a
the network already does. And I did a whole lecture on like trying to argue
whole lecture on like trying to argue that this kind of functionality is
that this kind of functionality is practically useful that you can use it
practically useful that you can use it to do things like retrain parameter
to do things like retrain parameter efficient fine-tuning um uh
efficient fine-tuning um uh visualization of uncertainty
visualization of uncertainty uh
uh explainable machine learning and so on
explainable machine learning and so on and so on.
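To make this construction concrete, here is a minimal Python sketch (my illustration, not code from the lecture) using the JAX library for the automatic differentiation. It assumes a tiny one-hidden-layer regression network with a flat parameter vector, a squared loss, and a Gauss-Newton-plus-prior matrix standing in for the curvature of the loss, so it illustrates the formula rather than a production Laplace implementation.

```python
# Minimal sketch of the linearized-Laplace Gaussian-process view of a trained
# network (illustrative assumptions: tiny 1-hidden-layer regression net,
# squared loss, Gauss-Newton + prior precision in place of the full Hessian).
import jax
import jax.numpy as jnp

D_IN, HID = 1, 16
N_PARAMS = D_IN * HID + HID + HID + 1          # W1, b1, w2, b2 flattened


def net(w, x):
    """Scalar-output MLP f(w, x); w is a flat parameter vector."""
    W1 = w[: D_IN * HID].reshape(HID, D_IN)
    b1 = w[D_IN * HID: D_IN * HID + HID]
    w2 = w[D_IN * HID + HID: D_IN * HID + 2 * HID]
    b2 = w[-1]
    h = jnp.tanh(W1 @ x + b1)
    return w2 @ h + b2


# Pretend these are the weights found by ordinary (MAP / empirical-risk) training.
key = jax.random.PRNGKey(0)
w_star = 0.1 * jax.random.normal(key, (N_PARAMS,))

X_train = jnp.linspace(-2.0, 2.0, 20).reshape(-1, 1)

def jac(x):
    """Feature map: Jacobian of the network output w.r.t. the weights at w_star."""
    return jax.grad(net, argnums=0)(w_star, x)            # shape (N_PARAMS,)

J_train = jax.vmap(jac)(X_train)                          # (N, N_PARAMS)

# Curvature of the loss at w_star: Gauss-Newton term J^T J (squared loss, unit
# noise) plus a prior precision (weight decay) to ensure positive definiteness.
prior_precision = 1.0
H_loss = J_train.T @ J_train + prior_precision * jnp.eye(N_PARAMS)
H_inv = jnp.linalg.inv(H_loss)

def gp_mean(x):
    return net(w_star, x)                                  # the trained network

def gp_kernel(x, x2):
    return jac(x) @ H_inv @ jac(x2)                        # J(x) H^{-1} J(x')^T

x_test = jnp.array([0.5])
print("mean:", gp_mean(x_test), "variance:", gp_kernel(x_test, x_test))
```

The GP mean is just the trained network, and the kernel is the green-red-green sandwich described above: Jacobian, inverse curvature, Jacobian.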
Okay, now we're getting towards the more final parts of the lecture, but there's a question first.
Ah, you're asking a very good question. So for this to work, this matrix has to be positive definite, and you're right that in general this Hessian may not be positive definite. But if we do this linearization, then this red matrix actually is positive definite by construction. That's a nice aspect of it: it means at least you know that the thing you get will be a Gaussian process and it won't be badly behaved. Of course it might miss some important bits because of the linearization; we're basically dropping some higher-order terms in f. And in fact it may well be an interesting thing to study in the future, because what does it mean to have a Hessian that's not positive definite? It means you're not actually at a minimum, right? There's a direction along which you can change your network, and either you don't change anything, so the loss is invariant in some direction, which I think is a very commonly known aspect of deep learning, or in fact you're at a saddle point, and then there's a direction you can take along which the loss will go down, and it might be good to know about that, right? So yeah, totally good point, but I think it's not a problem of the probabilistic interpretation; it's just a general aspect of deep learning, that we're posing non-convex optimization problems and then we inherit problems like this.
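The standard way to see why the linearized curvature is positive definite is the generalized Gauss-Newton form; this is my reconstruction of the argument, with my own notation for the per-example loss Hessian.

```latex
% Under the linearization, the curvature of the loss takes a Gauss-Newton form
% (reconstruction; notation mine). If the per-example loss is convex in the
% network output, each \Lambda_n \succeq 0, and adding a prior precision \tau I
% makes the whole matrix strictly positive definite:
H \;\approx\; \sum_{n} J_{w_*}(x_n)^{\top}\, \Lambda_n\, J_{w_*}(x_n) \;+\; \tau I \;\succ\; 0.
```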
Okay, so that was the uncertainty quantification for deep learning part. Now I'm going to quickly do the final few things, which were admittedly a bit of a ragtag collection of things that I felt I have to do in a probabilistic machine learning class.
The first one was sampling methods. Sampling methods are algorithms that use random numbers. I started by pointing out that we often need expected values, in Bayesian inference and in statistical inference more generally. Such expected values can be approximated using what's called the Monte Carlo estimate, which is a sum over finitely many evaluations of the function whose expectation we're trying to compute, using evaluation points that are "iid" samples from the distribution p.
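As a toy illustration of this estimator (my example, not the lecture's), here is a Monte Carlo approximation of an expectation, run with a fixed random seed of the kind mentioned just below:

```python
# Minimal Monte Carlo illustration (my example, not the lecture's):
# approximate E_p[f(x)] for p = N(0, 1) and f(x) = x^2 (true value 1.0)
# by averaging f over "iid" pseudo-random samples, with a fixed seed so the
# run is reproducible.
import numpy as np

rng = np.random.default_rng(0)          # fixed random key -> deterministic output

def f(x):
    return x ** 2

for n in [10, 100, 1_000, 10_000, 100_000]:
    samples = rng.standard_normal(n)    # x_i ~ p(x) = N(0, 1)
    estimate = f(samples).mean()        # (1/N) sum_i f(x_i)
    print(f"N={n:>6d}  estimate={estimate:.4f}  error={abs(estimate - 1.0):.4f}")
```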
And I put "iid samples" in quotation marks because I did a big spiel for a whole 90-minute lecture, which confused some of you and which some of you kind of liked, to point out that this idea of iid randomness is quite flawed as a concept. In fact, it relies on things that fundamentally cannot be computed, which are then approximated with things that aren't actually random, and it's all a bit weird. Nevertheless, the use of these pseudo-random numbers has become very common in machine learning and we do it all the time, and actually we tend to do it in a totally deterministic fashion, using fixed random keys so that we can run the code several times and it produces the same plot every time. So there isn't actually any randomness, but whatever.
And this philosophical problem also carries over into the analysis of these methods. There is a classic argument that Monte Carlo estimates are in some sense optimal because they achieve the optimal convergence rate of an unbiased estimator, the Cramér-Rao bound. However, that bound is fundamentally already using the idea of randomness. It turns out that, at least for some problems, there are estimates, which you could call biased if you like, although that's not really a concept that makes much sense for a deterministic quantity, that converge much, much faster. Nevertheless, I don't want you to think that Monte Carlo methods are completely stupid or shouldn't be used. I'm using sampling methods in some of my own work as well; they're just sometimes very convenient. Maybe in your head you can think of Monte Carlo methods as algorithms that are the first thing you might want to try, but the last thing you should actually do. By which I mean that they can be very easy to implement. Often they just need access to an unnormalized probability density function, which is often easy to evaluate: for example, if you can just multiply some prior with some likelihood, no matter what the likelihood and the prior are, you have these objects, and then you can just call these methods and let them run, and they tend to provide samples that converge reasonably quickly. They're easy to implement in particular for quite unstructured problems where you just don't know what else to do. We encountered several such methods: first of all the basic method that just uses direct samples from a distribution, which is sometimes called simple Monte Carlo, and then algorithms that are more approximate, like rejection sampling, importance sampling, and all sorts of Markov chain Monte Carlo methods, like the basic Metropolis-Hastings scheme, Gibbs sampling, Hamiltonian Monte Carlo and so on, which are all increasingly complicated approximations that only asymptotically actually draw iid from the distribution. Nevertheless, these are easy to implement, and they quickly provide a bunch of numbers that give a rough estimate, with one over square root of the number of samples convergence, for the thing you're trying to estimate. And if that's what you're going for, that's good. If you want a highly performant algorithm that can run on large data sets or on many, many instances of the same problem, something else might be a better idea.
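To show how little such a method needs, here is a bare-bones random-walk Metropolis-Hastings sampler (my sketch, not code from the lecture); it only touches the target through an unnormalized log-density, exactly as described above, and its samples are only asymptotically distributed like the target.

```python
# Bare-bones random-walk Metropolis-Hastings (illustrative sketch, not the
# lecture's code). It needs nothing but an unnormalized log-density, e.g.
# log prior + log likelihood, and returns correlated samples that are only
# asymptotically distributed like the target.
import numpy as np

def log_target(x):
    # Unnormalized log-density: a two-component Gaussian mixture in 1D.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(log_p, x0, n_steps, step_size=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, logp_x = x0, log_p(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal()   # symmetric proposal
        logp_prop = log_p(proposal)
        if np.log(rng.uniform()) < logp_prop - logp_x:      # accept/reject
            x, logp_x = proposal, logp_prop
        samples.append(x)
    return np.array(samples)

chain = metropolis_hastings(log_target, x0=0.0, n_steps=20_000)
print("posterior mean estimate:", chain[5_000:].mean())    # discard burn-in
```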
Talking about something else: because I was talking about random numbers already, I then had a lecture on generative models, on diffusion and flow matching models, which are currently the state of the art for the generation of natural images; probably everyone has used one of them already. So we were interested in how they work, and one way to approach them, as I argued back then, is from the sampling perspective. We may discover that sampling from a complicated distribution over some variable x0 is hard, even for very good sampling methods like advanced Markov chain Monte Carlo methods, where complicated might mean, but doesn't have to mean, that the distribution has islands, that it only has a few points on which it has non-zero mass. It might also mean that it spans some kind of manifold that is very slim within the space parameterized by the x you would like to sample from.
In such situations, one operation that is quite interesting, and that is not part of classic Markov chain Monte Carlo methods, is to take the distribution over x0 and convolve it with a Gaussian. That means you create a new random variable y which is the old random variable x0 plus Gaussian noise. When we do that, the new variable y tends to have a distribution that is easier, because it looks a bit more Gaussian. It's easier to draw from; it's a bit wider, a bit less isolated into islands. And what we can do is let our Monte Carlo method operate in this noisier, more regular space, and then take a step and estimate what the corresponding x0 is by trying to invert the noising process, by denoising the sample. Now, we notice that this is a bit of a zero-sum game, because the more noise you add, the easier you make the distribution to draw from, but the harder you make the estimation process. So the really cool idea is to say: maybe we don't do just one noisy step, but we take a continuum, or at least a large number of small noise steps, that go from the original sample to something that can be considered basically a standard Gaussian distribution, and then, instead of doing a single estimation step backwards, we do a whole long sequential denoising process, because then every individual estimation step might actually work quite well, since the local noise is very small. To do that, we need access to a model that can basically guess the noise, and we discovered that this is related to estimating the score of the noising distribution at any point in time t. That can be learned, for example, with some deep neural network, which I showed a very simple example of, and of course now there are very elaborate such models that give really powerful sampling algorithms, which can sample from basically the manifold of natural images, even conditioned on some word description of what should be in the image. There's a continuous limit of this in which the noise process is a stochastic differential equation, and the backward denoising process follows the score of the probability distribution. It can also be phrased in terms of an ordinary differential equation without noise, where you basically just draw samples once and then the dynamics of the entire set of samples follows an ODE called the probability flow, and models that use this ODE are called flow matching models. These are now maybe the state of the art, at least at the time of speaking, for the generation of samples from complicated distributions.
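Here is the shape of that idea in a toy Python sketch (mine, not from the lecture). The data distribution is deliberately just a Gaussian, so the score of every noised marginal is known in closed form and stands in for the learned denoising network; the reverse loop then takes many small steps that follow that score.

```python
# Toy diffusion sketch (my illustration, not the lecture's code). The data
# distribution is N(3, 0.25), so the score of each noised marginal is available
# in closed form; in a real diffusion model a neural network trained by
# denoising score matching would replace `score` below.
import numpy as np

rng = np.random.default_rng(0)
T = 200
alphas = np.linspace(0.999, 0.95, T)       # per-step signal retention
alpha_bar = np.cumprod(alphas)             # total signal kept after t+1 steps

mu0, var0 = 3.0, 0.25                      # the target we pretend is "hard"

def score(x, t):
    # Score of the noised marginal p_t = N(sqrt(abar_t)*mu0, abar_t*var0 + 1 - abar_t),
    # i.e. the distribution of sqrt(abar_t)*x0 + sqrt(1 - abar_t)*noise.
    mean_t = np.sqrt(alpha_bar[t]) * mu0
    var_t = alpha_bar[t] * var0 + (1.0 - alpha_bar[t])
    return -(x - mean_t) / var_t

# Reverse process: start from (roughly) a standard Gaussian and take many small
# denoising steps, each one a reverse (ancestral) step expressed via the score.
x = rng.standard_normal(5000)
for t in reversed(range(T)):
    beta = 1.0 - alphas[t]
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = (x + beta * score(x, t)) / np.sqrt(alphas[t]) + np.sqrt(beta) * noise

print(f"denoised mean ~ {x.mean():.3f} (target 3.0), std ~ {x.std():.3f} (target 0.5)")
```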
Okay, that was the generative modeling part. Are there questions on that?
>> [inaudible]
Say again?
>> Shouldn't... is the verb wrong? It says it becomes harder as alpha increases, in the second point.
>> Ah, as alpha... yes, decreases, I'm sorry, you're right. So as alpha decreases we get closer to a standard Gaussian, right: this becomes a one and this becomes a zero. Yeah, you're right.
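For reference, the formula this exchange refers to is presumably the standard noising step (my reconstruction; the slide's symbols may differ): as alpha decreases, the noise coefficient "becomes a one" and the signal coefficient "becomes a zero", so y approaches a standard Gaussian.

```latex
% Standard Gaussian noising step (reconstruction; notation mine):
y \;=\; \sqrt{\alpha}\, x_0 \;+\; \sqrt{1-\alpha}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I),
% so as \alpha \to 0, \sqrt{1-\alpha} \to 1 and \sqrt{\alpha} \to 0,
% and y tends to a standard Gaussian.
```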
Okay. So now, in the very final three lectures of the term, I introduced another concept for approximate inference which is historically very important and continues to be used today in many applications, often for analytical purposes, but has become maybe less prominent than it was a while ago. So I'm myself also a bit uncertain about how much I should teach about it; I ended up investing two and a half lectures in it. We started getting to this framework, called variational inference, by first studying an even more classic algorithm called EM. EM is a quite specific algorithm for finding maximum likelihood estimates in cases where the likelihood of the data itself, the marginal likelihood, is hard to evaluate, but can be made easy to evaluate if we add latent variables zed. This object without the integral around it, just the part inside, is called the complete data log likelihood. For example, we saw that this shows up in Gaussian mixture models, or in mixture models in general; in mixture models this zed tends to be the variable that tells us which data point belongs to which mixture component, and then everything becomes easily tractable.
The algorithm consists of two steps that get iterated. In the first one, we set a distribution over zed to the posterior over zed given x at a particular point in the parameter space. We can think of this, as a very important look ahead to variational inference, as constructing an approximation to the posterior at the current value of theta, which happens to have the property that it minimizes the KL divergence to the posterior at that point in the parameter space. Then, in a second step, the so-called maximization step, we change the parameters. So in the E step, the estimation step, we estimate the latent variables zed, and then in the maximization step we change theta, the parameters, to maximize this expression, which is the expected complete data log likelihood, an expectation under that posterior of the logarithm of the complete data likelihood, plus a constant which is the entropy of Q. And we found that this object which we maximize is called the evidence lower bound, because it is a lower bound on the evidence. The evidence is this thing, and the difference between the bound and the evidence is a positive number, the KL divergence between our approximation and the true posterior. So by setting Q, locally at a particular theta, to this posterior, we're making that KL term zero, the bound becomes tight, and when we then maximize the bound over theta we also have to increase the evidence. And so we found an algorithm for maximizing the likelihood without ever actually evaluating that likelihood, which is kind of cool.
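Written out, the decomposition this argument rests on is the usual evidence identity (my reconstruction in standard notation):

```latex
% Evidence decomposition underlying EM (standard identity, reconstructed from
% the description above):
\log p(x \mid \theta)
  \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathrm{ELBO}(q,\,\theta)}
  \;+\; \underbrace{\mathrm{KL}\!\left(q(z)\,\middle\|\,p(z \mid x, \theta)\right)}_{\geq\, 0}.
% E step: set q(z) = p(z | x, \theta), so the KL term vanishes and the bound is tight.
% M step: maximize the ELBO over \theta, which then also increases log p(x | \theta).
```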
So now the next step, the final step towards variational inference, is to say: huh, maybe I can get away without setting this bound to be exactly tight. So without making the KL divergence zero, but just by trying to increase this object, and thereby minimizing the KL term, because the evidence is a constant, at least in terms of zed, right? So when we try to maximize this, we will find a q of zed that gets as close as possible to the posterior, and we can think of that as an approximate inference method for finding distributions that are good approximations to the posterior. These are called variational inference methods. Why? Because they vary an object Q that is a function. They are optimization algorithms that operate in spaces of functions, actually in spaces of probability distributions Q, rather than optimizing in zed to find modes of distributions and then describing them locally in terms of zed, like Laplace approximations did.
But how do you specify the space of distributions over which you want to optimize? Well, an easy thing is that you could decide that Q of zed should be some parametric family, for example a Gaussian. Then you can just optimize the ELBO directly; maybe you have to do some of the integrals in there using Monte Carlo. That's called fixed-form variational inference. But there's actually a less restrictive form that is often surprisingly possible to do, which only imposes a little bit of algebraic structure on Q, and this is called the mean field assumption. We assume that our Q of zed, which we want to be a good approximation to the true posterior on zed, has to factorize along some set of variables. If we do that, then it turns out that the evidence lower bound, the ELBO, as a function of one particular of these variables, zed j, can be written as an expected value of the complete data log likelihood under expectations over the other variables. And because this is a function of zed j, even if we don't know how to do the integral, we tend to be able to see from the functional form of the generative model what the functional form of our approximation should be. And if we do that iteratively across all the variables, we are often able to do the individual expectations, because they become, for example, exponential families for which we know how to do the expectations. That leads to iterative algorithms that can sometimes be very efficient, algorithmically, at constructing distributions of high quality for approximate inference.
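The coordinate update this refers to takes the standard mean-field form (my reconstruction, in the usual notation):

```latex
% Mean-field variational inference with q(z) = \prod_j q_j(z_j). Maximizing the
% ELBO over a single factor, holding the others fixed, gives the standard
% coordinate-ascent update (reconstruction of the result described above):
\log q_j^{*}(z_j) \;=\; \mathbb{E}_{q_{-j}}\!\left[\log p(x, z \mid \theta)\right] \;+\; \mathrm{const},
% where the expectation is over all factors except q_j; iterating over j yields
% the efficient algorithms mentioned in the text.
```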
Doing that requires writing down a lot of equations on a piece of paper, doing some analysis ourselves, which is maybe a last holdout of human labor in the times of AI. And then on Tuesday I told you that this kind of framework, if it's applied to mixture models, can be thought of as having something to do, loosely, with attention, mostly just to motivate you to have a look.
With that I'm through the term, but I'm not completely done; I have about two minutes left, and I'd like to summarize on a very high level what I was trying to get across this time. For many of you this is one of the two or three key lectures in your master's degree in AI, machine learning and computer science. I wanted to endow you with a basic understanding of the relationships between model classes, of the functionality that is available in machine learning, and of how to think about it, not so much just about implementation, although of course we did that as well. I wanted to convince you that probability theory as a mathematical framework provides the foundation for the design of machine learning methods, because it formalizes the process of extracting information from all sorts of, not just data, but all sorts of information: that might be empirical data, but it may also be mechanistic knowledge, like knowing that there's a differential equation that drives a system, or computational information, like knowing that you have evaluated the gradient of a vector field a bunch of times and now you know a little bit about the vector field, but not the entire thing.
And because the mechanism of probabilistic inference is fundamentally intractable, it requires us always to do approximations. Someone asked last week, isn't it a problem that we always have to do these approximations? Everything always ends up Gaussian in your lecture, why can we even trust this? Well, it turns out that everyone in science has been doing this for at least the last two centuries, including Newton and Einstein and other people who probably knew better than most, because you don't get around making approximations if you want to write down something you can actually work with, either on a piece of paper or on a computer. So all of this, including modern machine learning, is a form of approximation; the probabilistic perspective just allows us to reflect on it better.
And of course there are powerful mechanisms, like the ones we discussed today and over the course of the term, that actually work, that you can actually apply, and they allow you either to build real tools, like the Gaussian process models we built, or the exponential family models, or of course the diffusion models. But sometimes they also help us get insights, like this story about deep learning, this differential geometry perspective on deep neural networks that allows us to identify new functionality for existing models. For example, in deep learning there are classes of methods that were invented in a way that had nothing to do with Bayesian inference, but now that we think about them as Bayesian models, we discover that we can use these insights to create new functionality, like continual learning, for example.
That's what I wanted to say at the end. Now I'm one minute over time. If you have more questions, you can also come to the front. I'm grateful that you were here all the time, and I'm looking forward to seeing many of you on Monday at 8:00 a.m. for the exam. Thank you very much.