Predictive coding offers a biologically plausible alternative to backpropagation for artificial neural networks, explaining how brains might learn effectively by minimizing prediction errors in a continuous, local, and parallel manner.
Of all the mysteries that the human
brain presents, perhaps one stands above
the rest. How does it learn so
effectively? In the world of artificial
intelligence, scientists and engineers
have spent decades trying to replicate
the brain's learning mechanisms. Their
efforts led to backpropagation with
gradient descent, the workhorse
algorithm that powers virtually the
entire field of machine learning today.
Due to its remarkable success,
researchers began to speculate that
perhaps brains do something similar.
However, there is a fundamental problem.
The backpropagation algorithm
contradicts essential biological
principles of brain function, making its
exact implementation in neural tissue
virtually impossible.
In recent years, however, an alternative
algorithm called predictive coding has
emerged that is not only more aligned
with the brain's biological hardware,
but can sometimes work even better than
backpropagation itself. In this
video, we will build predictive coding
from first principles and explore what
issues of biological plausibility it
addresses and how it might inspire the
The fundamental challenge that
computational systems must solve is
called credit assignment. When you have
a system with numerous parameters like
connection weights between neurons that
can be adjusted to achieve a desired
output such as recognizing objects in an
image or executing appropriate actions,
how do you determine which parameters to
adjust and by how much? Artificial
neural networks solve this elegantly
through what's called automatic
differentiation. Because the entire
computation can be represented as a
mathematical function, computers use
calculus, particularly the chain rule of
derivatives, to calculate precisely how
each parameter should be nudged to
guarantee improvement in performance. If
you're interested in a deep step-by-step
derivation of how backpropagation
works, I've covered this in one of my
earlier videos. However, despite its
remarkable success in machine learning,
evidence suggests that the brain almost
certainly uses a different approach.
There are various reasons why
backpropagation doesn't map directly onto
neural hardware, but luckily most of
them have biologically plausible
workarounds. What is crucial for our
discussion today and why I'm extremely
excited about predictive coding is that
it addresses two fundamental constraints
that are absolutely incompatible with
neurophysiology and which are the biggest
reasons why brains cannot perform
backprop, namely lack of local autonomy and
discontinuous processing. Sounds
confusing? Let's unpack what this means.
Artificial networks operate in strictly
separated phases that alternate
sequentially. First, information flows
forward. Input propagates across layers
to the output, generating a prediction.
Next, this prediction is compared
against the desired outcome, calculating
an error. Then comes the crucial
backward pass. This error travels back
through the network layer by layer
determining precisely how each weight
should change to reduce future errors.
Finally, all weights update
simultaneously and the cycle repeats
with a new training example. For this
process to work, neurons must
essentially freeze their feedforward
activity values like taking snapshots of
activity and holding on to them while
error signals flow backward. But our
brains don't work like this. They don't
hit pause between thinking and learning.
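The four strictly separated phases described above can be sketched on a deliberately tiny network. This is a minimal illustration, assuming a chain of two scalar weights with no activation function and a squared-error loss; the function name `backprop_step` and all numeric values are hypothetical choices for the example, not something from the video.

```python
def backprop_step(x, target, w1, w2, lr=0.1):
    """One training step of a two-weight linear chain: y = w2 * (w1 * x)."""
    # Phase 1: forward pass. The hidden activity h has to be kept around.
    h = w1 * x
    y = w2 * h
    # Phase 2: compare the prediction with the desired outcome.
    err = y - target
    # Phase 3: backward pass. It reuses the frozen values h and x.
    grad_w2 = err * h
    grad_h = err * w2
    grad_w1 = grad_h * x
    # Phase 4: all weights update together, then the cycle repeats.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, 0.5 * err * err


w1, w2 = 0.5, 0.5
losses = []
for _ in range(20):
    w1, w2, loss = backprop_step(x=1.0, target=2.0, w1=w1, w2=w2)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks over the 20 steps
```

Notice that phase 3 cannot start until phase 1 has finished, and that `h` must stay unchanged in between. That is the "snapshot" requirement in code form.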
Communication in biological tissue is
relatively slow compared to silicon
processors. If the brain followed the
backpropagation approach, it would have to
completely stop information processing
for hundreds of milliseconds before
performing the backward pass to update
connections. Imagine experiencing brief
blackouts every time you learn something new.
Instead, biological brains process
information and learn simultaneously in
a continuous stream. There is no
evidence for separate forward and
backward phases. Neurons receive,
process, and adapt to information in
parallel without pausing computation to
accommodate learning. The second major
issue with backpropagation is its
reliance on global coordination.
Not only must there exist some kind of
central controller to switch the entire
network between forward and backward
modes, but this information must
propagate in a precise temporal
sequence. Even if neurons could somehow
freeze their activity, they would need
to unfreeze in strict succession: you
cannot compute errors for a given neuron
before its downstream partners have
finished calculating their own errors.
Everything we know about brain
physiology suggests that such global
coordination is extremely unlikely to
exist. While there are some coordinating
mechanisms, such as oscillations like theta and
gamma rhythms, attentional systems, and
neuromodulators like dopamine that
influence broad populations, these
mechanisms operate at much coarser
temporal and spatial scales than would
be required for backpropagation, which
relies on cell-by-cell precision.
Instead, individual neurons and synapses
mostly function as autonomous agents,
modifying their states based solely on
information physically available at
their specific locations. The brain
operates as a massively parallel, locally
autonomous system where computation and
learning occur simultaneously
throughout the network in a distributed
manner without centralized control.
Now that we understand the limitations
of backpropagation in biological
systems, let's explore a promising
algorithm, predictive coding. This framework originated
from mid-twentieth-century research,
proposing that the brain's fundamental
objective is to predict incoming sensory
information. From an evolutionary
perspective, prediction enhances
survival by allowing organisms to
anticipate threats and interpret noisy
observations. There is also an
efficiency argument. Neural activity
demands considerable metabolic energy,
and a brain that can predict incoming
signals only needs to process unexpected
information, reducing the metabolic
burden of transmitting predictable and
thus redundant data. In this view, the
brain's primary task isn't simply
processing incoming stimuli, but
constructing an internal model that
explains sensory
inputs. When this model predicts
accurately, minimal additional
processing is required. When predictions
fail, the resulting prediction errors
signal that the internal model needs updating.
Predictive coding formalizes this
concept as a hierarchical system where
each neural layer attempts to predict
the activity of the layer below it. The
lowest level corresponds to raw sensory
input like pixels of an image while
higher levels encode increasingly
abstract features and categories that
enable effective prediction of the lower
level visual features. Although real
brains possess more complex
connectivity, including associative
connections between different
modalities, the simplified hierarchical
model captures the core
principles. Information flows
bidirectionally through this hierarchy.
Top-down connections carry predictions
from higher levels to lower levels,
while bottom-up connections carry
prediction errors, the differences between
predictions and the actual activity.
This abstract description of information
flow will guide our derivation of how
neurons interconnect. We'll approach our network
as a so-called energy-based model.
Essentially, this means associating each
possible network state with a single
number representing some form of
abstract energy. We can then derive
rules for how the system should evolve
to reduce this energy. This framework
parallels physical systems that
naturally progress towards minimum
energy states like a ball rolling
downhill to minimize gravitational
potential energy or proteins folding to
minimize atomic interaction energy.
Since the brain is also a physical
system, it too evolves towards states
that minimize some form of energy. In
predictive coding networks, this energy
relates to the total magnitude of errors
between predictions and reality. To
visualize it, consider the following
analogy. Imagine the network as an
assembly of movable parts, springs, and
connection rods where each neuron is a
node sliding on a post, its height
representing its activity level. On the
same post slides a platform
corresponding to its predicted activity,
determined by the neurons from the layer
above. A spring connects the neuron node
and the platform, and the tension of the
spring, proportional to its squared
length, contributes to the overall
energy. If the neuron's activity deviates
significantly from its predicted value
in either direction, the energy
increases. A neuron's activity can be
freely adjusted while its predicted
activity is determined by other neurons.
We can visualize it as rods connecting
neuron nodes on the layer above to the
platforms at the current level, positioned
at variable angles corresponding to
synaptic weights, which determine how
other neurons' activities influence the
prediction. The prediction is the sum of
activities from all neurons in the layer above,
each multiplied by the synaptic weight
connecting them. Note that typically
activities pass through a nonlinear
activation function like sigmoid or
relu, but I'm omitting it here for
simplicity. The prediction error for
each neuron then is the difference
between its actual and predicted
activity. And the total energy
representing the overall tension of all
springs sums the squared errors across
all neurons in each layer.
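To make these definitions concrete, here is a minimal sketch of the prediction, the prediction error, and the total energy in plain Python. It follows the simplification above of leaving out the activation function. The factor of one half in the energy is an assumed convention (it keeps the derivatives clean), and the function names and the example numbers are hypothetical.

```python
def predict(weights, upper):
    """Top-down prediction for every neuron in the layer below.

    weights[k][i] couples neuron i in the upper layer to neuron k below.
    No activation function is applied, matching the simplification above.
    """
    return [sum(w_ki * x_i for w_ki, x_i in zip(row, upper)) for row in weights]


def errors(lower, weights, upper):
    """Prediction error per neuron: actual activity minus predicted activity."""
    return [x - mu for x, mu in zip(lower, predict(weights, upper))]


def energy(layers, weights):
    """Half the sum of squared errors over every layer that is predicted.

    layers[0] is the bottom (sensory) layer, and weights[l] predicts
    layers[l] from layers[l + 1]. The 0.5 factor is an assumed convention.
    """
    total = 0.0
    for l, w in enumerate(weights):
        total += 0.5 * sum(e * e for e in errors(layers[l], w, layers[l + 1]))
    return total


# Two neurons at the bottom, predicted by two neurons in the layer above.
layers = [[1.0, 0.0], [0.5, 2.0]]
weights = [[[1.0, 0.0], [0.0, 0.5]]]

print(errors(layers[0], weights[0], layers[1]))  # [0.5, -1.0]
print(energy(layers, weights))                   # 0.625
```

In the spring picture, each entry of `errors` is the signed length of one spring, and `energy` adds up the squared lengths.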
The network's fundamental objective is
to minimize the total prediction error
by finding the optimal configuration of
neural activities and connection
weights. As we'll see shortly, when
presented with training examples, the
network settles into states that balance
these elements to represent input-output
relationships as accurately as possible.
So let's determine precisely how neural
activities and connection weights should
adjust to reduce this total energy. The
resulting mechanisms will align with
neurophysiology. During the system's
evolution, it effectively rolls downhill
on the energy surface defined in a
high-dimensional space where each
coordinate represents a parameter such
as neural activity or synaptic weight.
Mathematically, this downhill roll
corresponds to moving in the direction
of steepest descent opposite to what's
called the gradient of the function
where the gradient vector points in the
direction of steepest ascent and is
composed of derivatives with respect to
each parameter. Let's isolate a specific
neuron at layer L and determine how to
adjust its activity to lower the
energy. To find this derivative, let's
revisit our energy definition where we
sum over all posts and add up the
squared lengths of all springs. Since
the derivative of a sum equals the sum
of the derivatives, we can examine each
post individually and ask: if we slightly
adjust the node height x_i at layer
L, how would the tension at any post
change? Then we add up all these effects.
First of all, notice that this neuron
doesn't affect the tension at any spring
at layers upstream from L. So the
derivative of all those terms is zero.
Even within layer L itself, the only
spring directly affected is the one
connecting neuron I to its predicted
value. By differentiating the square of
the prediction error, we find that the
rate of change of this neuron's activity
is the negative of its prediction error.
This makes intuitive sense. When the
error epsilon is positive, meaning the
neuron's activity exceeds its
prediction, the spring wants to contract
and pull the value down towards the
prediction, creating the negative rate
of change. Conversely, if the value is
lower than predicted, the spring tension
drives the neuron's activity
upward. But there is additional
complexity to consider. When we adjust
the height of the node at layer L, beyond
affecting its own spring, it also
influences the predicted activities at
the layer below it. To compute the
complete derivative, we must account for
how a change in x_i affects these
downstream
errors. Recall that the predicted
activity of a neuron is given by the
weighted sum of activities of upstream
neurons. So when we change x_i at
layer L, for each neuron at the layer
below, it affects the predicted value
proportionally to the weight connecting
them. To compute the total derivative,
we need to add up the prediction errors
from the layer below scaled by the
connection weights and combine them with
our earlier result.
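The combined result can be checked numerically. The sketch below assumes the same linear setup with energy equal to half the sum of squared errors, so the rate of change of x_i is minus its own error plus the weighted errors from the layer below. It compares that expression with a finite-difference estimate of the downhill direction. The names, the three-layer shape, and the numbers are hypothetical, and the top layer is assumed to have no prediction (zero error).

```python
def predict(w, upper):
    return [sum(a * b for a, b in zip(row, upper)) for row in w]


def layer_errors(layers, weights):
    """Errors for each layer; the top layer has no layer above, so it gets zeros."""
    errs = []
    for l, w in enumerate(weights):
        mu = predict(w, layers[l + 1])
        errs.append([x - m for x, m in zip(layers[l], mu)])
    errs.append([0.0] * len(layers[-1]))
    return errs


def energy(layers, weights):
    return 0.5 * sum(e * e for errs in layer_errors(layers, weights) for e in errs)


def activity_rate(layers, weights, l, i):
    """dx_i/dt at layer l: minus own error, plus weighted errors from below."""
    errs = layer_errors(layers, weights)
    from_below = 0.0
    if l > 0:
        # weights[l - 1][k][i] couples neuron i here to neuron k one layer down.
        from_below = sum(weights[l - 1][k][i] * errs[l - 1][k]
                         for k in range(len(layers[l - 1])))
    return -errs[l][i] + from_below


layers = [[1.0, -0.5], [0.3, 0.8], [0.6]]
weights = [[[0.5, -1.0], [2.0, 0.25]], [[1.5], [-0.5]]]
l, i, h = 1, 0, 1e-6

# Central finite difference of the energy with respect to x_i at layer l.
up = [row[:] for row in layers]
down = [row[:] for row in layers]
up[l][i] += h
down[l][i] -= h
numeric = -(energy(up, weights) - energy(down, weights)) / (2 * h)
analytic = activity_rate(layers, weights, l, i)

print(analytic, numeric)  # the two values agree closely
```

Agreement between the two numbers confirms that following `activity_rate` moves the neuron downhill on the energy surface.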
Notice that for some downstream neuron,
if its activity is larger than its
predicted value, to reduce the tension
in the spring, we need to increase the
prediction by moving the platform up,
which can be done by moving the neuron
at the layer above up as well if the
weight coupling them is positive.
Conversely, if the prediction error is
negative, tension can be decreased by
lowering the predicted value through
decreasing the activity of the upstream
neuron. This elegant equation tells us
something profound about neural
dynamics. Activity is adjusted trying to
find a compromise between two competing
influences. The first term drives the
neuron to align with its top-down
prediction while the second term
encourages it to better predict the
layer below. When these forces settle
into balance, the neuron has found its
optimal activity level, one that
minimizes prediction errors both at its
own layer and the layer it helps to
predict. But before we move to adjusting
the weights, let's translate these
update rules from abstract springs to
neurons. Notice that each neuron must
receive its own prediction error as
input with a negative sign. Earlier we
treated this error as a kind of abstract
subtraction, but this comparison must
physically occur somewhere. We need a
mechanism to store the prediction error
so it can drive the activity
changes. This is the fundamental insight
of predictive coding. We need a separate
population of neurons explicitly
encoding prediction errors. In fact,
this is the origin of the term
predictive coding. Neurons forming a
code that represents prediction errors
rather than signals themselves. In our
framework, within each layer, we can
imagine that alongside each
representational neuron x_i, which
encodes predictions passed to the layer
below, there exists a dedicated error
neuron, a biological counterpart that
encodes the deviation of x_i from its
predicted value. With this structure in
mind, we can directly read off the
required neural connectivity from our
update rule. A representational neuron x_i
must be inhibited by its
corresponding error neuron and excited
by error neurons sending feedback
signals from the layer below. This
elegantly maps our mathematical
formulation onto biological
circuitry. Now we need to determine what
drives the error neurons themselves. By
definition, error neurons function as
comparators, calculating the difference
between the activity of x_i and its
predicted value which is given by the
weighted combination of activities from
the layer above. This equation reveals
another set of required connections.
Error neurons receive excitatory input
from their partner representational
neurons within the same layer and
inhibitory input from neurons in the
layer above that communicate
predictions. Perfect. Now we have two
distinct populations of neurons with
specific excitatory and inhibitory
connections between them. When allowed
to unfold according to its own intrinsic
dynamics, this network will settle into
an equilibrium which minimizes
prediction errors across all layers. But
everything we have discussed so far
assumes fixed connection weights. To
complete our model, we need to endow it
with learning capabilities. Like neural activities,
synaptic weights are also movable parts
in our system that evolve towards
configurations minimizing the total
energy. For a weight connecting neuron I
in layer L to neuron K in layer L minus
one, we can derive an update rule that
decreases the total energy by taking
steps opposite to the gradient
direction. Since our energy function
sums all squared prediction errors
across the entire network, when we change
the weight coupling those two neurons,
the only term that is affected is the
prediction error at the postsynaptic
neuron. The derivative equals the
negative of this prediction error
multiplied by the presynaptic neuron's
activity. This gives us an elegant
update rule where weight changes are
proportional to the product of the two
activities: the postsynaptic error and the
presynaptic activity. This rule strikingly
resembles Hebbian plasticity in
neuroscience: neurons that fire together
wire together. However, translating this
rule to biological neural connectivity
reveals a challenge. Predictions flow
from top to bottom with the
representational neuron I connecting to
the neuron K at the layer below. When
prediction errors flow upward from this
error neuron back to neuron I, our
derivation requires using the same
synaptic weight. But in biological
networks, these are physically distinct
synapses, and maintaining perfect
symmetry would require instantaneous
communication between them, a phenomenon
not observed in the brain. This
so-called weight transport problem
affects both backpropagation and
predictive coding.
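Before looking at how this problem might be softened, the weight rule itself can be checked with a minimal sketch. It assumes the linear setup with energy equal to half the sum of squared errors, so the gradient for the weight from presynaptic neuron i to postsynaptic neuron k is minus the error at k times the activity of i. The function names, the learning rate, and the numbers are hypothetical.

```python
def errors(lower, w, upper):
    """Prediction error for each neuron k below: x_k - sum_i w[k][i] * upper[i]."""
    return [x_k - sum(w_ki * x_i for w_ki, x_i in zip(row, upper))
            for x_k, row in zip(lower, w)]


def energy(lower, w, upper):
    return 0.5 * sum(e * e for e in errors(lower, w, upper))


def weight_step(lower, w, upper, lr):
    """Local rule: each weight moves by lr * (postsynaptic error) * (presynaptic activity)."""
    errs = errors(lower, w, upper)
    return [[w_ki + lr * errs[k] * upper[i] for i, w_ki in enumerate(row)]
            for k, row in enumerate(w)]


lower = [1.0, -0.5]            # activities in layer L - 1 (postsynaptic side)
upper = [0.3, 0.8]             # activities in layer L (presynaptic side)
w = [[0.5, -1.0], [2.0, 0.25]]

# Finite-difference check of the gradient for a single weight, w[0][1].
h = 1e-6
w_up = [row[:] for row in w]
w_down = [row[:] for row in w]
w_up[0][1] += h
w_down[0][1] -= h
numeric_grad = (energy(lower, w_up, upper) - energy(lower, w_down, upper)) / (2 * h)
analytic_grad = -errors(lower, w, upper)[0] * upper[1]

energy_before = energy(lower, w, upper)
energy_after = energy(lower, weight_step(lower, w, upper, lr=0.1), upper)

print(analytic_grad, numeric_grad)  # the two gradients agree closely
print(energy_after < energy_before)  # True
```

Each weight only needs the error of the neuron it projects to and the activity of the neuron it comes from, which is what makes the rule local.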
However, closer examination of the
weight dynamics suggests a possible
resolution. For the two opposing
synapses, the update rule is essentially
identical, differing only in which
neuron is presynaptic and which is
postsynaptic. Consequently, feedback and
feedforward synapses, which should
theoretically match, may independently
converge to similar values through
similar update processes. In this way,
the very physiology of the update
naturally mitigates the weight transport
problem. I should note that in real
models though there is a nonlinear
activation function which we have been
sweeping under the rug. When these
nonlinearities are included, the updates
for the two synapses are not
mathematically identical. Fortunately,
research suggests that perfect symmetry
may not be essential. Even when
feedforward and feedback synapses learn
independently with slightly different
update rules, the approximate symmetry
that emerges is sufficient for the
network to function effectively. This
learning rule integrates seamlessly with
the activity dynamics we derived
earlier. As neural activities settle to
minimize prediction errors for specific
inputs, the weights simultaneously adapt
to encode statistical patterns across many
experiences. Together, these processes
enable the network to continuously
refine its internal model, closely
mimicking how biological neural circuits learn.
Let's now put everything together and
see how this framework operates as a
complete system. If we allow the network
to freely adjust every parameter, both
neural activities and the weights, it
would naturally settle to a zero energy
state. However, this solution would be
trivial and not perform any meaningful
computation. In practical
implementations of predictive coding and
likely in the brain itself, certain
neurons are kind of clamped to specific
values. The bottommost layer, for
example, cannot vary freely since those
neurons are directly driven by sensory
input. This constraint forces the
network to find an optimal compromise.
When presented with a training example,
the network undergoes an iterative
relaxation process. Neural activities
and weights adjust according to our
local update rules until reaching an
equilibrium configuration, an energy
minimum that encodes information about
the training example within the network
structure. Repeating this process across
diverse examples gradually refines the
network's internal model of the world.
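The whole loop can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not a faithful implementation: the network has two clamped sensory neurons and one free neuron above them, there is no activation function, and the weights take one step after the activity has settled rather than changing continuously alongside it. The data (points along the direction [1, 2]), the step sizes, and all names are hypothetical.

```python
def relax(x, w, steps=50, dt=0.5):
    """Settle the single top-layer activity z while the bottom layer is clamped to x."""
    z = 0.0
    for _ in range(steps):
        errs = [x_k - w_k * z for x_k, w_k in zip(x, w)]
        # The top neuron has no prediction from above in this sketch,
        # so only the bottom-up error term drives it.
        z += dt * sum(w_k * e for w_k, e in zip(w, errs))
    return z, [x_k - w_k * z for x_k, w_k in zip(x, w)]


def dataset_energy(data, w):
    """Energy left over after relaxation, summed across all examples."""
    return sum(0.5 * sum(e * e for e in relax(x, w)[1]) for x in data)


def train(data, w, epochs=30, lr=0.05):
    w = list(w)
    for _ in range(epochs):
        for x in data:
            z, errs = relax(x, w)  # activities settle first
            # Local weight rule: error below times activity above.
            w = [w_k + lr * e * z for w_k, e in zip(w, errs)]
    return w


# Sensory inputs that all lie along the direction [1, 2].
data = [[s, 2.0 * s] for s in (-1.0, -0.5, 0.5, 1.0)]
w_initial = [0.5, 0.5]
w_trained = train(data, w_initial)

print(dataset_energy(data, w_initial), dataset_energy(data, w_trained))
print(w_trained[1] / w_trained[0])  # approaches 2, the structure in the data
```

Because the bottom layer is clamped, the energy cannot be driven to zero by moving activities alone. The weights have to rotate until the top neuron can explain the inputs, and that is where the learned model lives.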
Through this process, the network
develops compressed representations of
data. This can be leveraged for
generative tasks when we unclamp the
output layer, freeze the weights, and
let the network run to equilibrium to
synthesize new images consistent with
its learned model.
For supervised learning tasks like
classification, we also clamp the
topmost layer to the desired label,
allowing the network to discover optimal
input to output mappings encoded in its connection