YouTube transcript:
The Brain’s Learning Algorithm Isn’t Backpropagation

Summary

Core Theme

Predictive coding offers a biologically plausible alternative to backpropagation for artificial neural networks, explaining how brains might learn effectively by minimizing prediction errors in a continuous, local, and parallel manner.

Of all the mysteries that the human

brain presents, perhaps one stands above

the rest. How does it learn so

effectively? In the world of artificial

intelligence, scientists and engineers

have spent decades trying to replicate

the brain's learning mechanisms. Their

efforts led to backpropagation with

gradient descent, the workhorse

algorithm that powers virtually the

entire field of machine learning today.

Due to its remarkable success,

researchers began to speculate that

perhaps brains do something similar.

However, there is a fundamental problem.

The backpropagation algorithm

contradicts essential biological

principles of brain function, making its

exact implementation in neural tissue

virtually impossible.

In recent years, however, an alternative

algorithm called predictive coding has

emerged that is not only more aligned

with the brain's biological hardware,

but can sometimes work even better than

backpropagation itself. In this

video, we will build predictive coding

from first principles, explore what

issues of biological plausibility it

addresses and how it might inspire the

The fundamental challenge that

computational systems must solve is

called credit assignment. When you have

a system with numerous parameters like

connection weights between neurons that

can be adjusted to achieve a desired

output such as recognizing objects in an

image or executing appropriate actions,

how do you determine which parameters to

adjust and by how much? Artificial

neural networks solve this elegantly

through what's called automatic

differentiation. Because the entire

computation can be represented as a

mathematical function, computers use

calculus, particularly the chain rule of

derivatives, to calculate precisely how

each parameter should be nudged to

guarantee improvement in performance. If

you're interested in a deep step-by-step

derivation of how backpropagation

works, I've covered this in one of my

earlier videos. However, despite its

remarkable success in machine learning,

evidence suggests that the brain almost

certainly uses a different approach.

There are various reasons why

backpropagation doesn't map directly onto

neural hardware, but luckily most of

them have biologically plausible

workarounds. What is crucial for our

discussion today and why I'm extremely

excited about predictive coding is that

it addresses two fundamental constraints

that are absolutely incompatible with

neurophysiology and which are the biggest

reasons why brains cannot perform

backprop, namely lack of local autonomy and

discontinuous processing. Sounds

confusing. So let's unpack what this means.

Artificial networks operate in strictly

separated phases that alternate

sequentially. First, information flows

forward. Input propagates across layers

to the output, generating a prediction.

Next, this prediction is compared

against the desired outcome, calculating

an error. Then comes the crucial

backward pass. This error travels back

through the network layer by layer

determining precisely how each weight

should change to reduce future errors.

Finally, all weights update

simultaneously and the cycle repeats

with a new training example. For this

process to work, neurons must

essentially freeze their feedforward

activity values like taking snapshots of

activity and holding on to them while

error signals flow backward. But our

brains don't work like this. They don't

hit pause between thinking and learning.
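As a rough illustration of the phased procedure just described, here is a minimal Python sketch that trains a tiny two-layer linear network on a single example. The layer sizes, the learning rate, and the names `backprop_step`, `W1`, and `W2` are assumptions chosen for this illustration and do not come from the video. Note how the hidden activity `h` has to be held in memory until the backward pass has used it.

```python
import numpy as np

def backprop_step(W1, W2, x, target, lr=0.05):
    """One training step with strictly separated phases (linear two-layer net)."""
    # Phase 1: forward pass. The hidden activity h is stored ("frozen")
    # because the backward pass needs it later.
    h = W1 @ x
    y = W2 @ h

    # Phase 2: compare the prediction with the desired outcome.
    error = y - target
    loss = 0.5 * float(error @ error)

    # Phase 3: backward pass. The output error travels back layer by layer.
    grad_W2 = np.outer(error, h)
    delta_h = W2.T @ error            # error signal delivered to the hidden layer
    grad_W1 = np.outer(delta_h, x)

    # Phase 4: all weights update together, then the cycle repeats.
    return W1 - lr * grad_W1, W2 - lr * grad_W2, loss

# Hypothetical toy setup: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))
W2 = rng.normal(scale=0.5, size=(1, 3))
x, target = np.array([1.0, -1.0]), np.array([0.5])

losses = []
for _ in range(100):
    W1, W2, loss = backprop_step(W1, W2, x, target)
    losses.append(loss)
```

The four phases run in a fixed order inside `backprop_step`, and nothing in phase 3 can start before phase 1 has finished, which is the sequential structure the narration contrasts with biological tissue.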

Communication in biological tissue is

relatively slow compared to silicon

processors. If the brain followed the

backpropagation approach, it would have to

completely stop information processing

for hundreds of milliseconds before

performing the backward pass to update

connections. Imagine experiencing brief

blackouts every time you learn something new.

Instead, biological brains process

information and learn simultaneously in

a continuous stream. There is no

evidence for separate forward and

backward phases. Neurons receive,

process, and adapt to information in

parallel without pausing computation to

accommodate learning. The second major

issue with backpropagation is its

reliance on global coordination.

Not only must there exist some kind of

central controller to switch the entire

network between forward and backward

modes, but this information must

propagate in a precise temporal

sequence. Even if neurons could somehow

freeze their activity, they would need

to unfreeze in strict succession: you

cannot compute errors for a given neuron

before its downstream partners have

finished calculating their own errors.

Everything we know about brain

physiology suggests that such global

coordination is extremely unlikely to

exist. While there are some coordinating

mechanisms, such as oscillations like theta and

gamma rhythms, attentional systems, and

neuromodulators like dopamine that

influence broad populations, these

mechanisms operate at much coarser

temporal and spatial scales than would

be required for backpropagation, which

relies on cell-by-cell precision.

Instead, individual neurons and synapses

mostly function as autonomous agents,

modifying their states based solely on

information physically available at

their specific locations. The brain

operates as a massively parallel, locally

autonomous system where computation and

learning occur simultaneously

throughout the network in a distributed

manner without centralized control.

Now that we understand the limitations

of backpropagation in biological

systems, let's explore a promising

algorithm. This framework originated

from mid-20th-century research,

proposing that the brain's fundamental

objective is to predict incoming sensory

information. From an evolutionary

perspective, prediction enhances

survival by allowing organisms to

anticipate threats and interpret noisy

observations. There is also an

efficiency argument. Neural activity

demands considerable metabolic energy,

and a brain that can predict incoming

signals only needs to process unexpected

information, reducing the metabolic

burden of transmitting predictable and

thus redundant data. In this view, the

brain's primary task isn't simply

processing incoming stimuli, but

constructing an internal model that

explains sensory

inputs. When this model predicts

accurately, minimal additional

processing is required. When predictions

fail, the resulting prediction errors

signal that the internal model needs updating.

Predictive coding formalizes this

concept as a hierarchical system where

each neural layer attempts to predict

the activity of the layer below it. The

lowest level corresponds to raw sensory

input like pixels of an image while

higher levels encode increasingly

abstract features and categories that

enable effective prediction of the lower

level visual features. Although real

brains possess more complex

connectivity, including associative

connections between different

modalities, the simplified hierarchical

model captures the core

principles. Information flows

bidirectionally through this hierarchy.

Top-down connections carry predictions

from higher levels to lower levels,

while bottom-up connections carry

prediction errors, differences between

predictions and the actual activity.

This abstract description of information

flow will guide our derivation of how

interconnect. We'll approach our network

as a so-called energy-based model.

Essentially, this means associating each

possible network state with a single

number representing some form of

abstract energy. We can then derive

rules for how the system should evolve

to reduce this energy. This framework

parallels physical systems that

naturally progress towards minimum

energy states like a ball rolling

downhill to minimize gravitational

potential energy or proteins folding to

minimize atomic interaction energy.

Since the brain is also a physical

system, it too evolves towards states

that minimize some form of energy. In

predictive coding networks, this energy

relates to the total magnitude of errors

between predictions and reality. To

visualize it, consider the following

analogy. Imagine the network as an

assembly of movable parts, springs, and

connection rods where each neuron is a

node sliding on a post, its height

representing its activity level. On the

same post slides a platform

corresponding to its predicted activity,

determined by the neurons from the layer

above. A spring connects the neuron node

and the platform, and the tension of the

spring, proportional to its squared

length, contributes to the overall

energy. If the neuron's activity deviates

significantly from its predicted value

in either direction, the energy

increases. A neuron's activity can be

freely adjusted while its predicted

activity is determined by other neurons.

We can visualize it as rods connecting

neuron nodes on the layer above to the

platforms at the current level, positioned

at variable angles corresponding to

synaptic weights, which determine how

other neurons' activities influence the

prediction: the sum of activities from

all neurons in the layer above,

multiplied by the synaptic weights

connecting them. Note that typically

activities pass through a nonlinear

activation function like sigmoid or

relu, but I'm omitting it here for

simplicity. The prediction error for

each neuron then is the difference

between its actual and predicted

activity. And the total energy

representing the overall tension of all

springs sums the squared errors across

all neurons in each layer.
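The spring picture can be written compactly. The notation below is an assumption of this sketch rather than the video's own: superscripts index layers (layer 0 being the sensory input), and the factor of 1/2 is a convention added here so that differentiating a squared error gives exactly the error, whereas the narration only says that squared errors are summed. As in the narration, the nonlinear activation function is omitted.

```latex
% x_i^{(l)}      : activity of neuron i in layer l
% w_{ij}^{(l+1)} : weight from neuron j in layer l+1 to neuron i in layer l
\begin{align}
\mu_i^{(l)} &= \sum_j w_{ij}^{(l+1)}\, x_j^{(l+1)}
  && \text{(predicted activity: platform height)} \\
\epsilon_i^{(l)} &= x_i^{(l)} - \mu_i^{(l)}
  && \text{(prediction error: spring length)} \\
E &= \frac{1}{2} \sum_l \sum_i \bigl(\epsilon_i^{(l)}\bigr)^2
  && \text{(total energy: tension of all springs)}
\end{align}
```

The sum over layers runs over every layer that receives a prediction from the layer above it.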

The network's fundamental objective is

to minimize the total prediction error

by finding the optimal configuration of

neural activities and connection

weights. As we'll see shortly, when

presented with training examples, the

network settles into states that balance

these elements to represent input-output

relationships as accurately as possible.

So let's determine precisely how neural

activities and connection weights should

adjust to reduce this total energy. The

resulting mechanisms will align with

neurophysiology. During the system's

evolution, it effectively rolls downhill

on the energy surface defined in a

high-dimensional space where each

coordinate represents a parameter such

as neural activity or synaptic weight.

Mathematically, this downhill roll

corresponds to moving in the direction

of steepest descent opposite to what's

called the gradient of the function

where the gradient vector points in the

direction of steepest ascent and is

composed of derivatives with respect to

each parameter. Let's isolate a specific

neuron at layer L and determine how to

adjust its activity to lower the

energy. To find this derivative, let's

revisit our energy definition where we

sum over all posts and add up the

squared lengths of all springs. Since

the derivative of a sum equals the sum

of the derivatives, we can examine each

post individually and ask: if we slightly

adjust the node height x_i at layer

L, how would the tension at any post

change? Then we add up all these effects.

First of all, notice that this neuron

doesn't affect the tension at any spring

at layers upstream from L. So the

derivative of all those terms is zero.

Even within layer L itself, the only

spring directly affected is the one

connecting neuron I to its predicted

value. By differentiating the square of

the prediction error, we find that the

rate of change of this neuron's activity

is the negative of its prediction error.

This makes intuitive sense. When the

error epsilon is positive, meaning the

neuron's activity exceeds its

prediction, the spring wants to contract

and pull the value down towards the

prediction, creating the negative rate

of change. Conversely, if the value is

lower than predicted, the spring tension

drives the neuron's activity

upward. But there is additional

complexity to consider. When we adjust

the height of the node at layer L, beyond

affecting its own spring, it also

influences the predicted activities at

the layer below it. To compute the

complete derivative, we must account for

how a change in x_i affects these downstream

errors. Recall that the predicted

activity of a neuron is given by the

weighted sum of activities of upstream

neurons. So when we change x_i at

layer L for each neuron at the layer

below, it affects the predicted value

proportionally to the weight connecting

them. To compute the total derivative,

we need to add up the prediction errors

from the layer below scaled by the

connection weights and combine them with

our earlier result.
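Written out, the two contributions just described combine into a single update rule for the activity. The notation is an assumption of this sketch: the energy is taken as half the sum of squared prediction errors (the factor of 1/2 is a convention added here), and w_{ki}^{(l)} denotes the weight from neuron i in layer l to neuron k in layer l-1.

```latex
% \epsilon_i^{(l)}   = x_i^{(l)}   - \sum_j w_{ij}^{(l+1)} x_j^{(l+1)}  (error at neuron i's own spring)
% \epsilon_k^{(l-1)} = x_k^{(l-1)} - \sum_i w_{ki}^{(l)}   x_i^{(l)}    (errors in the layer below)
\begin{align}
\frac{\partial E}{\partial x_i^{(l)}}
  &= \epsilon_i^{(l)} - \sum_k w_{ki}^{(l)}\, \epsilon_k^{(l-1)} \\
\frac{d x_i^{(l)}}{dt}
  &= -\frac{\partial E}{\partial x_i^{(l)}}
   = -\epsilon_i^{(l)} + \sum_k w_{ki}^{(l)}\, \epsilon_k^{(l-1)}
\end{align}
```

The first term is the pull of the neuron's own spring, and the second collects the errors of the layer below, each scaled by the weight that couples it to x_i.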

Notice that for some downstream neuron,

if its activity is larger than its

predicted value, to reduce the tension

in the spring, we need to increase the

prediction by moving the platform up,

which can be done by moving the neuron

at the layer above up as well if the

weight coupling them is positive.

Conversely, if the prediction error is

negative, tension can be decreased by

lowering the predicted value through

decreasing the activity of the upstream

neuron. This elegant equation tells us

something profound about neural

dynamics. Activity is adjusted trying to

find a compromise between two competing

influences. The first term drives the

neuron to align with its top-down

prediction while the second term

encourages it to better predict the

layer below. When these forces settle

into balance, the neuron has found its

optimal activity level, one that

minimizes prediction errors both at its

own layer and the layer it helps to

predict. But before we move to adjusting

the weights, let's translate these

update rules from abstract springs and

neurons. Notice that each neuron must

receive its own prediction error as

input with a negative sign. Earlier we

treated this error as a kind of abstract

subtraction, but this comparison must

physically occur somewhere. We need a

mechanism to store the prediction error

so it can drive the activity

changes. This is the fundamental insight

of predictive coding. We need a separate

population of neurons explicitly

encoding prediction errors. In fact,

this is the origin of the term

predictive coding: neurons forming a

code that represents prediction errors

rather than the signals themselves. In our

framework, within each layer, we can

imagine that alongside each

representational neuron x_i, which

encodes predictions passed to the layer

below, there exists a dedicated error

neuron, a biological counterpart that

encodes the deviation of x_i from its

predicted value. With this structure in

mind, we can directly read off the

required neural connectivity from our

update rule. A representational neuron

x_i must be inhibited by its

corresponding error neuron and excited

by error neurons sending feedback

signals from the layer below. This

elegantly maps our mathematical

formulation onto biological

circuitry. Now we need to determine what

drives the error neurons themselves. By

definition, error neurons function as

comparators, calculating the difference

between the activity of x_i and its

predicted value which is given by the

weighted combination of activities from

the layer above. This equation reveals

another set of required connections.

Error neurons receive excitatory input

from their partner representational

neurons within the same layer and

inhibitory input from neurons in the

layer above that communicate

predictions. Perfect. Now we have two

distinct populations of neurons with

specific excitatory and inhibitory

connections between them. When allowed

to unfold according to its own intrinsic

dynamics, this network will settle into

an equilibrium which minimizes

prediction errors across all layers. But

everything we have discussed so far

assumes fixed connection weights. To

complete our model, we need to endow it

with learning capabilities. Like neural activities,

synaptic weights are also movable parts

in our system that evolve towards

configurations minimizing the total

energy. For a weight connecting neuron I

in layer L to neuron K in layer L minus

one, we can derive an update rule that

decreases the total energy by taking

steps opposite to the gradient

direction. Since our energy function

sums all squared prediction errors

across the entire network when we change

the weight coupling those two neurons,

the only term that is affected is the

prediction error at the postsynaptic

neuron. The derivative equals the

negative of this prediction error

multiplied by the presynaptic neuron's

activity. This gives us an elegant

update rule where weight changes are

proportional to the product of the two

activities. This rule strikingly

resembles Hebbian plasticity in

neuroscience. Neurons that fire together

wire together. However, translating this

rule to biological neural connectivity

reveals a challenge. Predictions flow

from top to bottom with the

representational neuron I connecting to

the neuron K at the layer below. When

prediction errors flow upward from this

error neuron back to neuron I, our

derivation requires using the same

synaptic weight. But in biological

networks, these are physically distinct

synapses, and maintaining the perfect

symmetry would require instantaneous

communication between them, a phenomenon

not observed in the brain. This

so-called weight transport problem

affects both backpropagation and

predictive coding.
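Returning briefly to the weight update rule derived a moment ago (weight change proportional to the postsynaptic prediction error times the presynaptic activity), the following Python sketch checks it numerically for a single pair of layers. It assumes linear predictions and an energy equal to half the summed squared errors. The layer sizes, the learning rate, and the names `energy` and `hebbian_update` are hypothetical choices made for this illustration.

```python
import numpy as np

def energy(W, x_upper, x_lower):
    """Half the summed squared prediction error for one pair of layers."""
    eps = x_lower - W @ x_upper          # activity of the lower layer's error neurons
    return 0.5 * float(eps @ eps)

def hebbian_update(W, x_upper, x_lower, lr=0.05):
    """Change each weight by lr * (postsynaptic error) * (presynaptic activity)."""
    eps = x_lower - W @ x_upper
    return W + lr * np.outer(eps, x_upper)

# Hypothetical sizes: 3 neurons in the upper layer predict 4 in the lower one.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
x_upper = rng.normal(size=3)
x_lower = rng.normal(size=4)

# Analytic gradient from the derivation: dE/dW[k, i] = -eps[k] * x_upper[i].
eps = x_lower - W @ x_upper
analytic = -np.outer(eps, x_upper)

# Central finite differences as an independent check of that formula.
numeric = np.zeros_like(W)
h = 1e-6
for k in range(W.shape[0]):
    for i in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[k, i] += h
        W_minus[k, i] -= h
        numeric[k, i] = (energy(W_plus, x_upper, x_lower)
                         - energy(W_minus, x_upper, x_lower)) / (2 * h)

energy_before = energy(W, x_upper, x_lower)
energy_after = energy(hebbian_update(W, x_upper, x_lower), x_upper, x_lower)
```

Each entry of the update uses only two quantities available at that synapse, the error of the neuron it projects to and the activity of the neuron it comes from, which is what makes the rule local.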

However, closer examination of the

weight dynamics suggests a possible

resolution. For the two opposing

synapses, the update rule is essentially

identical, differing only in which

neuron is presynaptic and which is

postsynaptic. Consequently, feedback and

feedforward synapses, which should

theoretically match, may independently

converge to similar values through

similar update processes. In this way,

the very physiology of the update

naturally mitigates the weight transport

problem. I should note that in real

models, though, there is a nonlinear

activation function which we have been

sweeping under the rug. When these

nonlinearities are included, the updates

for the two synapses are not

mathematically identical. Fortunately,

research suggests that perfect symmetry

may not be essential. Even when

feedforward and feedback synapses learn

independently with slightly different

update rules, the approximate symmetry

that emerges is sufficient for the

network to function effectively. This

learning rule integrates seamlessly with

the activity dynamics we derived

earlier. As neural activities settle to

minimize prediction errors for specific

inputs, the weights simultaneously adapt

to encode statistical patterns across many

experiences. Together, these processes

enable the network to continuously

refine its internal model, closely

mimicking how biological neural circuits

Let's now put everything together and

see how this framework operates as a

complete system. If we allow the network

to freely adjust every parameter, both

neural activities and the weights, it

would naturally settle to a zero-energy

state. However, this solution would be

trivial and not perform any meaningful

computation. In practical

implementations of predictive coding and

likely in the brain itself, certain

neurons are kind of clamped to specific

values. The bottommost layer, for

example, cannot vary freely since those

neurons are directly driven by sensory

input. This constraint forces the

network to find an optimal compromise.

When presented with a training example,

the network undergoes an iterative

relaxation process. Neural activities

and weights adjust according to our

local update rules until reaching an

equilibrium configuration, an energy

minimum that encodes information about

the training example within the network

structure. Repeating this process across

diverse examples gradually refines the

network's internal model of the world.

Through this process, the network

develops compressed representations of

data. This can be leveraged for

generative tasks when we unclamp the

output layer, freeze the weights, and

let the network run to equilibrium to

synthesize new images consistent with

its learned model.
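The relaxation process described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: three linear layers, the bottom layer clamped to a made-up input, and the weights held fixed so that only the activities move. The function names, layer sizes, step size, and random weights are hypothetical choices for illustration only.

```python
import numpy as np

def energy(x0, x1, x2, W1, W2):
    """Half the summed squared prediction errors of the two predicted layers."""
    eps0 = x0 - W1 @ x1      # error at the clamped bottom layer
    eps1 = x1 - W2 @ x2      # error at the middle layer
    return 0.5 * float(eps0 @ eps0 + eps1 @ eps1)

def relax(x0, W1, W2, steps=500, dt=0.05):
    """Let the free activities settle while x0 stays clamped and weights stay fixed."""
    x1 = np.zeros(W1.shape[1])
    x2 = np.zeros(W2.shape[1])
    history = [energy(x0, x1, x2, W1, W2)]
    for _ in range(steps):
        eps0 = x0 - W1 @ x1
        eps1 = x1 - W2 @ x2
        # Middle layer: inhibited by its own error, excited by errors from below.
        dx1 = -eps1 + W1.T @ eps0
        # Top layer: receives no prediction from above, so only errors from below act.
        dx2 = W2.T @ eps1
        x1 = x1 + dt * dx1
        x2 = x2 + dt * dx2
        history.append(energy(x0, x1, x2, W1, W2))
    return x1, x2, history

# Hypothetical network: 4 input units, 3 middle units, 2 top units.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(3, 2))
x0 = rng.normal(size=4)      # stands in for a clamped sensory input

x1, x2, history = relax(x0, W1, W2)
```

Both layers are updated in the same step from quantities available at that layer, with no separate forward or backward phase. `history` records the energy after every step, so the settling can be inspected directly.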

For supervised learning tasks like

classification, we also clamp the

topmost layer to the desired label,

allowing the network to discover optimal

input-to-output mappings encoded in its connection

weights. When classifying new inputs, we

simply freeze the weights, let the

system settle into equilibrium, and read

off the label from the equilibrium

activity of neurons at the top layer.
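Here is a deliberately small Python sketch of the supervised procedure just described. To keep it short it assumes only two linear layers: a top layer that holds a one-hot label and a bottom layer that holds the input, with no hidden layer in between. With both layers clamped during training there are no free activities left to relax, so only the weights move. The toy data, the prototypes, and the names `sample`, `train`, and `classify` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two classes, each a noisy copy of a fixed prototype.
prototypes = np.array([[1.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 1.0]])

def sample(n):
    labels = rng.integers(0, 2, size=n)
    inputs = prototypes[labels] + 0.1 * rng.normal(size=(n, 4))
    return inputs, labels

def train(inputs, labels, lr=0.1):
    """Bottom layer clamped to the input, top layer clamped to a one-hot label."""
    W = np.zeros((4, 2))                  # top-down weights: label layer -> input layer
    for x0, label in zip(inputs, labels):
        y = np.eye(2)[label]
        eps0 = x0 - W @ y                 # prediction error at the input layer
        W += lr * np.outer(eps0, y)       # local rule: error times presynaptic activity
    return W

def classify(x0, W, steps=200, dt=0.1):
    """Weights frozen, top layer free: settle, then read off the label."""
    y = np.zeros(W.shape[1])
    for _ in range(steps):
        eps0 = x0 - W @ y
        y = y + dt * (W.T @ eps0)         # top layer is driven only by errors from below
    return int(np.argmax(y))

train_x, train_y = sample(400)
W = train(train_x, train_y)

test_x, test_y = sample(100)
predictions = np.array([classify(x, W) for x in test_x])
accuracy = float(np.mean(predictions == test_y))
```

In a deeper network the hidden layers would stay free during training and settle between the two clamped ends before the weights are updated. This sketch leaves that step out.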

The key advantage of predictive coding

lies in its locality. While in

backpropagation all adjustments serve the

single goal of minimizing global output

error, which must be transmitted

throughout the entire network, in

predictive coding each neuron and

synapse only responds to local

prediction errors: how much a given

layer deviates from its own prediction

and how well it predicts its neighbor.

This biological plausibility and

accordance with neurophysiological data

such as observed plasticity rules

suggest that predictive coding might

very well be the key to understanding

how our own brains learn so effectively.

We can bring those insights into

artificial intelligence as well. The

local autonomy makes the algorithm

extremely parallelizable and in certain

settings more efficient than backpropagation.

Theoretical considerations suggest that

resulting updates may actually lead to

better solutions than backpropagation.

While backprop focuses solely on the

overall output loss, potentially

overriding previously learned

information, a phenomenon known as

catastrophic forgetting, predictive

coding's local update rules better

preserve existing knowledge structure.

To wrap up, let's summarize what we have

explored today. By framing inference and

learning as an energy minimization

problem where each layer predicts the

activity of the layer below, we have

derived an algorithm that operates with

complete local autonomy. Unlike

backprop, which requires global coordination

and separate phases for computation and

learning, predictive coding emerges as a

continuous parallel process where

neurons simultaneously predict, compare

and adapt. This approach not only aligns

with the biological constraints of

neural tissue but potentially offers

computational advantages for artificial models.

As neuroscience and artificial

intelligence continue to inform each

other, predictive coding stands as a

compelling bridge between the remarkable

learning capabilities of biological

brains and the next generation of neural network

architectures. Speaking of efficient

learning, if you would like to gain a

deeper understanding of the foundational

concepts behind today's ideas, you're

going to love today's sponsor, brilliant.org.

Brilliant helps you master STEM topics

by combining interactive visualizations

with hands-on problem solving. Their

engaging courses allow you to learn by

doing and build intuition, breaking down

challenging concepts into bite-sized

lessons. Especially relevant to this

video is their course titled

Introduction to Neural Networks, which

builds up from the definition of an

artificial neuron to hidden layers and

activation functions, giving you

practical experience with the building

blocks we discussed. Brilliant offers a

great collection of courses across

mathematics, physics, and computer

science. Whether you are a beginner

building core knowledge or an expert

exploring new domains, Brilliant has

something for everyone.

If you're ready to take your learning to

the next level, head to

brilliant.org/ardamcursenov to get a

30-day free trial of everything

Brilliant has to offer, plus a 20%

discount on an annual subscription. If you

like the video, share it with your

friends, subscribe to the channel if you

haven't already, and press the like button.

Stay tuned for more neuroscience and
