Neurosymbolic 80M AI from Princeton beats GPT | Discover AI
Video Summary

Core Theme
A new AI architecture called GraphMERT from Princeton University, utilizing an encoder-only transformer with only 80 million parameters, demonstrates superior performance in building reliable knowledge graphs compared to large, decoder-based LLMs like GPT-5, potentially revolutionizing AI development by reducing reliance on massive models and complex infrastructure.

Video Transcript
Hello community. So great that you are back. Today we are going to talk about a new AI model, with just 80 million trainable parameters, that might eradicate a GPT-5 system and maybe OpenAI. So let's start now. Hello and welcome to the channel Discover AI, where we look at the latest AI technology and the latest research. In part one we asked: for superintelligence, is neurosymbolic AI really the way to go? Can you trust neurosymbolic AI for superintelligence? And we found out that, whatever else we need, any AI system of the future needs massive domain-specific knowledge graphs to represent new knowledge to the syntactic GPT systems, and by GPT I mean everything from a GPT-5 to a Claude system.
So you know what, let's do this now. Let's build a knowledge graph for a domain, and let's have a look at a new technology. Luckily for our domain, let's say we take the medical domain: when we build a knowledge graph, we already encounter in medicine a Unified Medical Language System (UMLS) for AI, which is gorgeous. So we can build a triple for our knowledge graph, and our triple is simple. We have a head: chronic kidney disease. Then we have the relation from the head to the tail, and the relation is "has finding site". And then the tail, and the tail for chronic kidney disease is of course the kidney structure. And all of this is, in the US edition of the terminology, beautifully normed.
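As a rough sketch (not from the paper), such a triple can be represented as a simple (head, relation, tail) record; the identifiers below are illustrative only.

```python
# Minimal sketch of a medical knowledge-graph triple as (head, relation, tail).
# The relation name "has_finding_site" follows the UMLS-style wording used above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str       # e.g. "chronic kidney disease"
    relation: str   # e.g. "has_finding_site"
    tail: str       # e.g. "kidney structure"

seed_triple = Triple(
    head="chronic kidney disease",
    relation="has_finding_site",
    tail="kidney structure",
)
print(seed_triple)
```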
And now we get a little bit mean. We say, you know what, we want to trick the system. So we manually create a sequence that implies a much weaker connection in our medical domain, and we say, hey, maybe we have a triplet where chronic kidney disease is "associated with", a much weaker link, a cerebellar gray matter structure. Okay, let's have a look at this.
So now they said, okay, let's have a look at the best LLMs on this planet. Let's have a look at what we can do, and we don't care if it's open source, if it's proprietary, if we have to pay; let's go with the best models there are. Let's build a knowledge graph with an LLM. So, what is the task? The task is: hey, please complete the following medical knowledge graph triple. You have "chronic kidney disease has finding site", and then I give you the content, the information, the answer, in the text. So, no problem at all, yeah?

Gemini 2.5 Pro goes and says: "chronic kidney disease has finding site cerebellar gray matter." This is the wrong answer. This is the weakest link. This is an error. And I say, okay, it was just Gemini. GPT-5 makes it "chronic kidney disease has finding site cerebellar gray matter." This is again wrong. GPT-5 cannot build a knowledge graph from three sentences. Grok 4, come on, Grok 4: "chronic kidney disease has finding site cerebellar gray matter." Grok 4 fails to build a triplet for a knowledge graph from just three or four lines of text.
But you know, there's one system left. There's one left, and yes, of course, it is Claude Sonnet 4.5. And you know what? Yes: the finding site is the kidney. Congratulations. We have it. Just wait a minute. No, wait a minute. Where did the word "pediatric" come from? This is nowhere in the source text. What is this? Now, if you look closely, you see this is a classic LLM hallucination. This is just something that the LLM hallucinated; it said, you know what, I invent a new term. Great. And now I ask you: in medicine, hallucinations? I don't think they are acceptable in any way or form.
So here we come now to the paper of today, from Princeton University, and it has an absolutely innocent title. Look at this: "Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data." Who would ever guess that this is igniting an AI revolution? GraphMERT, October 10, 2025.
And they say, you know, we have a solution. We build now a graphical multidirectional encoder representation from the transformer architecture, and this is a tiny, tiny little AI system, not with billions of parameters, but with just 80 million parameters, an encoder-only transformer architecture. But you know what? We can build a model-level neurosymbolic stack. We have in one model the neural learning and the symbolic reasoning, integrated in a very clever way. So we have an efficient GraphMERT encoder-only transformer that learns and distills the complex syntactic-to-semantic abstraction from a high-quality, domain-specific text corpus, like arXiv or bioRxiv or medRxiv, and then we have the complete reasoning, where we need a knowledge graph, and we bring it together into one system.
So GraphMERT, you might say, this sounds familiar. Yes, of course. Remember, as I showed you, Harvard built a medical LLM two or three months ago, but as I showed you in part one of this video, that is not able to find medical solutions. You need to have access to an object external to the LLM, a knowledge graph, where you have a beautiful representation of all the medical knowledge of this world. And then the problem is how to build the transfer and everything.
And you say, MERT, I know something: two, three years ago we were talking about BERT by Google, and BERT was a Bidirectional Encoder Representation from Transformers. What a coincidence. Now, guess what? It is not a coincidence, because you remember, two years ago, in January 2023, I showed you a video where we looked at the complete T5 representation of the transformer architecture by Google. And then we said, you know what, the first half, the encoder part, we call BERT, and the second half, the decoder part, we call GPT. And then something happened. GPT took off. It was amazing, it was a stellar flight. Everybody was talking about GPT, and almost everybody forgot the BERT system. And now that we have hit a wall with our GPT systems, that we fail in the logical reasoning of GPT systems, suddenly Princeton comes and says: hmm, did we miss something? Did we miss an opportunity?
And of course GPT-3.5, at the time, in January 2023, was the most important, most amazing object around. But at the same time there was also RoBERTa, and RoBERTa was a Robustly Optimized BERT Approach. They had masked language modeling with dynamic masking, they dropped some complexity, they developed a very specific tokenizer, a byte-pair-encoding tokenizer, they used larger mini-batches, and they said, you know what, BERT was completely undertrained. So RoBERTa was developed, and you know what? RoBERTa is the base for MERT. Such an old grandpa system, one I covered back in January 2023, is now the hottest new AI model in town on this planet.
Now, if you want to learn about BERT, SBERT, the sentence transformer architecture, if you want to build this from scratch, in a transformer, in Keras, in PyTorch, I have 40 videos on this particular playlist, and you see two, three, four years ago I had tons of videos on this, and we built everything on this. And you know what? Princeton now asks: hey, can we do superintelligence without any GPT part of the transformer? Do we need OpenAI? Do we need Anthropic and their models? Do we need the Meta models, the Llama models? What if we have a different technology that we forgot about? And what if we examine now the options of this forgotten technology?
So, superintelligence without OpenAI, without Anthropic, without the main companies that carry the GDP growth of the American dream. Well, it is rather simple: you just switch from the decoder part of the transformer architecture to the encoder part. That's all there is. Encoder, decoder. We have the next-token probability prediction with GPT-5 and all those models, and now we discover we don't need this. We don't need the complexity, we don't need the model size, and maybe we don't need the compute centers for this, because there's another technology available.
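To make the decoder versus encoder difference concrete, here is a tiny sketch of the only structural point made above: a GPT-style decoder applies a causal mask (each token sees only the past), while a BERT/RoBERTa-style encoder attends bidirectionally. This is an illustration, not code from the paper.

```python
# Causal (decoder) vs. bidirectional (encoder) attention masks.
import torch

seq_len = 6

# Decoder: lower-triangular causal mask, token i attends only to tokens 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder: full mask, every token attends to every other token simultaneously.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print("causal:\n", causal_mask.int())
print("bidirectional:\n", bidirectional_mask.int())
```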
And Princeton shows us in this beautiful paper, 70 pages, pure enjoyment, how to encode a knowledge graph in the encoder part of a transformer architecture. And you know what? Without any GPT system. We don't need OpenAI anymore, we don't need GPT systems anymore, we don't need Claude systems anymore, we don't need Grok systems anymore. There is an 80-million-parameter system that is outperforming all those huge LLMs. Isn't this interesting? So, our classical BERT and GPT: GPT mimics the surface-level linguistic patterns of a text, and now BERT is where we really can learn and respect the deep semantic and ontological rules of a medical domain.
Now, if you have a knowledge graph like this, it's simple; it has been known for quite some time. And now the question is simple: how do you make a graph look like a sentence, so that a transformer, but careful, we are talking now about the encoder part, like a RoBERTa, not a GPT, can understand it? So we will now develop a different technology. I mean, Princeton shows us that there's a way: GraphMERT, a graphical multi-directional encoder representation from transformers. A very familiar idea, but something whose possibilities we never continued to explore.
And they tell us: if we do this, we have a compact GraphMERT system with just 80 million parameters that completely eliminates the need for pre-training on large unverified text, making the approach much more practical than employing expensive LLMs like GPT or Claude or Grok with billions or trillions of parameters. And this is what I call an AI revolution. And it can be scaled and provided with more data, if you want to combine, let's say, the 80 million parameters for the pure medical domain with pharmacological domain knowledge. Then you need a little bit of extra compute resources, yeah, but for an 80-million model, not a billion-parameter model.
And they said, you know, let's do this. Come on, let's have fun. We train GraphMERT with 12 hidden layers in a RoBERTa architecture, with eight attention heads, with a hidden size of 512 and an intermediate size of the fully connected layer of 2K, totaling 79.7 million trainable parameters. We use a specific tokenizer, the BiomedBERT tokenizer, that was especially trained on biomedical text, on a vast amount of medical vocabulary. Beautiful. And the tokenizer vocabulary is relatively small, with 30k entries. So our vocabulary size is not as huge as you would expect.
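As a hedged sketch, the RoBERTa-style backbone dimensions mentioned here can be written down as a Hugging Face RobertaConfig. Note that GraphMERT's graph-specific parts (leaf and relation embeddings, the hierarchical graph attention) are not captured by a vanilla RobertaConfig, so this is only the text backbone, not the full model.

```python
# Backbone dimensions as described in the video, expressed as a RobertaConfig.
from transformers import RobertaConfig

backbone_config = RobertaConfig(
    vocab_size=30_000,        # small biomedical tokenizer vocabulary (~30k)
    num_hidden_layers=12,     # 12 transformer layers
    num_attention_heads=8,    # 8 attention heads
    hidden_size=512,          # hidden size 512
    intermediate_size=2048,   # "2K" fully connected intermediate size
)
print(backbone_config)
```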
So let's build this. Hey, wait a minute: if we build this, we need a training data set. This is a BERT system, I mean a MERT system, I mean a GraphMERT system. So let's do this on a subdomain like diabetes. Diabetes research is really important currently in the US and globally.
Let's build a training data set for our MERT system. So what do we need, what do we have? We have medical papers, unbelievable amounts, MEDLINE journals, whatever. You have 350,000 abstracts for training and 40,000 abstracts for evaluation, so we have hundreds of millions of tokens of pure textual description in medical technology papers. Great. And then they said, let's make it easy: given this starting condition, we want a nucleus where everything starts from. So they said, let's build a seed knowledge graph, a tiny little knowledge graph that is the original core from which everything else will develop. And of course you take the UMLS Metathesaurus and everything, beautiful, and you have clinical documentation, molecular biology and data exchange. Great.
So then they said, okay, we now have the syntactic and the semantic sources. The syntactic data sources, those are our scientific papers, the textual papers. What we need to do now is just find the entities, the medical entities. They call it the head, the head discovery. Now, unfortunately, they said, you know what, we have an LLM ready for this, on the shelf, so why should we build something? So we take an LLM, which turns out to be not really the best idea, because we want to break loose from LLMs. But okay, they said, just for the discovery, not for the reasoning, just for the discovery of the main technical terms in medicine, we use a Qwen 32B model, asking it to search for medical entities that are relevant for diabetes or whatever. Now, you could go to a medical database and just extract them, but yeah, it's so comfortable just to use an AI system. Okay.
And then, as I told you, we need a seed knowledge graph, a small expert-created set. Maybe you can do this with a human-created set; that would be perfect. So you have a ground-truth knowledge graph, you build on this and let it grow, and you have your typical triplets, your head, your relation, your tail entities, and you just have to provide the initial starting point, the semantic examples and the ontological constraints. And they said, okay, we didn't go with a full primary knowledge graph for medicine or diabetes; we just took from the literature 28k triplets for diabetes. This is available, so let's take it. Great.
So what do they have? They have the syntactic data sources; these are, I don't know, 100,000 preprints in the medical domain, all the beautiful papers that were published in the last two, three years on whatever topic, and you discover the entities, the medical terms. And then you want the semantic data structure. So you start and you say, okay, similarity matching with the text: you have the triplets, you develop the matched triplets, you can inject this, and then you arrive at a seed knowledge graph. Now, this seed knowledge graph has a very particular format; they call it leafy chain graphs. So let me explain what it is and why we do it.
What we can already see: from the text of the research papers, from the textual discovery of the entities, you see the technical terms in orange on our graph, and then, if you have the different tails to a technical term, they are shown in, let's call it, blue, and you have relations to them. So again, let's come back: how do you make a knowledge graph? If we start with a seed knowledge graph, this is the nucleus that everything condenses onto and builds up from. So how can we make this look like a sentence for a transformer? But careful, not a GPT transformer, an encoder transformer like RoBERTa. We don't need any GPT in this.
So, as I told you, the leafy chain graph encoding, a new methodology invented by Princeton, or maybe further developed by Princeton, is an elegant answer. What they do, from a methodological point of view, is they just flatten it. They flatten both the syntactic information, meaning the information about how a sentence is built from words, and the semantic information, which is simply the knowledge graph triplets, into a single unified sequential format that the transformer, again the encoder part, the RoBERTa part, can process. Simple, no? You just have to have the idea.
So, just to be sure: bidirectional context is essential here. Our BERT system is the one that outperforms a GPT decoder-only system, because a GPT system can only look at the past tokens. But you want a system that can look at tokens in both directions, in multiple directions, and see every other token in the input sequence simultaneously. And of course, think back three years to how we did this with BERT: we have masked elements, a masked tail, and the model needs to understand the full context of the head entity and then predict the masked tails.
We solved these problems three, five years ago, and now we just apply them again in a slightly more complex structure. So the leafy chain graph has a distinct backbone; this is the chain itself, the chain of the root nodes, embedded in the syntactic space. The branches are the leaf nodes, and they are embedded in the semantic space. And now, guess what: this leafy chain graph has a crucial connection, grafting a triple onto the graph and combining both mathematical spaces.
So here we see it in a flowchart. Again, we have the semantic source, our seed knowledge graph with the triples that a human maybe had as an original starting point, and then we have all the 100,000 medical papers that exist somewhere in the world. And you see, we identified the main medical terms for diabetes research, those are in orange, and the rest is just padded, that's it. From the seed knowledge graph, from the triplets, we know our tail structures and we have the relations. So now we combine those: we build our chain graphs that have the syntactic information and the semantic information combined.
And we train now, with a training data set, the GraphMERT system in more or less the same way we trained our SBERT systems. And once it is trained, guess what? We use this AI system to predict the future, to predict new research, to predict new complexities, to predict new patterns that can emerge, to predict new triplets. And now, again unfortunately, they went back and said: we just need an LLM as a helper to help us sort this. And then we have a new, extended, extracted knowledge graph with all the knowledge from our papers now in the knowledge graph. So we build up the knowledge graph. This is beautiful. Unfortunately, again, they went with an LLM.
Let's have a look at this in a little more detail. So we have, let's say, the PubMed papers, the training data set. You have the sequences; all the leaves are empty. You start only with your primary medical term for diabetes in orange. You choose your head for the triple to be predicted. And then you have a sequence with one masked leaf; you just mask, like in BERT, you're familiar with this. Then you predict the masked leaf tokens, the top k, the top five, the top ten, with the trained GraphMERT model, and you get the top-k tokens for the tail. Those are the tail token candidates, whatever the medical terms are, and then you just have it. So you build up your knowledge graph with new knowledge.
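A quick sketch of that prediction step, under the assumption of a standard masked-LM head: given the logits the trained model produces at the masked leaf position, you simply take the top-k token candidates for the tail. The logits here are random stand-ins, not GraphMERT output.

```python
# Top-k tail-token candidates from the logits at a masked leaf position.
import torch

vocab_size = 30_000
torch.manual_seed(0)

# Stand-in for "logits at the masked leaf position" from the trained model.
masked_leaf_logits = torch.randn(vocab_size)

top_k = 5
probs = torch.softmax(masked_leaf_logits, dim=-1)
top_probs, top_token_ids = torch.topk(probs, k=top_k)

# In the real pipeline these ids would be decoded with the biomedical tokenizer
# and become the tail-token candidates for the open triple.
for rank, (p, tok_id) in enumerate(zip(top_probs.tolist(), top_token_ids.tolist()), 1):
    print(f"candidate {rank}: token_id={tok_id}, prob={p:.4f}")
```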
Now, interestingly, when they designed the GraphMERT pipeline, you have a sequence where I say, okay, diabetes has the finding site, and the tail element is open, and you have text from the papers, and here you have this particular text (never mind, it was chosen for a particular reason). So what do you do? Wait a minute, wait a minute. What you do, as we just went through, to train the GraphMERT system: unfortunately, you have a helper here. This is not a reasoning LLM, this is not a thinking LLM. This LLM is just helping with auxiliary data of secondary importance. We could substitute this LLM, remove it completely, but it is in the paper because it is, yeah, simple and nice, off the shelf. Great.
And then we have a sequence with a triplet and another sequence with a triple. And what the system does here: it just evaluates the similarity score, a cosine similarity in the vector space, between the triplets from the previous step and the sequence of origin, and only triplets with a score higher than a particular threshold will pass. So we have another filter established, and then we have our output, and the output is what we expected: diabetes has the finding site with the correct, ontology-consistent tail.
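Here is a small sketch of that embedding-based filter: keep a candidate triplet only if the cosine similarity between its embedding and the embedding of the source sequence exceeds a threshold. The embeddings and the threshold value are stand-ins for illustration.

```python
# Cosine-similarity filter over candidate triplets.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 512
threshold = 0.75  # illustrative value, not from the paper

source_sequence_emb = torch.randn(dim)
candidate_triplet_embs = torch.randn(4, dim)

similarities = F.cosine_similarity(
    candidate_triplet_embs, source_sequence_emb.unsqueeze(0), dim=-1
)
kept = [i for i, s in enumerate(similarities.tolist()) if s > threshold]
print("similarities:", [round(s, 3) for s in similarities.tolist()])
print("triplets passing the filter:", kept)
```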
Now, if you don't do this with the GraphMERT system, and you test instead what you can build with an LLM, the knowledge graph pipeline without GraphMERT, an LLM plus a knowledge graph the classical way, you know what we find? We find that the solution by this system is: diabetes has the finding site "urban area". Yes, an urban area was mentioned in the primary text given here; what a coincidence, the experiment was designed exactly for this. But you understand what happened: the LLM, the GPT system, misinterpreted the "has finding site" relation, treating the token "site" as a geographic location and thereby referring to "urban area" as the site instead of an anatomical structure, which results in an invalid triplet. It is violating the ontology of the medical domain.
You see, this is the difference between the old system and the new system. And they have a lot of additional explanation in those 70 pages. They have a ton of experimental data, but you need two days to really have a look at this paper.
Okay. Now, GraphMERT: you know, in a transformer we have an embedding layer and then we have our transformer layers. And the very first embedding layer is now also specific. Not only in the transformer layers do we have fused nodes with a scaling that depends on the distance in our vector space; we have it in the embedding layer as well. We now have to combine the syntactic space and the semantic space into a common space, and the question is, hey, does a common space exist at all on a mathematical level? So what we have now is an attention mechanism, and here of course a graphical attention mechanism, a hierarchical graph attention mechanism, that encodes our semantic triples in the knowledge graph, where they are leaves connected to the root nodes. And this attention mechanism on the graph uses the leaves, the relations and the head embeddings, resulting in a new, updated, fused node feature.
So, in the attention layers of the transformer architecture, the attention weights are multiplied by a function that decreases exponentially with the pairwise distance, and they encode the graph relations and the graph distances respectively. This is very similar to what you know from a RAG system, from a vector-space interpretation of cosine similarity, where closeness in the embedding neighborhood equals semantic similarity, but this is, you know, the advanced level.
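A minimal sketch of that idea, with made-up distances and decay rate: standard attention scores are damped by a factor that decays exponentially with the pairwise graph distance between nodes, then renormalized.

```python
# Attention weights damped by an exponential function of pairwise graph distance.
import torch

torch.manual_seed(0)
n, d = 4, 8
q = torch.randn(n, d)
k = torch.randn(n, d)

# Pairwise "graph distance" between the n nodes (illustrative values).
dist = torch.tensor([[0., 1., 2., 3.],
                     [1., 0., 1., 2.],
                     [2., 1., 0., 1.],
                     [3., 2., 1., 0.]])
alpha = 0.5  # assumed decay rate

scores = (q @ k.T) / d ** 0.5
attn = torch.softmax(scores, dim=-1)
attn = attn * torch.exp(-alpha * dist)          # exponential decay with distance
attn = attn / attn.sum(dim=-1, keepdim=True)    # renormalize each row
print(attn)
```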
This hierarchical graph attention mechanism is a key feature in GraphMERT for the training in general. For each injected triplet, the hierarchical graph attention (HGAT) performs relation-aware attention to encode the full triplet semantics into the leaf embeddings, thereby of course replacing the initial leaf token embeddings. So we update our relations, we update the integration of new knowledge into the knowledge graph, and this occurs in the input embedding layer, before the transformer processing, enabling backpropagation to the relations during the masked node modeling that we know from BERT. And this happens layer by layer in a hierarchical structure, hence the name hierarchical graph attention, to make sure the model captures the bigger picture of how facts connect.
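Here is a very rough conceptual sketch of my reading of that step, not the paper's actual HGAT: the leaf (tail) slot attends over the head and relation embeddings, and the fused result replaces the initial leaf token embedding before the transformer layers, so gradients can flow back into the relation embedding.

```python
# Conceptual sketch: fuse head and relation embeddings into the leaf embedding.
import torch
import torch.nn as nn

dim = 512

class LeafFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

    def forward(self, leaf_emb, head_emb, rel_emb):
        # The leaf token attends over [head, relation]; the fused output replaces
        # the initial leaf token embedding in the input embedding layer.
        query = leaf_emb.unsqueeze(1)                      # (B, 1, dim)
        context = torch.stack([head_emb, rel_emb], dim=1)  # (B, 2, dim)
        fused, _ = self.attn(query, context, context)
        return fused.squeeze(1)

fusion = LeafFusion(dim)
leaf, head, rel = (torch.randn(2, dim) for _ in range(3))
print(fusion(leaf, head, rel).shape)  # torch.Size([2, 512])
```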
Now new knowledge is built and integrated with the "old" knowledge, in quotation marks, which helps avoid mistakes and build a reliable knowledge graph you can trust, because you can immediately find the original sentence it came from, and you hope that there are no hallucinations at all. And here you have the complete flowchart.
This is the flowchart. We start with a training data set for our GraphMERT system. The technical terms are discovered, the entity discovery, unfortunately with an LLM. Then we have a similarity matching, and we have to have a seed knowledge graph. So we have the papers and the seed knowledge graph; those are our input data, if you want. And then we have the training of this beautiful new GraphMERT system, the encoder-only part of a transformer architecture. Once trained, it can predict your tails; GraphMERT's tail tokens are then combined, unfortunately again with another LLM, because it was convenient just to take an LLM and not to find a better solution without one. We have an embedding-based similarity check, and we end up with an expanded, grown knowledge graph with more knowledge. This is it.
Now, you see, the authors did use helper LLMs, not reasoning LLMs, just helper LLMs, for three auxiliary tasks. First, as I told you, discovering the head entities. Second, selecting the relations for the subsequent GraphMERT predictions. And third, combining the single-token predictions into meaningful, relation-aware tail phrases, into nice sentences that humans understand. But there's no logic in it.
So let's have a look at the mathematical spaces, and I know you have been waiting for this; of course, me too. So from GraphMERT we have the backbone. This is the chain of the root nodes, and it comes from the research papers, from PubMed or whatever medical text you have. This is the syntactic space itself.
We broke the text down into individual tokens. A token becomes a root node in the chain, and this part of the graph represents the pure syntactic space. It holds the original unstructured text, providing the grammatical and contextual foundation of any knowledge that will be extracted. Now, in the paper they chose a specific number: each input sequence is standardized to have a fixed number of root nodes, and they chose 128 tokens. This is it. Okay. And if a sentence is shorter, it's just padded with pad tokens, and if it's longer, it is truncated or split. Standard stuff. So let's have an example: the sentence "metformin treats type 2 diabetes" becomes a chain of root nodes: metformin, treats, type, 2, diabetes, and so on.
Then we have the branches, the leaves. Just to make this clear: the leaves live in a different mathematical space, in the semantic space. Each root node in the chain has a fixed number of leaf nodes attached to it. They decided to go with seven, just a choice; you could choose 15. Think of them as dedicated slots or containers holding semantic information for this specific root. So the leaves represent the semantic space. This is where the tail entities of the knowledge graph triplets are placed. And in the paper, each of the 128 root nodes has seven leaf nodes. This was enough for a simple diabetes medical domain.
So this creates a large, regular grid of 896 leaf tokens. And during training, of course, most of those leaves are empty or filled with a pad token, and only the leaves corresponding to a head entity in a known triplet are populated. So you could argue, well, this is a sparse graph. Yeah, but it is a beautiful graph. There is no hallucination yet, except where we use the LLMs three times, but more about this later. And now the crucial connection. Guess what? We bring it together: grafting your triplet onto the graph.
So the head: the head entity is a span of text found within the sentence, and therefore already exists in one or more root nodes. Let's say "metformin" is the first root node. This grounds the semantic fact directly to its textual origin; we know exactly where this information came from, from which original medical paper. The tail: the tail entity is placed onto the leaf nodes that are directly connected to the head's root nodes. For example, the tail token, say "type 2 diabetes", would be placed into one of the seven leaves associated with the metformin root node. And the relation acts as the type of connection, the edge attribute that we know from graph theory and mathematical physics.
Let's have an example: "Chronic kidney disease is a renal disorder." The seed triplet is: chronic kidney disease, has finding site, kidney structure. So what are the root nodes here? "Chronic kidney disease (CKD) is a renal disorder." One of the seven leaves for the root node "chronic", the first root, now holds "kidney structure". The relation "has finding site" is implicitly defined by the connection between the head span "chronic kidney disease" and the tail span "kidney structure", and this connection will be explicitly modeled and trained during training by HGAT, this attention mechanism for the graphical system.
So the final flattened sequence, and we did all of this because we wanted a flattened sequence as the input to our encoder transformer blocks, looks something like this: the root tokens "chronic kidney disease is a renal disorder" followed by pad tokens, and then the leaf slots holding "kidney structure" followed by pad tokens. So this is exactly 128 root tokens followed by 128 times 7 leaf tokens, a flattened representation that a known transformer architecture can work with.
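A small sketch of that layout, with the grafting position chosen for illustration: 128 root slots followed by 128 times 7 = 896 leaf slots, with "[PAD]" filling every unused position.

```python
# Flattened leafy-chain layout: 128 root slots + 896 leaf slots.
N_ROOTS, N_LEAVES = 128, 7
PAD = "[PAD]"

sentence_tokens = "chronic kidney disease is a renal disorder".split()

# Root chain: the sentence tokens, padded (or truncated) to 128 positions.
roots = (sentence_tokens + [PAD] * N_ROOTS)[:N_ROOTS]

# Leaf grid: 7 empty slots per root node.
leaves = [[PAD] * N_LEAVES for _ in range(N_ROOTS)]

# Graft the tail "kidney structure" onto a leaf of the head's first root node.
head_root_index = sentence_tokens.index("chronic")
leaves[head_root_index][0] = "kidney structure"

flattened = roots + [tok for leaf_slots in leaves for tok in leaf_slots]
print(len(flattened))        # 128 + 896 = 1024 positions
print(flattened[:10], "...")
```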
It turns out, Princeton tells us, hey, this is effective. Why? Why is this encoding so effective? It enables a joint masked language and masked node modeling training. The model can be trained to predict the masked tokens in the root chain, thereby learning the syntax, and the masked tokens in the leaf space, thereby learning the semantics, at the same time, in the same model. How beautiful is this? And of course it creates a path for the gradient flow to the relations during the training.
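A sketch of that joint objective, with stand-in tensors and an assumed equal weighting of the two terms: one cross-entropy loss over the masked root (syntactic) positions plus one over the masked leaf (semantic) positions, optimized together in the same model.

```python
# Joint masked language modeling (roots) + masked node modeling (leaves).
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30_000, 1024   # 128 roots + 896 leaves, flattened
torch.manual_seed(0)

logits = torch.randn(seq_len, vocab_size)          # model output (stand-in)
labels = torch.randint(0, vocab_size, (seq_len,))  # original token ids
is_masked_root = torch.zeros(seq_len, dtype=torch.bool)
is_masked_leaf = torch.zeros(seq_len, dtype=torch.bool)
is_masked_root[:128] = torch.rand(128) < 0.15      # some masked root positions
is_masked_leaf[128:] = torch.rand(896) < 0.15      # some masked leaf positions

mlm_loss = F.cross_entropy(logits[is_masked_root], labels[is_masked_root])
mnm_loss = F.cross_entropy(logits[is_masked_leaf], labels[is_masked_leaf])
joint_loss = mlm_loss + mnm_loss   # assumed equal weighting, for illustration
print(joint_loss)
```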
Yes, beautiful, you know this. So here we have another example, a screenshot from the paper, and I just want to give you a feeling for this. We have a text sequence, sequence one, and it's about non-alcoholic fatty liver disease. Beautiful. And this is its paraphrase. What is the head? This one. What is the relation? "is a", "plays a role", "has disposition", "cause of", "associated with". And now the top 20, you remember, the top k: the top 20 GraphMERT-predicted tokens. And here you see it: for each relation, "is a", "plays a role", you now have the predicted top-20 tokens, and this is based entirely on the new knowledge graph. So if you want to see this in the triplet structure: you have the head, you have the specific relation it was trained on, and this is the tail it provides, formed from the GraphMERT-predicted tokens.
Now, just to make this clear, the mathematical complexity is not trivial. We are working with different mathematical spaces. We have a syntactic space; this is more or less, if you want, what our GPT did for us, but we now have this inside our model. So we have to integrate the syntactic space with the semantic space. A vector space for grammar and context: GPT training on all of the internet was there to learn the grammar and to learn the context. And then we have the semantic space. This is the knowledge graph. This is the real, given structure and the dependencies of our technical terms in medicine. This is a vector space for meaning and facts, and, if you want, an ontology is a formal specification of the terms and the relations in a specific domain. We have medicine, and we have diabetes as a subdomain.
So let's have a look at the syntactic space. When we say syntactic space, we are conceptually referring to the domain of language structure: grammar, specific word order, parts of speech and the contextual relationships between the words in a sentence. This is exactly syntactic: how a sentence is built. The mathematical representation in GraphMERT gives this concept a concrete mathematical form, a high-dimensional vector space in the simplest case, or you can go to a vector representation populated by the embeddings of the root nodes, the tokens, remember, from the original text, from the original PubMed papers.
So every token in the source sentence, like "metformin treats disease", is now mapped to a vector, a 512-dimensional vector in our RoBERTa architecture, and the MERT encoder layers (encoder, not decoder) are trained via masked language modeling, a specific loss function that you know from BERT, to manipulate these vectors based on their context. This is why everybody's talking about context engineering now. The result is that words with a similar grammatical function or contextual role will have vectors that lie close to each other in this space. For example, the vectors for the words "treats" and "manages" would likely appear near each other, because they are both transitive verbs that often appear in similar contexts.
The second space is the semantic space, the vector space for meaning and facts. Such a beautiful space refers to the domain of meaning, concepts, factual relationships. It is about what things are and how they relate to each other ontologically in this domain. For example, it captures the fact that diabetes is a disease and metformin is a drug used to treat it. So this is maybe the same high-dimensional vector space, maybe it's different, but it is populated by the embeddings of the leaf nodes, the tokens that form the tails of the knowledge graph triplets, and this space is specifically structured by the masked node modeling loss in the HGAT module. This training forces the vectors in the semantic space to align based on the factual, ontological relationships of the domain, not on word co-occurrences like we had in GPT. And this is the beauty.
The critical insight, if you want, from this paper by Princeton is: let's build a common, shared space out of these two subspaces. The syntactic space, the root space, and the semantic space, the leaf space: you can build them so that they are not separate spaces. Maybe you can build them so that they are just two different views, two different subregions, of the same unified, much higher-dimensional embedding space. This touches on complexity theory and mathematics; it is something really beautiful, but not part of my video here.
So again, the syntactic training, the masked language modeling, arranges, let's say, the cities (the tokens) on a map based on the road network, on how you can travel from one city to another with the correct grammar. The masked node modeling, the semantic training, rearranges the cities on the same map, but based on their functions: you group all capital cities together, or all industrial cities, or old coal cities. So you see, we build those new systems in the same space.
Now you hope that the output of GraphMERT is the perfect knowledge graph: it added new knowledge, built up new knowledge, integrated the very latest knowledge. And solution A is simple: you use a graph query language, SPARQL or Cypher, like running a SQL query on a traditional database. There's no ambiguity, there's no hallucination, it is just pure, cold logic. And this is what you need in medicine, this is what you need in finance, this is what you need in theoretical physics and in quantum mechanics. The graph database will execute any query and return a list of drugs, like metformin or whatever, with mathematical certainty.
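As a sketch of solution A: once the triples sit in a graph store, retrieval is a deterministic lookup rather than a generation step. A plain Python structure stands in here for a real graph database queried with SPARQL or Cypher; the triples are illustrative.

```python
# Deterministic lookup over a tiny in-memory triple store.
triples = [
    ("type 2 diabetes", "treated_by", "metformin"),
    ("type 2 diabetes", "associated_with", "obesity"),
    ("chronic kidney disease", "has_finding_site", "kidney structure"),
]

def query(head: str, relation: str) -> list[str]:
    """Return every tail for (head, relation); same input, same output, always."""
    return [t for h, r, t in triples if h == head and r == relation]

print(query("type 2 diabetes", "treated_by"))  # ['metformin']
```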
This is impossible to do with a standard LLM, because a standard LLM, like a GPT system, is just a statistical pattern matcher. And then of course you have solution B. And now it gets interesting: GraphRAG.
Remember, what's interesting is that the authors use GraphRAG to validate the GraphMERT architecture; I'll show you this in a minute. With GraphRAG, the doctor asks the AI system in natural language: hey, what are the treatment options for a patient with diabetes who is also obese? And then this AI retrieves from a reliable knowledge graph.
It first queries your GraphMERT-generated knowledge graph to find the relevant facts, the subnetworks, small local subgraphs of verified triplets: one, two, three, four. Beautiful. And then it just augments the prompt; you have in-context learning, you have a short example prompt that looks something like this. You have the system prompt: "Hey, you are a helpful medical assistant. Using only the following facts, answer the user's questions. The facts are 1, 2, 3, 4." The user question is: "What are the treatment options?" Now you see, the LLM generates an answer, but its creativity is, I wouldn't say completely, but 90 percent constrained. It is forced to reason over those four provided facts, hopefully, but of course it can hallucinate at any time, in an amount that is undefined. Still, it is hopefully reducing the hallucinations with our knowledge graph integration.
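A small sketch of that GraphRAG step, with illustrative facts and the retrieval left out: take the verified triples retrieved for the entities in the question and constrain the LLM prompt to exactly those facts.

```python
# Assemble a fact-constrained prompt from retrieved knowledge-graph triples.
facts = [
    "type 2 diabetes treated_by metformin",
    "type 2 diabetes associated_with obesity",
    "obesity risk_factor_for type 2 diabetes",
    "metformin contraindicated_in severe renal impairment",
]
question = "What are the treatment options for a patient with diabetes who is also obese?"

system_prompt = (
    "You are a helpful medical assistant. Using only the following facts, "
    "answer the user's question.\n"
    + "\n".join(f"{i + 1}. {fact}" for i, fact in enumerate(facts))
)
# The augmented prompt is what gets sent to the (now heavily constrained) LLM.
augmented_prompt = f"{system_prompt}\n\nUser question: {question}"
print(augmented_prompt)
```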
Okay, okay, let's come to a summary. I know what you'll say: hey, where are the training algorithms? How do I code this? What is the unifying objective if we bring the spaces together? Look, it is so simple: you have a dual objective that distills the syntactic patterns into the semantic predictions, learning the global relations across the complete text corpus. It is simple, but this is just the beginning.
Yeah, results. We have to check: is this thing working, or was it just an idea? So here we have the fact-score knowledge graph evaluation in percentage points; there's a whole definition, pages and pages, in the original paper, I just give you the result. If you build a knowledge graph with an LLM, your performance is 40 percent for the context and 48 percent for context plus general truth, as defined in the paper. Now, with this new GraphMERT system, the context-only score jumps from 40 percent to close to 70 percent, and from 48 to 72 percent. So whether you build a knowledge graph with an LLM, a GPT system, or with the GraphMERT system really makes a difference. Look at this: 40 to 70 percent.
Now, of course, you might ask: hey, but why are those LLM knowledge graph fact scores so low? Princeton looked at this, and there's a complete analysis, and they said the LLM knowledge graph scored poorly when validated with the Qwen 32B, which flags its own errors, and this underscores the weakness of prompt-based elicitation: knowledge may exist in the parameters, yet the prompt-based generation fails to elicit correct, ontology-respecting triplets. This is what I started the video with, with this beautiful example: Gemini fails, GPT-5 fails, Grok 4 fails.
And they even did a further analysis, and they found three factors. First, the knowledge graph built by an LLM, by a GPT system, shows relation misinterpretation: the GPT model mapped the relation by lexical similarity rather than by the ontological meaning in our domain, drawing only on its internal knowledge and not on the given facts. Second, you have systematic malformed repetition: the same ill-formed triplets reappear across different text chunks. And third, an overloading of a head-tail pair: multiple, largely invalid relations in our knowledge graph are assigned to the same entity pair. So now we understand why we cannot build knowledge graphs with LLMs that are based on GPT systems, on the decoder part of a transformer architecture.
And now I know what you think. You say, okay, this was the Qwen, but, you know, let's go to the best model; maybe we take GPT-5 Thinking, max, full power. Let's do this. Let's go for the most expensive, best-looking model, GPT-5 with full Thinking. And you know what? Look at this on the left side: you have the performance, 100 percent for GraphMERT (never mind what the metric is in detail, you have this over pages and pages in the original paper), and GPT-5 Thinking, look at this, absolutely equal performance. This is great.
But unfortunately, Princeton put this in first place to show us: if you compare them on additional medical terms, it looks like this. The left indicator is always the GraphMERT system, and the right indicator of each bar is your GPT-5 max Thinking mode. Green is what we want, orange is maybe, and red is failure. So here, on the term metformin, the GraphMERT system achieved, if you want, I don't know, 95 percent of yes and maybe, but GPT-5 Thinking achieved something below 40 percent. So there is a massive difference. Statistically, you can get lucky with one single term, like the term shown before, but on all the other terms GPT-5 Thinking fails to build the knowledge graph.
Now, if you look at this, you say, okay, this is 80 percent performance, this is 85 percent, 90 percent, 95 percent performance. Why is this not 100 percent? Why is GraphMERT not at 100 percent? Why is GraphMERT not better than 90 percent? And here, a quote by the Princeton authors: the main issues in the graph triples are vagueness and incomplete tails. And you ask, how is this possible? It turns out the tail incompleteness arises when the helper LLM, the GPT that we have to have as an auxiliary AI system, accepts an incomplete token as a valid tail during the token combination. And I say, it doesn't matter what helper LLM we tested, it's always the same: all the LLMs hallucinate, as I showed you at the very beginning of this video. So why is GraphMERT not better? Because unfortunately we still have those three helper LLMs, which are not reasoning but just helping to filter things out, and they sometimes accept an incomplete token as a valid tail token. So we know where the error comes from.
What are the insights? I think GraphMERT's success is based on a completely novel approach. It really enforces an ontological alignment in the medical domain during the training and extraction process, and this is a good thing. The old way, with an LLM, with a GPT system, was just guessing. It was calculating statistical fluctuations and probability densities, the next-token prediction based on surface-level word correlations from the training data. But now GraphMERT learns semantic relation embeddings that adhere to the strict rules of a professional-grade ontology. And this is the beauty of the system. And you see, we don't need a GPT system at all for the next generation of AI.
And you might ask, do we need OpenAI at all? Do we need, I don't know, Perplexity, the Anthropic models, at all? Or did we just discover a new technology that would allow us to build models with just 80 million trainable parameters, where we don't even need the complex data-center infrastructure from Nvidia? Isn't it beautiful what a new technology can maybe, hopefully, theoretically achieve? And I think this is just the beginning. So if you're interested, hey, why not subscribe, or maybe even join and become a member of my channel, and I hope to see you in the next video.