Neurosymbolic 80M AI from Princeton beats GPT | Discover AI
Video Summary

Core Theme
A new AI architecture called GraphMERT from Princeton University, utilizing an encoder-only transformer with only 80 million parameters, demonstrates superior performance in building reliable knowledge graphs compared to large, decoder-based LLMs like GPT-5, potentially revolutionizing AI development by reducing reliance on massive models and complex infrastructure.

Video Transcript
Hello community. So great that you are back. Today we are going to talk about a new AI model, with just 80 million trainable parameters, that might eradicate a GPT-5 system and maybe OpenAI. So let's start now. Hello and welcome to the channel Discover AI, where we look at the latest AI technology and the latest research. In part one we asked: for superintelligence, is neurosymbolic AI really the way to go? Can you trust neurosymbolic AI for superintelligence? And we found out that, whatever else we need, any AI system of the future needs massive domain-specific knowledge graphs to represent new knowledge to the syntactic GPT systems, and by GPT I mean everything from a GPT-5 to a Claude system.
So you know what, let's do this now. Let's build a knowledge graph for a domain, and let's have a look at a new technology. Luckily for our domain, let's say we take the medical domain: when we build a knowledge graph, we already encounter in medicine a Unified Medical Language System (UMLS) for AI, which is gorgeous. So we can build a triple for our knowledge graph, and our triple is simple. We have a head: chronic kidney disease. Then we have the relation from the head to the tail, and the relation is "has finding site". And then the tail, and the tail for chronic kidney disease is of course the kidney structure. And all of this is, in the US edition of the terminology, beautifully normed.
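As a rough sketch (not from the paper), such a triple can be represented as a simple (head, relation, tail) record; the identifiers below are illustrative only.

```python
# Minimal sketch of a medical knowledge-graph triple as (head, relation, tail).
# The relation name "has_finding_site" follows the UMLS-style wording used above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str       # e.g. "chronic kidney disease"
    relation: str   # e.g. "has_finding_site"
    tail: str       # e.g. "kidney structure"

seed_triple = Triple(
    head="chronic kidney disease",
    relation="has_finding_site",
    tail="kidney structure",
)
print(seed_triple)
```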
And now we get a little bit mean. We say, you know what, we want to trick the system. So we manually create a sequence that implies a much weaker connection in our medical domain, and we say, hey, maybe we have a triplet where chronic kidney disease is "associated with", a much weaker link, a cerebellar gray matter structure. Okay, let's have a look at this.
So now they said, okay, let's have a look at the best LLMs on this planet. Let's have a look at what we can do, and we don't care if it's open source, if it's proprietary, if we have to pay; let's go with the best models there are. Let's build a knowledge graph with an LLM. So, what is the task? The task is: hey, please complete the following medical knowledge graph triple. You have "chronic kidney disease has finding site", and then I give you the content, the information, the answer, in the text. So, no problem at all, yeah?

Gemini 2.5 Pro goes and says: "chronic kidney disease has finding site cerebellar gray matter." This is the wrong answer. This is the weakest link. This is an error. And I say, okay, it was just Gemini. GPT-5 makes it "chronic kidney disease has finding site cerebellar gray matter." This is again wrong. GPT-5 cannot build a knowledge graph from three sentences. Grok 4, come on, Grok 4: "chronic kidney disease has finding site cerebellar gray matter." Grok 4 fails to build a triplet for a knowledge graph from just three or four lines of text.
But you know, there's one system left. There's one left, and yes, of course, it is Claude Sonnet 4.5. And you know what? Yes: the finding site is the kidney. Congratulations. We have it. Just wait a minute. No, wait a minute. Where did the word "pediatric" come from? This is nowhere in the source text. What is this? Now, if you look closely, you see this is a classic LLM hallucination. This is just something that the LLM hallucinated; it said, you know what, I invent a new term. Great. And now I ask you: in medicine, hallucinations? I don't think they are acceptable in any way or form.
So here we come now to the paper of today, from Princeton University, and it has an absolutely innocent title. Look at this: "Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data." Who would ever guess that this is igniting an AI revolution? GraphMERT, October 10, 2025.
And they say, you know, we have a solution. We build now a graphical multidirectional encoder representation from the transformer architecture, and this is a tiny, tiny little AI system, not with billions of parameters, but with just 80 million parameters, an encoder-only transformer architecture. But you know what? We can build a model-level neurosymbolic stack. We have in one model the neural learning and the symbolic reasoning, integrated in a very clever way. So we have an efficient GraphMERT encoder-only transformer that learns and distills the complex syntactic-to-semantic abstraction from a high-quality, domain-specific text corpus, like arXiv or bioRxiv or medRxiv, and then we have the complete reasoning, where we need a knowledge graph, and we bring it together into one system.
So GraphMERT, you might say, this sounds familiar. Yes, of course. Remember, as I showed you, Harvard built a medical LLM two or three months ago, but as I showed you in part one of this video, that is not able to find medical solutions. You need to have access to an object external to the LLM, a knowledge graph, where you have a beautiful representation of all the medical knowledge of this world. And then the problem is how to build the transfer and everything.
And you say, MERT, I know something: two, three years ago we were talking about BERT by Google, and BERT was a Bidirectional Encoder Representation from Transformers. What a coincidence. Now, guess what? It is not a coincidence, because you remember, two years ago, in January 2023, I showed you a video where we looked at the complete T5 representation of the transformer architecture by Google. And then we said, you know what, the first half, the encoder part, we call BERT, and the second half, the decoder part, we call GPT. And then something happened. GPT took off. It was amazing, it was a stellar flight. Everybody was talking about GPT, and almost everybody forgot the BERT system. And now that we have hit a wall with our GPT systems, that we fail in the logical reasoning of GPT systems, suddenly Princeton comes and says: hmm, did we miss something? Did we miss an opportunity?
And of course GPT-3.5, at the time, in January 2023, was the most important, most amazing object around. But at the same time there was also RoBERTa, and RoBERTa was a Robustly Optimized BERT Approach. They had masked language modeling with dynamic masking, they dropped some complexity, they developed a very specific tokenizer, a byte-pair-encoding tokenizer, they used larger mini-batches, and they said, you know what, BERT was completely undertrained. So RoBERTa was developed, and you know what? RoBERTa is the base for MERT. Such an old grandpa system, one I covered back in January 2023, is now the hottest new AI model in town on this planet.
Now, if you want to learn about BERT, SBERT, the sentence transformer architecture, if you want to build this from scratch, in a transformer, in Keras, in PyTorch, I have 40 videos on this particular playlist, and you see two, three, four years ago I had tons of videos on this, and we built everything on this. And you know what? Princeton now asks: hey, can we do superintelligence without any GPT part of the transformer? Do we need OpenAI? Do we need Anthropic and their models? Do we need the Meta models, the Llama models? What if we have a different technology that we forgot about? And what if we examine now the options of this forgotten technology?
So, superintelligence without OpenAI, without Anthropic, without the main companies that carry the GDP growth of the American dream. Well, it is rather simple: you just switch from the decoder part of the transformer architecture to the encoder part. That's all there is. Encoder, decoder. We have the next-token probability prediction with GPT-5 and all those models, and now we discover we don't need this. We don't need the complexity, we don't need the model size, and maybe we don't need the compute centers for this, because there's another technology available.
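To make the decoder versus encoder difference concrete, here is a tiny sketch of the only structural point made above: a GPT-style decoder applies a causal mask (each token sees only the past), while a BERT/RoBERTa-style encoder attends bidirectionally. This is an illustration, not code from the paper.

```python
# Causal (decoder) vs. bidirectional (encoder) attention masks.
import torch

seq_len = 6

# Decoder: lower-triangular causal mask, token i attends only to tokens 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder: full mask, every token attends to every other token simultaneously.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print("causal:\n", causal_mask.int())
print("bidirectional:\n", bidirectional_mask.int())
```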
And Princeton shows us in this beautiful paper, 70 pages, pure enjoyment, how to encode a knowledge graph in the encoder part of a transformer architecture. And you know what? Without any GPT system. We don't need OpenAI anymore, we don't need GPT systems anymore, we don't need Claude systems anymore, we don't need Grok systems anymore. There is an 80-million-parameter system that is outperforming all those huge LLMs. Isn't this interesting? So, our classical BERT and GPT: GPT mimics the surface-level linguistic patterns of a text, and now BERT is where we really can learn and respect the deep semantic and ontological rules of a medical domain.
Now, if you have a knowledge graph like this, it's simple; it has been known for quite some time. And now the question is simple: how do you make a graph look like a sentence, so that a transformer, but careful, we are talking now about the encoder part, like a RoBERTa, not a GPT, can understand it? So we will now develop a different technology. I mean, Princeton shows us that there's a way: GraphMERT, a graphical multi-directional encoder representation from transformers. A very familiar idea, but something whose possibilities we never continued to explore.
And they tell us: if we do this, we have a compact GraphMERT system with just 80 million parameters that completely eliminates the need for pre-training on large unverified text, making the approach much more practical than employing expensive LLMs like GPT or Claude or Grok with billions or trillions of parameters. And this is what I call an AI revolution. And it can be scaled and provided with more data, if you want to combine, let's say, the 80 million parameters for the pure medical domain with pharmacological domain knowledge. Then you need a little bit of extra compute resources, yeah, but for an 80-million model, not a billion-parameter model.
And they said, you know, let's do this. Come on, let's have fun. We train GraphMERT with 12 hidden layers in a RoBERTa architecture, with eight attention heads, with a hidden size of 512 and an intermediate size of the fully connected layer of 2K, totaling 79.7 million trainable parameters. We use a specific tokenizer, the BiomedBERT tokenizer, that was especially trained on biomedical text, on a vast amount of medical vocabulary. Beautiful. And the tokenizer vocabulary is relatively small, with 30k entries. So our vocabulary size is not as huge as you would expect.
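As a hedged sketch, the RoBERTa-style backbone dimensions mentioned here can be written down as a Hugging Face RobertaConfig. Note that GraphMERT's graph-specific parts (leaf and relation embeddings, the hierarchical graph attention) are not captured by a vanilla RobertaConfig, so this is only the text backbone, not the full model.

```python
# Backbone dimensions as described in the video, expressed as a RobertaConfig.
from transformers import RobertaConfig

backbone_config = RobertaConfig(
    vocab_size=30_000,        # small biomedical tokenizer vocabulary (~30k)
    num_hidden_layers=12,     # 12 transformer layers
    num_attention_heads=8,    # 8 attention heads
    hidden_size=512,          # hidden size 512
    intermediate_size=2048,   # "2K" fully connected intermediate size
)
print(backbone_config)
```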
So let's build this. Hey, wait a minute: if we build this, we need a training data set. This is a BERT system, I mean a MERT system, I mean a GraphMERT system. So let's do this on a subdomain like diabetes. Diabetes research is really important currently in the US and globally.
Let's build a training data set for our MERT system. So what do we need, what do we have? We have medical papers, unbelievable amounts, MEDLINE journals, whatever. You have 350,000 abstracts for training and 40,000 abstracts for evaluation, so we have hundreds of millions of tokens of pure textual description in medical technology papers. Great. And then they said, let's make it easy: given this starting condition, we want a nucleus where everything starts from. So they said, let's build a seed knowledge graph, a tiny little knowledge graph that is the original core from which everything else will develop. And of course you take the UMLS Metathesaurus and everything, beautiful, and you have clinical documentation, molecular biology and data exchange. Great.
So then they said, okay, we now have the syntactic and the semantic sources. The syntactic data sources, those are our scientific papers, the textual papers. What we need to do now is just find the entities, the medical entities. They call it the head, the head discovery. Now, unfortunately, they said, you know what, we have an LLM ready for this, on the shelf, so why should we build something? So we take an LLM, which turns out to be not really the best idea, because we want to break loose from LLMs. But okay, they said, just for the discovery, not for the reasoning, just for the discovery of the main technical terms in medicine, we use a Qwen 32B model, asking it to search for medical entities that are relevant for diabetes or whatever. Now, you could go to a medical database and just extract them, but yeah, it's so comfortable just to use an AI system. Okay.
And then, as I told you, we need a seed knowledge graph, a small expert-created set. Maybe you can do this with a human-created set; that would be perfect. So you have a ground-truth knowledge graph, you build on this and let it grow, and you have your typical triplets, your head, your relation, your tail entities, and you just have to provide the initial starting point, the semantic examples and the ontological constraints. And they said, okay, we didn't go with a full primary knowledge graph for medicine or diabetes; we just took from the literature 28k triplets for diabetes. This is available, so let's take it. Great.
So what do they have? They have the syntactic data sources; these are, I don't know, 100,000 preprints in the medical domain, all the beautiful papers that were published in the last two, three years on whatever topic, and you discover the entities, the medical terms. And then you want the semantic data structure. So you start and you say, okay, similarity matching with the text: you have the triplets, you develop the matched triplets, you can inject this, and then you arrive at a seed knowledge graph. Now, this seed knowledge graph has a very particular format; they call it leafy chain graphs. So let me explain what it is and why we do it.
What we can already see: from the text of the research papers, from the textual discovery of the entities, you see the technical terms in orange on our graph, and then, if you have the different tails to a technical term, they are shown in, let's call it, blue, and you have relations to them. So again, let's come back: how do you make a knowledge graph? If we start with a seed knowledge graph, this is the nucleus that everything condenses onto and builds up from. So how can we make this look like a sentence for a transformer? But careful, not a GPT transformer, an encoder transformer like RoBERTa. We don't need any GPT in this.
So, as I told you, the leafy chain graph encoding, a new methodology invented by Princeton, or maybe further developed by Princeton, is an elegant answer. What they do, from a methodological point of view, is they just flatten it. They flatten both the syntactic information, meaning the information about how a sentence is built from words, and the semantic information, which is simply the knowledge graph triplets, into a single unified sequential format that the transformer, again the encoder part, the RoBERTa part, can process. Simple, no? You just have to have the idea.
So, just to be sure: bidirectional context is essential here. Our BERT system is the one that outperforms a GPT decoder-only system, because a GPT system can only look at the past tokens. But you want a system that can look at tokens in both directions, in multiple directions, and see every other token in the input sequence simultaneously. And of course, think back three years to how we did this with BERT: we have masked elements, a masked tail, and the model needs to understand the full context of the head entity and then predict the masked tails.
We solved these problems three, five years ago, and now we just apply them again in a slightly more complex structure. So the leafy chain graph has a distinct backbone; this is the chain itself, the chain of the root nodes, embedded in the syntactic space. The branches are the leaf nodes, and they are embedded in the semantic space. And now, guess what: this leafy chain graph has a crucial connection, grafting a triple onto the graph and combining both mathematical spaces.
So here we see it in a flowchart. Again, we have the semantic source, our seed knowledge graph with the triples that a human maybe had as an original starting point, and then we have all the 100,000 medical papers that exist somewhere in the world. And you see, we identified the main medical terms for diabetes research, those are in orange, and the rest is just padded, that's it. From the seed knowledge graph, from the triplets, we know our tail structures and we have the relations. So now we combine those: we build our chain graphs that have the syntactic information and the semantic information combined.
And we train now, with a training data set, the GraphMERT system in more or less the same way we trained our SBERT systems. And once it is trained, guess what? We use this AI system to predict the future, to predict new research, to predict new complexities, to predict new patterns that can emerge, to predict new triplets. And now, again unfortunately, they went back and said: we just need an LLM as a helper to help us sort this. And then we have a new, extended, extracted knowledge graph with all the knowledge from our papers now in the knowledge graph. So we build up the knowledge graph. This is beautiful. Unfortunately, again, they went with an LLM.
Let's have a look at this in a little more detail. So we have, let's say, the PubMed papers, the training data set. You have the sequences; all the leaves are empty. You start only with your primary medical term for diabetes in orange. You choose your head for the triple to be predicted. And then you have a sequence with one masked leaf; you just mask, like in BERT, you're familiar with this. Then you predict the masked leaf tokens, the top k, the top five, the top ten, with the trained GraphMERT model, and you get the top-k tokens for the tail. Those are the tail token candidates, whatever the medical terms are, and then you just have it. So you build up your knowledge graph with new knowledge.
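A quick sketch of that prediction step, under the assumption of a standard masked-LM head: given the logits the trained model produces at the masked leaf position, you simply take the top-k token candidates for the tail. The logits here are random stand-ins, not GraphMERT output.

```python
# Top-k tail-token candidates from the logits at a masked leaf position.
import torch

vocab_size = 30_000
torch.manual_seed(0)

# Stand-in for "logits at the masked leaf position" from the trained model.
masked_leaf_logits = torch.randn(vocab_size)

top_k = 5
probs = torch.softmax(masked_leaf_logits, dim=-1)
top_probs, top_token_ids = torch.topk(probs, k=top_k)

# In the real pipeline these ids would be decoded with the biomedical tokenizer
# and become the tail-token candidates for the open triple.
for rank, (p, tok_id) in enumerate(zip(top_probs.tolist(), top_token_ids.tolist()), 1):
    print(f"candidate {rank}: token_id={tok_id}, prob={p:.4f}")
```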
Now, interestingly, when they designed the GraphMERT pipeline, you have a sequence where I say, okay, diabetes has the finding site, and the tail element is open, and you have text from the papers, and here you have this particular text (never mind, it was chosen for a particular reason). So what do you do? Wait a minute, wait a minute. What you do, as we just went through, to train the GraphMERT system: unfortunately, you have a helper here. This is not a reasoning LLM, this is not a thinking LLM. This LLM is just helping with auxiliary data of secondary importance. We could substitute this LLM, remove it completely, but it is in the paper because it is, yeah, simple and nice, off the shelf. Great.
And then we have a sequence with a triplet and another sequence with a triple. And what the system does here: it just evaluates the similarity score, a cosine similarity in the vector space, between the triplets from the previous step and the sequence of origin, and only triplets with a score higher than a particular threshold will pass. So we have another filter established, and then we have our output, and the output is what we expected: diabetes has the finding site with the correct, ontology-consistent tail.
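Here is a small sketch of that embedding-based filter: keep a candidate triplet only if the cosine similarity between its embedding and the embedding of the source sequence exceeds a threshold. The embeddings and the threshold value are stand-ins for illustration.

```python
# Cosine-similarity filter over candidate triplets.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 512
threshold = 0.75  # illustrative value, not from the paper

source_sequence_emb = torch.randn(dim)
candidate_triplet_embs = torch.randn(4, dim)

similarities = F.cosine_similarity(
    candidate_triplet_embs, source_sequence_emb.unsqueeze(0), dim=-1
)
kept = [i for i, s in enumerate(similarities.tolist()) if s > threshold]
print("similarities:", [round(s, 3) for s in similarities.tolist()])
print("triplets passing the filter:", kept)
```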
Now, if you don't do this with the GraphMERT system, and you test instead what you can build with an LLM, the knowledge graph pipeline without GraphMERT, an LLM plus a knowledge graph the classical way, you know what we find? We find that the solution by this system is: diabetes has the finding site "urban area". Yes, an urban area was mentioned in the primary text given here; what a coincidence, the experiment was designed exactly for this. But you understand what happened: the LLM, the GPT system, misinterpreted the "has finding site" relation, treating the token "site" as a geographic location and thereby referring to "urban area" as the site instead of an anatomical structure, which results in an invalid triplet. It is violating the ontology of the medical domain.
You see, this is the difference between the old system and the new system. And they have a lot of additional explanation in those 70 pages. They have a ton of experimental data, but you need two days to really have a look at this paper.
Okay. Now, GraphMERT: you know, in a transformer we have an embedding layer and then we have our transformer layers. And the very first embedding layer is now also specific. Not only in the transformer layers do we have fused nodes with a scaling that depends on the distance in our vector space; we have it in the embedding layer as well. We now have to combine the syntactic space and the semantic space into a common space, and the question is, hey, does a common space exist at all on a mathematical level? So what we have now is an attention mechanism, and here of course a graphical attention mechanism, a hierarchical graph attention mechanism, that encodes our semantic triples in the knowledge graph, where they are leaves connected to the root nodes. And this attention mechanism on the graph uses the leaves, the relations and the head embeddings, resulting in a new, updated, fused node feature.
So, in the attention layers of the transformer architecture, the attention weights are multiplied by a function that decreases exponentially with the pairwise distance, and they encode the graph relations and the graph distances respectively. This is very similar to what you know from a RAG system, from a vector-space interpretation of cosine similarity, where closeness in the embedding neighborhood equals semantic similarity, but this is, you know, the advanced level.
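A minimal sketch of that idea, with made-up distances and decay rate: standard attention scores are damped by a factor that decays exponentially with the pairwise graph distance between nodes, then renormalized.

```python
# Attention weights damped by an exponential function of pairwise graph distance.
import torch

torch.manual_seed(0)
n, d = 4, 8
q = torch.randn(n, d)
k = torch.randn(n, d)

# Pairwise "graph distance" between the n nodes (illustrative values).
dist = torch.tensor([[0., 1., 2., 3.],
                     [1., 0., 1., 2.],
                     [2., 1., 0., 1.],
                     [3., 2., 1., 0.]])
alpha = 0.5  # assumed decay rate

scores = (q @ k.T) / d ** 0.5
attn = torch.softmax(scores, dim=-1)
attn = attn * torch.exp(-alpha * dist)          # exponential decay with distance
attn = attn / attn.sum(dim=-1, keepdim=True)    # renormalize each row
print(attn)
```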
This hierarchical graph attention mechanism is a key feature in GraphMERT for the training in general. For each injected triplet, the hierarchical graph attention (HGAT) performs relation-aware attention to encode the full triplet semantics into the leaf embeddings, thereby of course replacing the initial leaf token embeddings. So we update our relations, we update the integration of new knowledge into the knowledge graph, and this occurs in the input embedding layer, before the transformer processing, enabling backpropagation to the relations during the masked node modeling that we know from BERT. And this happens layer by layer in a hierarchical structure, hence the name hierarchical graph attention, to make sure the model captures the bigger picture of how facts connect.
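Here is a very rough conceptual sketch of my reading of that step, not the paper's actual HGAT: the leaf (tail) slot attends over the head and relation embeddings, and the fused result replaces the initial leaf token embedding before the transformer layers, so gradients can flow back into the relation embedding.

```python
# Conceptual sketch: fuse head and relation embeddings into the leaf embedding.
import torch
import torch.nn as nn

dim = 512

class LeafFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

    def forward(self, leaf_emb, head_emb, rel_emb):
        # The leaf token attends over [head, relation]; the fused output replaces
        # the initial leaf token embedding in the input embedding layer.
        query = leaf_emb.unsqueeze(1)                      # (B, 1, dim)
        context = torch.stack([head_emb, rel_emb], dim=1)  # (B, 2, dim)
        fused, _ = self.attn(query, context, context)
        return fused.squeeze(1)

fusion = LeafFusion(dim)
leaf, head, rel = (torch.randn(2, dim) for _ in range(3))
print(fusion(leaf, head, rel).shape)  # torch.Size([2, 512])
```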
Now new knowledge is built and integrated with the "old" knowledge, in quotation marks, which helps avoid mistakes and build a reliable knowledge graph you can trust, because you can immediately find the original sentence it came from, and you hope that there are no hallucinations at all. And here you have the complete flowchart.
This is the flowchart. We start with a training data set for our GraphMERT system. The technical terms are discovered, the entity discovery, unfortunately with an LLM. Then we have a similarity matching, and we have to have a seed knowledge graph. So we have the papers and the seed knowledge graph; those are our input data, if you want. And then we have the training of this beautiful new GraphMERT system, the encoder-only part of a transformer architecture. Once trained, it can predict your tails; GraphMERT's tail tokens are then combined, unfortunately again with another LLM, because it was convenient just to take an LLM and not to find a better solution without one. We have an embedding-based similarity check, and we end up with an expanded, grown knowledge graph with more knowledge. This is it.
Now, you see, the authors did use helper LLMs, not reasoning LLMs, just helper LLMs, for three auxiliary tasks. First, as I told you, discovering the head entities. Second, selecting the relations for the subsequent GraphMERT predictions. And third, combining the single-token predictions into meaningful, relation-aware tail phrases, into nice sentences that humans understand. But there's no logic in it.
So let's have a look at the mathematical spaces, and I know you have been waiting for this; of course, me too. So from GraphMERT we have the backbone. This is the chain of the root nodes, and it comes from the research papers, from PubMed or whatever medical text you have. This is the syntactic space itself.
We broke the text down into individual tokens. A token becomes a root node in the chain, and this part of the graph represents the pure syntactic space. It holds the original unstructured text, providing the grammatical and contextual foundation of any knowledge that will be extracted. Now, in the paper they chose a specific number: each input sequence is standardized to have a fixed number of root nodes, and they chose 128 tokens. This is it. Okay. And if a sentence is shorter, it's just padded with pad tokens, and if it's longer, it is truncated or split. Standard stuff. So let's have an example: the sentence "metformin treats type 2 diabetes" becomes a chain of root nodes: metformin, treats, type, 2, diabetes, and so on.
Then we have the branches, the leaves. Just to make this clear: the leaves live in a different mathematical space, in the semantic space. Each root node in the chain has a fixed number of leaf nodes attached to it. They decided to go with seven, just a choice; you could choose 15. Think of them as dedicated slots or containers holding semantic information for this specific root. So the leaves represent the semantic space. This is where the tail entities of the knowledge graph triplets are placed. And in the paper, each of the 128 root nodes has seven leaf nodes. This was enough for a simple diabetes medical domain.
So this creates a large, regular grid of 896 leaf tokens. And during training, of course, most of those leaves are empty or filled with a pad token, and only the leaves corresponding to a head entity in a known triplet are populated. So you could argue, well, this is a sparse graph. Yeah, but it is a beautiful graph. There is no hallucination yet, except where we use the LLMs three times, but more about this later. And now the crucial connection. Guess what? We bring it together: grafting your triplet onto the graph.
So the head: the head entity is a span of text found within the sentence, and therefore already exists in one or more root nodes. Let's say "metformin" is the first root node. This grounds the semantic fact directly to its textual origin; we know exactly where this information came from, from which original medical paper. The tail: the tail entity is placed onto the leaf nodes that are directly connected to the head's root nodes. For example, the tail token, say "type 2 diabetes", would be placed into one of the seven leaves associated with the metformin root node. And the relation acts as the type of connection, the edge attribute that we know from graph theory and mathematical physics.
Let's have an example: "Chronic kidney disease is a renal disorder." The seed triplet is: chronic kidney disease, has finding site, kidney structure. So what are the root nodes here? "Chronic kidney disease (CKD) is a renal disorder." One of the seven leaves for the root node "chronic", the first root, now holds "kidney structure". The relation "has finding site" is implicitly defined by the connection between the head span "chronic kidney disease" and the tail span "kidney structure", and this connection will be explicitly modeled and trained during training by HGAT, this attention mechanism for the graphical system.
So the final flattened sequence, and we did all of this because we wanted a flattened sequence as the input to our encoder transformer blocks, looks something like this: the root tokens "chronic kidney disease is a renal disorder" followed by pad tokens, and then the leaf slots holding "kidney structure" followed by pad tokens. So this is exactly 128 root tokens followed by 128 times 7 leaf tokens, a flattened representation that a known transformer architecture can work with.
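A small sketch of that layout, with the grafting position chosen for illustration: 128 root slots followed by 128 times 7 = 896 leaf slots, with "[PAD]" filling every unused position.

```python
# Flattened leafy-chain layout: 128 root slots + 896 leaf slots.
N_ROOTS, N_LEAVES = 128, 7
PAD = "[PAD]"

sentence_tokens = "chronic kidney disease is a renal disorder".split()

# Root chain: the sentence tokens, padded (or truncated) to 128 positions.
roots = (sentence_tokens + [PAD] * N_ROOTS)[:N_ROOTS]

# Leaf grid: 7 empty slots per root node.
leaves = [[PAD] * N_LEAVES for _ in range(N_ROOTS)]

# Graft the tail "kidney structure" onto a leaf of the head's first root node.
head_root_index = sentence_tokens.index("chronic")
leaves[head_root_index][0] = "kidney structure"

flattened = roots + [tok for leaf_slots in leaves for tok in leaf_slots]
print(len(flattened))        # 128 + 896 = 1024 positions
print(flattened[:10], "...")
```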
It turns out, Princeton tells us, hey, this is effective. Why? Why is this encoding so effective? It enables a joint masked language and masked node modeling training. The model can be trained to predict the masked tokens in the root chain, thereby learning the syntax, and the masked tokens in the leaf space, thereby learning the semantics, at the same time, in the same model. How beautiful is this? And of course it creates a path for the gradient flow to the relations during the training.
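A sketch of that joint objective, with stand-in tensors and an assumed equal weighting of the two terms: one cross-entropy loss over the masked root (syntactic) positions plus one over the masked leaf (semantic) positions, optimized together in the same model.

```python
# Joint masked language modeling (roots) + masked node modeling (leaves).
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30_000, 1024   # 128 roots + 896 leaves, flattened
torch.manual_seed(0)

logits = torch.randn(seq_len, vocab_size)          # model output (stand-in)
labels = torch.randint(0, vocab_size, (seq_len,))  # original token ids
is_masked_root = torch.zeros(seq_len, dtype=torch.bool)
is_masked_leaf = torch.zeros(seq_len, dtype=torch.bool)
is_masked_root[:128] = torch.rand(128) < 0.15      # some masked root positions
is_masked_leaf[128:] = torch.rand(896) < 0.15      # some masked leaf positions

mlm_loss = F.cross_entropy(logits[is_masked_root], labels[is_masked_root])
mnm_loss = F.cross_entropy(logits[is_masked_leaf], labels[is_masked_leaf])
joint_loss = mlm_loss + mnm_loss   # assumed equal weighting, for illustration
print(joint_loss)
```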
Yes, beautiful, you know this. So here we have another example, a screenshot from the paper, and I just want to give you a feeling for this. We have a text sequence, sequence one, and it's about non-alcoholic fatty liver disease. Beautiful. And this is its paraphrase. What is the head? This one. What is the relation? "is a", "plays a role", "has disposition", "cause of", "associated with". And now the top 20, you remember, the top k: the top 20 GraphMERT-predicted tokens. And here you see it: for each relation, "is a", "plays a role", you now have the predicted top-20 tokens, and this is based entirely on the new knowledge graph. So if you want to see this in the triplet structure: you have the head, you have the specific relation it was trained on, and this is the tail it provides, formed from the GraphMERT-predicted tokens.
Now, just to make this clear, the mathematical complexity is not trivial. We are working with different mathematical spaces. We have a syntactic space; this is more or less, if you want, what our GPT did for us, but we now have this inside our model. So we have to integrate the syntactic space with the semantic space. A vector space for grammar and context: GPT training on all of the internet was there to learn the grammar and to learn the context. And then we have the semantic space. This is the knowledge graph. This is the real, given structure and the dependencies of our technical terms in medicine. This is a vector space for meaning and facts, and, if you want, an ontology is a formal specification of the terms and the relations in a specific domain. We have medicine, and we have diabetes as a subdomain.
So let's have a look at the syntactic space. When we say syntactic space, we are conceptually referring to the domain of language structure: grammar, specific word order, parts of speech and the contextual relationships between the words in a sentence. This is exactly syntactic: how a sentence is built. The mathematical representation in GraphMERT gives this concept a concrete mathematical form, a high-dimensional vector space in the simplest case, or you can go to a vector representation populated by the embeddings of the root nodes, the tokens, remember, from the original text, from the original PubMed papers.
So every token in the source sentence, like "metformin treats disease", is now mapped to a vector, a 512-dimensional vector in our RoBERTa architecture, and the MERT encoder layers (encoder, not decoder) are trained via masked language modeling, a specific loss function that you know from BERT, to manipulate these vectors based on their context. This is why everybody's talking about context engineering now. The result is that words with a similar grammatical function or contextual role will have vectors that lie close to each other in this space. For example, the vectors for the words "treats" and "manages" would likely appear near each other, because they are both transitive verbs that often appear in similar contexts.
The second space is the semantic space, the vector space for meaning and facts. Such a beautiful space refers to the domain of meaning, concepts, factual relationships. It is about what things are and how they relate to each other ontologically in this domain. For example, it captures the fact that diabetes is a disease and metformin is a drug used to treat it. So this is maybe the same high-dimensional vector space, maybe it's different, but it is populated by the embeddings of the leaf nodes, the tokens that form the tails of the knowledge graph triplets, and this space is specifically structured by the masked node modeling loss in the HGAT module. This training forces the vectors in the semantic space to align based on the factual, ontological relationships of the domain, not on word co-occurrences like we had in GPT. And this is the beauty.
The critical insight, if you want, from this paper by Princeton is: let's build a common, shared space out of these two subspaces. The syntactic space, the root space, and the semantic space, the leaf space: you can build them so that they are not separate spaces. Maybe you can build them so that they are just two different views, two different subregions, of the same unified, much higher-dimensional embedding space. This touches on complexity theory and mathematics; it is something really beautiful, but not part of my video here.
So again, the syntactic training, the masked language modeling, arranges, let's say, the cities (the tokens) on a map based on the road network, on how you can travel from one city to another with the correct grammar. The masked node modeling, the semantic training, rearranges the cities on the same map, but based on their functions: you group all capital cities together, or all industrial cities, or old coal cities. So you see, we build those new systems in the same space.
Now you hope that the output of GraphMERT is the perfect knowledge graph: it added new knowledge, built up new knowledge, integrated the very latest knowledge. And solution A is simple: you use a graph query language, SPARQL or Cypher, like running a SQL query on a traditional database. There's no ambiguity, there's no hallucination, it is just pure, cold logic. And this is what you need in medicine, this is what you need in finance, this is what you need in theoretical physics and in quantum mechanics. The graph database will execute any query and return a list of drugs, like metformin or whatever, with mathematical certainty.
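As a sketch of solution A: once the triples sit in a graph store, retrieval is a deterministic lookup rather than a generation step. A plain Python structure stands in here for a real graph database queried with SPARQL or Cypher; the triples are illustrative.

```python
# Deterministic lookup over a tiny in-memory triple store.
triples = [
    ("type 2 diabetes", "treated_by", "metformin"),
    ("type 2 diabetes", "associated_with", "obesity"),
    ("chronic kidney disease", "has_finding_site", "kidney structure"),
]

def query(head: str, relation: str) -> list[str]:
    """Return every tail for (head, relation); same input, same output, always."""
    return [t for h, r, t in triples if h == head and r == relation]

print(query("type 2 diabetes", "treated_by"))  # ['metformin']
```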
This is impossible to do with a standard LLM, because a standard LLM, like a GPT system, is just a statistical pattern matcher. And then of course you have solution B. And now it gets interesting: GraphRAG.
Remember, what's interesting is that the authors use GraphRAG to validate the GraphMERT architecture; I'll show you this in a minute. With GraphRAG, the doctor asks the AI system in natural language: hey, what are the treatment options for a patient with diabetes who is also obese? And then this AI retrieves from a reliable knowledge graph.
It first queries your GraphMERT-generated knowledge graph to find the relevant facts, the subnetworks, small local subgraphs of verified triplets: one, two, three, four. Beautiful. And then it just augments the prompt; you have in-context learning, you have a short example prompt that looks something like this. You have the system prompt: "Hey, you are a helpful medical assistant. Using only the following facts, answer the user's questions. The facts are 1, 2, 3, 4." The user question is: "What are the treatment options?" Now you see, the LLM generates an answer, but its creativity is, I wouldn't say completely, but 90 percent constrained. It is forced to reason over those four provided facts, hopefully, but of course it can hallucinate at any time, in an amount that is undefined. Still, it is hopefully reducing the hallucinations with our knowledge graph integration.
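A small sketch of that GraphRAG step, with illustrative facts and the retrieval left out: take the verified triples retrieved for the entities in the question and constrain the LLM prompt to exactly those facts.

```python
# Assemble a fact-constrained prompt from retrieved knowledge-graph triples.
facts = [
    "type 2 diabetes treated_by metformin",
    "type 2 diabetes associated_with obesity",
    "obesity risk_factor_for type 2 diabetes",
    "metformin contraindicated_in severe renal impairment",
]
question = "What are the treatment options for a patient with diabetes who is also obese?"

system_prompt = (
    "You are a helpful medical assistant. Using only the following facts, "
    "answer the user's question.\n"
    + "\n".join(f"{i + 1}. {fact}" for i, fact in enumerate(facts))
)
# The augmented prompt is what gets sent to the (now heavily constrained) LLM.
augmented_prompt = f"{system_prompt}\n\nUser question: {question}"
print(augmented_prompt)
```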
Okay, okay, let's come to a summary. I know what you'll say: hey, where are the training algorithms? How do I code this? What is the unifying objective if we bring the spaces together? Look, it is so simple: you have a dual objective that distills the syntactic patterns into the semantic predictions, learning the global relations across the complete text corpus. It is simple, but this is just the beginning.
Yeah, results. We have to check: is this thing working, or was it just an idea? So here we have the fact-score knowledge graph evaluation in percentage points; there's a whole definition, pages and pages, in the original paper, I just give you the result. If you build a knowledge graph with an LLM, your performance is 40 percent for the context and 48 percent for context plus general truth, as defined in the paper. Now, with this new GraphMERT system, the context-only score jumps from 40 percent to close to 70 percent, and from 48 to 72 percent. So whether you build a knowledge graph with an LLM, a GPT system, or with the GraphMERT system really makes a difference. Look at this: 40 to 70 percent.
Now, of course, you might ask: hey, but why are those LLM knowledge graph fact scores so low? Princeton looked at this, and there's a complete analysis, and they said the LLM knowledge graph scored poorly when validated with the Qwen 32B, which flags its own errors, and this underscores the weakness of prompt-based elicitation: knowledge may exist in the parameters, yet the prompt-based generation fails to elicit correct, ontology-respecting triplets. This is what I started the video with, with this beautiful example: Gemini fails, GPT-5 fails, Grok 4 fails.
And they even did a further analysis, and they found three factors. First, the knowledge graph built by an LLM, by a GPT system, shows relation misinterpretation: the GPT model mapped the relation by lexical similarity rather than by the ontological meaning in our domain, drawing only on its internal knowledge and not on the given facts. Second, you have systematic malformed repetition: the same ill-formed triplets reappear across different text chunks. And third, an overloading of a head-tail pair: multiple, largely invalid relations in our knowledge graph are assigned to the same entity pair. So now we understand why we cannot build knowledge graphs with LLMs that are based on GPT systems, on the decoder part of a transformer architecture.
And now I know what you think. You say, okay, this was the Qwen, but, you know, let's go to the best model; maybe we take GPT-5 Thinking, max, full power. Let's do this. Let's go for the most expensive, best-looking model, GPT-5 with full Thinking. And you know what? Look at this on the left side: you have the performance, 100 percent for GraphMERT (never mind what the metric is in detail, you have this over pages and pages in the original paper), and GPT-5 Thinking, look at this, absolutely equal performance. This is great.
But unfortunately, Princeton put this in first place to show us: if you compare them on additional medical terms, it looks like this. The left indicator is always the GraphMERT system, and the right indicator of each bar is your GPT-5 max Thinking mode. Green is what we want, orange is maybe, and red is failure. So here, on the term metformin, the GraphMERT system achieved, if you want, I don't know, 95 percent of yes and maybe, but GPT-5 Thinking achieved something below 40 percent. So there is a massive difference. Statistically, you can get lucky with one single term, like the term shown before, but on all the other terms GPT-5 Thinking fails to build the knowledge graph.
Now, if you look at this, you say, okay, this is 80 percent performance, this is 85 percent, 90 percent, 95 percent performance. Why is this not 100 percent? Why is GraphMERT not at 100 percent? Why is GraphMERT not better than 90 percent? And here, a quote by the Princeton authors: the main issues in the graph triples are vagueness and incomplete tails. And you ask, how is this possible? It turns out the tail incompleteness arises when the helper LLM, the GPT that we have to have as an auxiliary AI system, accepts an incomplete token as a valid tail during the token combination. And I say, it doesn't matter what helper LLM we tested, it's always the same: all the LLMs hallucinate, as I showed you at the very beginning of this video. So why is GraphMERT not better? Because unfortunately we still have those three helper LLMs, which are not reasoning but just helping to filter things out, and they sometimes accept an incomplete token as a valid tail token. So we know where the error comes from.
What are the insights? I think GraphMERT's success is based on a completely novel approach. It really enforces an ontological alignment in the medical domain during the training and extraction process, and this is a good thing. The old way, with an LLM, with a GPT system, was just guessing. It was calculating statistical fluctuations and probability densities, the next-token prediction based on surface-level word correlations from the training data. But now GraphMERT learns semantic relation embeddings that adhere to the strict rules of a professional-grade ontology. And this is the beauty of the system. And you see, we don't need a GPT system at all for the next generation of AI.
And you might ask, do we need OpenAI at all? Do we need, I don't know, Perplexity, the Anthropic models, at all? Or did we just discover a new technology that would allow us to build models with just 80 million trainable parameters, where we don't even need the complex data-center infrastructure from Nvidia? Isn't it beautiful what a new technology can maybe, hopefully, theoretically achieve? And I think this is just the beginning. So if you're interested, hey, why not subscribe, or maybe even join and become a member of my channel, and I hope to see you in the next video.