This event, "Acceleration AI," focused on the advancements and challenges of Generative AI at the edge, bringing together industry and academic experts to discuss innovations in machine learning, edge computing, hardware-software co-design, and the future of AI.
Hello everyone, and welcome to the sixth edition of Acceleration AI. I'm Yin from CMC Microsystems, and I am very pleased to be your host today for this annual virtual event, which I have had the pleasure of organizing since 2020. This year's workshop is supported by FABrIC, our latest initiative, which I will talk about a little later.
It has been inspiring to see the growth of this workshop over the years as it continues to bring together a dynamic and expanding community of researchers, innovators and experts. Today we are proud to present an outstanding group of speakers. These leaders in their fields will share the latest developments in machine learning, edge computing, generative AI and hardware-software co-design.
I would like to extend a sincere thank you to our distinguished speakers today for sharing their time and insights, and to all of you for being here. Whether you are a professor, researcher, startup founder or an industry professional, your participation is what makes this event so valuable. Before we begin, please note that this session is being recorded and will be made available soon.
As we dive into today's workshop, let's take a moment to reflect on the overarching goals of this event. Our mission is to bring together experts from both industry and academia to share the latest trends and innovations in AI, explore the key challenges and opportunities in cloud and edge computing, and identify opportunities for collaborations that can drive us forward. Additionally, from CMC's perspective, we aim to identify the common infrastructure requirements that will support the growth and scalability of these transformative technologies, to better support the Canadian ecosystem. With this in mind, we are set for an exciting and insightful event as we navigate the intersection of AI and edge computing.
This year's workshop shines a spotlight on generative AI at the edge, with a focus on the latest advancements and real-world challenges in building efficient, cost-effective AI solutions for resource-constrained environments. Our speakers will delve into topics like model optimization and security, including techniques for fine-tuning models and seamlessly integrating them with edge hardware. We will also explore developments in AI hardware, featuring architectures like RISC-V processors, analog neural network chips and FPGAs, and how these technologies can help address environmental challenges such as radiation effects in harsh deployment settings. To conclude the day, we will host a panel session on the future of edge AI, where we'll reflect on emerging opportunities and what lies ahead for this rapidly evolving field.

Before we dive into the CMC Microsystems opening remarks, which include CMC products and services in IoT and edge AI, let me quickly walk you through today's agenda with some housekeeping rules. We have a packed and exciting lineup of presentations from our distinguished speakers. Each speaker brings a unique perspective on the intersection of AI and edge computing, and each presentation is scheduled for 20 minutes: hopefully 15 minutes for the presentation, followed by a 5-minute Q&A session. And this message is for my dear speakers: please try to keep your presentation time to 15 minutes. When you see me appearing on your screen, it means you need to wrap up so we can move to the Q&A session.

So let's talk a little bit about the agenda today. Our first presentation is from Pierre Paulin from Synopsys, who will shed light on cost-effective solutions for generative AI at the edge, followed by Davis Sawyer from NXP Semiconductors, who will present secure fine-tuned LLMs for generative AI at the edge. Professor Warren Gross from McGill University will then present parameter-efficient fine-tuning of transformer-based language models using data at Brunie. I would like to thank these three speakers again; they have been long-term contributors to the workshop. We also have a new speaker, Borak from Edge Signal, a startup here in Ottawa. He will be presenting the implementation of generative AI in edge environments: challenges and solutions.
We will have a 5-minute break, and then we will resume the workshop with a presentation from Katarina from NVIDIA, who will present the NVIDIA edge AI stack: software and hardware. This will be followed by another long-term contributor to the workshop, Professor Franual Primo from Polytechnique, who will cover Polar, the collaborative design of an open-source RISC-V multicore processor. Then we'll have a new presenter from academia, Professor Li Chen from the University of Saskatchewan, who will cover radiation effects in convolutional networks implemented on FPGAs and mitigation techniques. Last but not least, Niraj Mathew from Blumind will switch gears to cover an all-analog neural network processor that delivers highly efficient, high-performance AI inferencing. New this year, we have invited a distinguished panel moderator, Walter Knights, CEO of EIoT Canada, who will host our panel session today, covering pioneering the future of generative AI at the edge: challenges, opportunities and innovation. So this is a high-level overview of our agenda today, and now let me give you some news.
As most of you know, FABrIC, our latest initiative, is funded by the ISED Strategic Innovation Fund and managed by CMC Microsystems. It is focused on building a strong and sustainable semiconductor ecosystem in Canada: supporting companies developing homegrown semiconductor technologies, encouraging collaboration across industry, and helping grow Canada's role in the global supply chain.
Public challenge projects help Canadian industry and academia develop next-generation semiconductor processes and products, with a focus on photonics, MEMS and quantum. IoT projects drive innovation in sensors for cleantech, healthcare and telecom. This initiative provides design tools, methods and prototype fabrication, with up to 50% reimbursement for industry and full coverage for academia. These challenges help strengthen the Canadian manufacturing supply chain.
The FABrIC innovation platform offers tools, technical resources and training to support a strong talent pipeline, accelerate product development and drive world-class research. Currently, the FABrIC ecosystem welcomes Canadian professionals, and academic, government and industry experts who are passionate about semiconductors. Academics and students keep access to their CMC subscriptions like CAD tools, fabrication and Basecamp, while gaining access to extra training and resources through FABrIC.

Now, a brief introduction to our latest developments at CMC Microsystems in support of the IoT and edge AI ecosystem in Canada. This slide showcases our end-to-end IoT development process, from concept to prototype. It begins with project launch, including consultation, needs analysis and partnership. Then we move to design, selecting components and optimizing the IoT architecture. In manufacturing, we handle supplier coordination, production planning and quality control. Finally, our prototype phase includes embedded software, cloud-edge integration, testing and the path to volume production. This is of course high level, so if you need more details we can schedule a quick meeting and walk you through all that is available.

We've developed an open-source, customizable IoT sensor platform, with the KiCad PCB design, including schematics, layout and bill of materials, all available on the FABrIC GitHub. This is a Bluetooth Low Energy module that supports sensor networks, machine monitoring and electromechanical sensing, connecting to apps for data display and processing. These demos are available for evaluation, and we are developing applications across various verticals with these IoT sensor demonstrators. One example is the IoT platform for smart agriculture, which enables easy integration, field testing and real-time monitoring of environmental parameters like temperature, humidity and soil moisture. It supports applications in greenhouse automation, environmental monitoring, livestock farming, automatic irrigation and
more. On the edge side, we offer a one-stop shop for development, from concept to prototype. We begin with conceptualization: defining the problem and goals and addressing hardware and software constraints. We have a large collection of datasets that we use for training. Next we move to model training. Depending on the problem, we help our clients select the right model for their application. We have an infrastructure that allows us to train these models efficiently, and we optimize them for deployment at the edge. And we have some examples to show here.

For the flow we use for edge development, we use our own infrastructure for training. We start with pre-trained models and fine-tune them on custom datasets, mostly on our cluster, which is powered by Tesla V100 GPUs, for training and inference testing. We then test the trained model again, and for edge deployment we use a variety of tools to compress and optimize these models so they are suitable for the edge. We use various edge platforms for deployment; I will show some of them here. This is the infrastructure we continue improving. On the cloud side we have the FPGA/GPU cluster, where we use mostly GPUs for training, and we use a complete software stack for optimization. We partner with Enser and Storins, who are building custom inference chips for low-power inference applications. We also support the Jetson Orin from NVIDIA for most of our IoT and edge AI demonstrators.

Here is one example of the edge AI demonstrators we have built as part of FABrIC. We took a state-of-the-art computer vision model, YOLOv9, and trained it on a custom dataset. The main objective here is to enhance worker safety through real-time anomaly detection. This is a big model that we trained and optimized, and we were able to run it at almost 40 frames per second on a Jetson Orin in real time. This allows us, for example, to detect workers who are not wearing their safety equipment. The second example is generative-AI-based prompt vision for advanced video analytics. The system we have developed uses OWL-ViT, a powerful vision-language model, to detect objects in real time based on natural-language prompts. You just type what you are looking for, like "helmets" or "people with bags," and the model instantly highlights them in the video stream. The front-end application we have developed shows live statistics for each detected object, with the time it was detected and its number of occurrences, all while running efficiently on the Jetson Orin. We did a lot of optimization on this model so it runs fast in real time, and this is part of our support under FABrIC for the Canadian ecosystem, for anyone who wants to integrate edge AI into their applications.
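As a small illustration of the kind of per-label statistics such a front end might aggregate, here is a sketch of folding detection events into per-prompt counts and detection times. This is an invented example, not the CMC application; the event data and names are hypothetical:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LabelStats:
    count: int = 0                                   # occurrences so far
    timestamps: list = field(default_factory=list)   # when each was detected

def aggregate(detections):
    """Fold (timestamp, label) detection events into per-label statistics,
    the kind of summary a live dashboard would display per prompt."""
    stats = defaultdict(LabelStats)
    for ts, label in detections:
        stats[label].count += 1
        stats[label].timestamps.append(ts)
    return dict(stats)

# Hypothetical events for prompts like "helmet" / "person with bag"
events = [(0.03, "helmet"), (0.07, "helmet"),
          (0.07, "person with bag"), (0.12, "helmet")]
stats = aggregate(events)
print(stats["helmet"].count)  # 3
```

In a real pipeline the events would stream from the detector; the aggregation step itself is cheap enough to run alongside inference on the device.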
Now, before we dive into the workshop topics, I would like to go through some high-level market trends.
We expect 29 billion connected devices by 2023, an annual increase of 12%. The drivers are IoT, 5G and AI; these are not buzzwords, they are transforming many sectors: healthcare, Industry 4.0, the environment, precision agriculture, smart cities. On data: we are generating massive sensor data, and the volume is doubling every two to three years; 75% of this data is now processed at the edge, a rise from 10% in 2018. This puts a lot of pressure on edge computing, where you need very high-performance, low-power computing capabilities at the edge in order to deal with all this data. On the GenAI transformation, as most of you know, 60 to 70% of tasks could be automated by GenAI by next year, and 60% of it is multimodal, a rise from 1% in 2023.
This is extremely fast; it's similar to switching from a flip phone to a smartphone overnight, and it's really something industry is trying to capitalize on. On the edge AI focus: as we know, AI is continuously moving from the cloud to the edge because of these advantages: low latency, bandwidth savings, data privacy, security and autonomy. What about energy? 90% of the power is consumed by data movement. This is a fact, and it has led to high demand for energy-centric hardware, including new innovative approaches to classical computing and even some advancements in photonics and wide-bandgap semiconductor materials. If you need to know more about photonics and wide-bandgap semiconductors, we have a team dedicated to these technologies, as well as to the exploration of quantum, spiking and analog architectures.
We have a presentation from Blumind about analog architectures today, so I'm looking forward to hearing their latest advancements. On security and optimization: trustworthiness, which combines safety, security, reliability and privacy, and a need for standards to ensure interoperability for adoption. So these are the high-level trends, and I think the speakers will also cover some of these, so we will see if we are aligned.

Back to the workshop, I would like to start with our first speaker. Kicking off our lineup is Pierre Paulin from Synopsys. With over 30 years of experience in AI, neural processing and embedded systems, Pierre has helped shape cutting-edge SoC technologies across multiple industries, and today he'll share his vision on cost-effective solutions for generative AI at the edge. Welcome, Pierre. Please share your screen.

Thank you for the introduction. Let me share. So, hello everyone. I'm Pierre Paulin, with Synopsys, based half the year in Ottawa, Canada, and the other half in France, where I'm presenting from today. Normally at 7:00 p.m. I would have had a glass of wine; I have not, out of respect for this interesting workshop, but I'll have one to celebrate at 10 or 11 p.m. The outline for my talk is a quick introduction to the latest trends in transformers, which are the basis of generative AI. My audio has gone quiet. Can anyone hear me?
We hear you perfectly. Yeah. Okay, great. I can hear you. Perfect. Then I'll give a very short introduction to a product we've developed called the NPX6, which is a neural processing unit, then look at the key features of these units needed to support transformers, and therefore GenAI, and then we'll look specifically at the challenges of mapping GenAI onto a neural processing unit like the NPX6, and, if I have time, a quick outlook. Probably that will be for the panel.
So, an amazing change. I entered this space of vision back in 2010 or so. We were working on set-top boxes in my previous job at STMicroelectronics in Europe, and at that time we were doing algorithmic applications, what we call classic computer vision, using DSPs. One of the best object detectors at that time was called SIFT, and it had about 50% accuracy on the ImageNet top-1 number. The revolution happened at the University of Toronto with AlexNet. It took a quantum leap from 50 to 63% in a year or two. That's really when we moved from what I call the prehistoric vision times to the age of CNNs, which is already the medieval times in terms of what we're doing today. If you've been in this space, you know AlexNet became VGG, then ResNet, and then these CNNs kind of saturated at around 90% accuracy, which is still a pretty good number, but it took a ton of innovation over at least 10 years to achieve that, with hundreds of groups and thousands of papers. And then the transformer age started in 2020. The first transformers were developed in the context of natural language processing, and then this approach was applied to vision; within six months they had already caught up with the best EfficientNet CNNs and were exceeding them with ViT applied to vision.
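What made that transfer possible with so little modification is ViT's tokenization step: the image is cut into fixed-size patches, each flattened and linearly projected like a word embedding. A minimal NumPy sketch with ViT-Base-like dimensions; the random matrix stands in for the learned projection, and this is a generic illustration, not Synopsys code:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping, flattened patches:
    each patch becomes one token, the 'word' a vision transformer reads."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    rows, cols = H // patch, W // patch
    p = image.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return p.reshape(rows * cols, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
tokens = patchify(img)                    # (196, 768): 14x14 patches of 16*16*3 values
W_embed = 0.02 * rng.standard_normal((768, 192))
embeddings = tokens @ W_embed             # linear projection into the model dimension
print(tokens.shape, embeddings.shape)
```

From here the sequence of patch embeddings is fed to a standard transformer encoder, exactly as a sentence of word embeddings would be.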
And so we're approaching the asymptotic limit of the information contained in ImageNet. But the key message here is that transformers revolutionized the field and beat CNN results in less than six months.
So transformers, as I've just mentioned, were developed initially for natural language processing; that's the basis of things like ChatGPT. But the really exciting learning that happened in 2021, because we were working in vision, is that transformers as-is can be applied to other domains like vision with very little modification. And we discovered that models combining attention and classical CNN convolutions outperform the old CNNs even at small model sizes. Initially they were quite big, but we've seen things like ViT become MobileViT and get smaller and more compact. So we truly believe that these transformers, and the attention at the core of a transformer, are here to stay.

Let's give an example of why we think they're here to stay. Take the state of the art for CNN-based applications, something we call panoptic segmentation. You have an image on the left-hand side, and a state-of-the-art panoptic segmentation convolutional neural network will identify instances of different classes of objects. So cars, for example: you have the taxi in blue, the minivan in green, people, and that's about it in this case. Beyond just recognizing the objects, it recognizes different instances, so they're shown in different colors. It also does semantic segmentation, so it's not only recognizing the car but its exact contour. Same thing with the person.
So this was the state of the art only three, maybe four years ago in CNNs. If you're building an autonomous driving system, this is very shallow information about what this scene is. Now, this is an odd scene, but let's take the same scene and simply ask the question, "What is unusual about this image?", applying LLaVA. The LLaVA response is: the unusual aspect of the image is that a man is ironing clothes on the back of a yellow minivan while it is on the road. This is an unconventional and unsafe place to perform such an activity, as ironing clothes requires a stable surface, and so on; ironing clothes on moving vehicles could lead to potential hazards for both the person doing the ironing and the other road users. Now, if I'm building an autonomous driving software stack, this is the interpretation I need in order to react. Otherwise, the other system just says there's a pedestrian and there are two cars, and doesn't say anything about what's going on. Reading the text, obviously some lawyers were involved in the training here, but it's still quite remarkable, this richness of interpretation. And we really believe this richness is needed for the next generation of AI, as we've discovered in the last year or two.
So let's switch gears a little bit and talk about our neural processing unit, the NPX6. We started this project when we were in Canada, and some of my key architects actually studied with Geoffrey Hinton at the University of Toronto in the '80s, and my hardware architect studied under Yoshua Bengio at MILA, so I got lucky. I actually resisted this at the time.

[An unmuted participant interjects: "Right. So, 42% of Americans are obese..."]

Yeah, you can keep going. Okay, we're good. We have muted some participants here; some not-so-pleasant comments about Americans, that's all I heard. Okay.
Um, yes. So, I got lucky. My software architect and hardware architect came to me in 2012 and said, "Pierre, AI is really cool. Look at these new CNNs." I was a bit skeptical, to be honest. I had worked in AI in the '90s with knowledge-based expert systems, and that was a failure. But I took a look, I read the papers, and we said, "Okay, let's put together a small task force of five or six people." We built a small first-generation processor, which we delivered in 2014. By 2014 it was clear this was going somewhere, and since then we have developed four more generations, shown in blue here, which are based on and optimized for CNNs.
Back in 2020 we could already see the importance of transformers and this new generation, so we made a big leap to our sixth generation, the NPX6, and that's the basis of our current product. That was a big discontinuity. We learned a lot from CNNs, of course, but that architecture was not flexible enough to accommodate these new applications: natural language processing, or Swin Transformer and ViT applications. In fact, we have three families of cores. We have low-end microcontrollers that operate below 100 GOPS. We have vector DSPs, a general-purpose DSP family that used to be used for computer vision and is now more general; they can do low-end AI applications below 1 TOPS. Once we get to 1 TOPS, we have a scalable family that starts at the NPX6-1K; the 1K means 1,024 MACs. Then you have a 4K, which is 4,096 MACs, all the way up to the 96K, which is roughly 98,000, so about 100,000 MACs. That's about 200 TOPS, and that's our largest single NPU.
And then we can instantiate up to eight NPUs, which gets us beyond 2,000 TOPS. We introduced this two years ago, and in those two years we've licensed over 25 leading-edge customers, half of those in automotive, and some of those are at 2,000 TOPS today. Our leading-edge automotive customers are in the 1,000 to 2,000 TOPS range, and we have other extremes: in-vehicle infotainment at 1 TOPS, or low-power digital still cameras, leading-edge consumer applications, at 1 TOPS. So there are three orders of magnitude between our low end and our high end.

This is a quick overview of the architecture. It's scalable, starting with a set of cores, shown in yellow here, from one to a maximum of 24. Each core internally has two key components: a convolution accelerator that does the CNNs and matrix multiplications, with 4,096 MACs, which can run integer-only or with a floating-point unit option; and, attached to that, a generic tensor accelerator that does anything that is not matrix multiplication or convolution, things like activation functions and a whole bunch of other functions. And then finally, we have a complex multi-level memory hierarchy. Each core has its own level-one memory inside the core; there are up to 24 of those. Then we have a level-two shared memory, with a high-performance, low-latency interconnect, a custom network-on-chip, that moves data between the up to 24 cores, the level-two shared memory, and of course the external DRAM memory.
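To make the cost of this hierarchy concrete, here is a toy Python model of data-movement energy. The per-byte figures are invented placeholders, not NPX6 numbers; the only point is the relative ordering, with external DRAM accesses orders of magnitude more expensive than local SRAM, which is why keeping data on chip matters so much:

```python
# Illustrative per-access energies (picojoules per byte) -- made-up values,
# chosen only to reflect the typical ordering L1 << L2 << external DRAM.
PJ_PER_BYTE = {"L1": 0.1, "L2": 1.0, "DRAM": 20.0}

def movement_energy_uj(bytes_by_level):
    """Total energy in microjoules for the bytes moved at each level."""
    return sum(PJ_PER_BYTE[lvl] * b for lvl, b in bytes_by_level.items()) / 1e6

# The same 10 MB of activations, reused from L2 versus re-fetched from DRAM:
from_l2 = movement_energy_uj({"L2": 10_000_000})
from_dram = movement_energy_uj({"DRAM": 10_000_000})
print(from_l2, from_dram)  # 10.0 vs 200.0 microjoules: a 20x gap
```

Under these assumed numbers, every tensor that can be kept resident in L2 instead of round-tripping to DRAM saves roughly 20x on movement energy, which is the trade the compiler's scheduling is constantly making.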
Each of these cores has its own local DMA, and we have a top-level DMA called the streaming transfer unit. Each core also has its own internal controller, a small RISC controller, and there are a couple of controllers at the top level.
Even though the block diagram talks a lot about hardware, this is a large double-digit team, and exactly half that team is developing the tools, which is probably the single biggest challenge, even more difficult than the architecture design: compilers, runtimes, SDKs, simulators, platforms.

On key architecture features: one of the objectives we gave ourselves when we moved from our fifth generation of CNN-based machines to this more general class was to go beyond CNNs and support things like RNNs, which were already getting old, but mostly transformers, GenAI and recommender networks, the classes of applications that emerged in 2021 and beyond. And not only moving beyond CNN applications, but also broadening the types of sensors we use. Initially we were mostly focused on vision; we generalized to multiple sensor classes like radar and lidar, which are heavily used in automotive, which is about half of our customer base. The other hard lesson was that flexibility is essential everywhere. We kept thinking CNNs were going to stabilize at some point; we were proven wrong every generation. So we added more flexibility, and in this architecture we went even further: we have a fully programmable generic tensor accelerator which complements the convolution accelerator. Both are extremely flexible, and the generic tensor accelerator is fully programmable. We also went wider in data types, with integer 16 and integer 4 as well as an option for floating-point 16 and brain float
16. So that's the flexibility side, which is a key objective. The other objective is continued improvement in efficiency. We've seen a MAC utilization improvement of about 1.5x to 2x, based on all our lessons learned on state-of-the-art CNNs like MobileNet and EfficientNet, and then on GenAI like Stable Diffusion, Llama 2 and so on. We also brought in sparsity: we have a form of structured sparsity very similar to what's used on general-purpose GPUs like NVIDIA's. You get somewhere between 1.4x and close to 2x performance increase by using this structured sparsity. And all the R&D is around bandwidth reduction. The challenge is moving data: the challenge in power, in complexity and in software tool features is all around data movement. It's not about putting down hundreds of thousands of MACs; that's the easy part. Putting memories all over the place is easy. It's about intelligently moving the data through the architecture.

We also improved latency, because it's not only about getting high throughput using high batch sizes, which was the trick used in the early 2020s. In automotive it's not about throughput, it's about latency: the time to detect a pedestrian, or a guy ironing on the back of a van, is what matters most, not so much the throughput. And finally, we continue to make power-efficiency improvements based on different techniques, such as gating.

I'm not going to explain this in detail, just to say that if you add up all the different activities: a level-one RISC core doing control, a DMA, a convolution accelerator, a generic tensor accelerator doing activations and softmaxes, and an output DMA, you have 13 parallel activities that need to be sequenced and scheduled. That gives a hint of some of the challenges of dealing with the complexity of these architectures.

So we've invested heavily: like I said, half of our team, even a little more, is building the compilers and runtime. The flow takes standard representations like PyTorch and TensorFlow, converts them to the industry-standard representation ONNX, and compiles that into an execution plan interpreted by a runtime. 99% of this runs on the NPU, but certain customers have special secret sauce that may not run directly on the accelerator; that can run on our vector DSP family, and it is also handled by the tools.
On the different use cases: for exploration, we compile onto virtual platforms like Platform Architect and Virtualizer, which are tools developed by other groups in Synopsys; we have functional and performance models; and we have emulators and boards. So what can you do? Here's a simple example with a YOLOv5 model, exploring the impact of bandwidth on throughput. You might start at 250 GB per second as an upper bound on bandwidth, assuming you have expensive HBM interfaces, and look at the impact of bandwidth on frames per second. The other dimension, shown by the colors, is the size of on-chip memory. The CSM is our level-two cluster shared memory: the purple curve has no on-chip level-two memory, while the yellow curve at the top has 16 megabytes for this machine. You can see there are different trade-offs with more memory: you're less sensitive to bandwidth, because you can store more data on chip and are therefore not as sensitive to DRAM. Say you had a target of 500 frames per second. Quite a few data points meet that target, so you can trade off: do I want to spend more money on bandwidth, which has cost and power impacts, or do I want to spend more money on memory in order to reduce bandwidth? With the green curve, for example, which is a nice trade-off, you can achieve 600 frames per second with 32 GB per second and 8 megabytes, or even go down to the red curve, which is 4 megabytes, and still achieve just above 500 frames per second. The tools allow you to do this automatically, and that's a key point, because doing it manually is a non-starter. The complexity of these machines does not allow you to do even one of these 25 data points in less than days or weeks, while this can be done in a couple of minutes on our exploration
tool. So, on transformers, I'm going to go fast; I think we're running out of time. Just to say there are key features: there are features in the convolution accelerator that are unique and different from CNNs, things like matrix-matrix multiplication instead of just the matrix-vector multiplications used for CNNs. You need matrix-matrix, and feature maps appear on both operands, not just on one side; those are just examples. You need the very flexible generic tensor accelerator to do things like softmax, layer normalization and new activation functions like GLU. And finally, you need a dedicated DMA that does complex things like embedding lookups. These are all features needed to support the constructs of a transformer, which are the basis for GenAI.
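The constructs just listed are easy to see in a toy NumPy sketch of scaled dot-product attention: two matrix-matrix multiplies with a softmax between them, plus layer normalization. This is a generic illustration of the math, not the NPX6 mapping:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def layernorm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(Q, K, V):
    """Scaled dot-product attention. Note both operands of Q @ K.T are
    activation ('feature map') matrices: matrix-matrix, not matrix-vector."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 64)) for _ in range(3))
out = layernorm(attention(Q, K, V))
print(out.shape)  # (8, 64)
```

A CNN-era accelerator that only streams weights against activations has no natural home for `Q @ K.T`, where both operands are data-dependent, which is exactly why the convolution accelerator needed the matrix-matrix extensions described above.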
So, to give you a sense of the efficiency, we've run vision transformers, ViT and Swin, for different input sizes. This is on our single core, the 4,096-MAC configuration, and you can see our MAC utilization varies between 60 and 70%. Our bandwidth is in the range of an LPDDR5, somewhere between 20 GB/s and 30 GB/s. Of course, under NDA our customers can get the exact numbers; I'm just giving you a range here. A key message is that if you run these on a GPU, MAC utilization is typically below 5%, sometimes 10%, rarely above 20%. Machines in the embedded space need to be much more efficient; these are more dedicated machines than a GPU, and you can get much higher utilization at very low area and power.
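As a back-of-envelope, MAC utilization is just achieved MAC throughput over peak. The sketch below uses made-up workload numbers (model MACs per inference, frame rate, clock) chosen only to land in the 60-70% band quoted; none of them are vendor figures:

```python
# Back-of-envelope MAC utilization: achieved MAC throughput over peak.
# All workload numbers below are illustrative, not vendor figures.
def mac_utilization(model_macs, fps, num_macs, clock_hz):
    peak_macs_per_s = num_macs * clock_hz   # one MAC per unit per cycle
    achieved_macs_per_s = model_macs * fps  # work actually completed
    return achieved_macs_per_s / peak_macs_per_s

# e.g. a ViT-style model needing ~4.6 GMAC/inference at 800 fps on a
# 4,096-MAC core clocked at 1.3 GHz (hypothetical operating point):
util = mac_utilization(model_macs=4.6e9, fps=800, num_macs=4096, clock_hz=1.3e9)
print(f"{util:.0%}")  # ~69%, i.e. in the 60-70% band quoted in the talk
```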
So, Gen AI, with stable diffusion as the example. I'm going to skip the details because we're out of time; it'll be in the material we leave with the session. I just want to show how this compares. Our NPX6 32K running in dense mode, where there's no sparsity, will match an RTX 3060. The 32K with structured sparsity will match the Titan RTX; that's about 30 frames per second for Stable Diffusion version 1.5. So that's a $200 machine, and it consumes about 200 watts. Just as a ballpark, our machines are less than 10 mm² at 5 nm, while a Titan RTX is many hundreds of mm². Maybe even more importantly, they consume less than 2 watts, compared to 200 watts on a general-purpose GPU like a Titan. And that's our mid-range machine; we can go higher, of course, with the 64K, approaching the state of the art of a year and a half ago, when this chart was developed. But the key message here is that by specializing, by developing a neural processing unit rather than a GPU variant, you can get these two orders of magnitude of power reduction and an order of magnitude or two of area reduction.
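The "orders of magnitude" claim is simple arithmetic on the figures quoted in the talk; the GPU die area below is an assumed ballpark for "many hundreds of mm²":

```python
# The specialization gap, made explicit with the figures from the talk.
# The GPU die area is an assumed ballpark for "many hundreds of mm^2".
npu_watts, gpu_watts = 2.0, 200.0          # <2 W NPU vs ~200 W GPU
npu_area_mm2, gpu_area_mm2 = 10.0, 400.0   # <10 mm^2 NPU vs assumed GPU area

power_ratio = gpu_watts / npu_watts        # 100x: two orders of magnitude
area_ratio = gpu_area_mm2 / npu_area_mm2   # 40x: one to two orders of magnitude
print(f"power: {power_ratio:.0f}x  area: {area_ratio:.0f}x")
```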
This applies to stable diffusion, but we can also apply it to Gen AI models like Llama 2. All I want to say about Llama 2 is that its real challenge is bandwidth limitation. So you need tricks to reduce your coefficient size, going from, say, integer 8 to integer 4.
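A rough model of why coefficient precision matters so much here: in autoregressive decode, each generated token must stream (roughly) the full weight set from DRAM, so the token rate is capped at bandwidth divided by model bytes. The 7B parameter count and 32 GB/s figure below are illustrative assumptions, not quoted results:

```python
# Why weight precision matters for LLM decode: each generated token must
# stream (roughly) the whole weight set from DRAM, so token rate is capped
# at bandwidth / model bytes. The 7B size and 32 GB/s are illustrative.
def max_tokens_per_s(n_params, bits_per_weight, dram_gb_per_s):
    model_bytes = n_params * bits_per_weight / 8
    return dram_gb_per_s * 1e9 / model_bytes

bandwidth = 32.0  # GB/s, an LPDDR5-class ballpark
for bits in (8, 4):
    print(f"int{bits}: ~{max_tokens_per_s(7e9, bits, bandwidth):.1f} tokens/s ceiling")
# Halving weight precision (int8 -> int4) roughly doubles the achievable
# decode rate, because the workload is bandwidth-limited, not compute-limited.
```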
Internally you can use higher accuracy, like integer 16 or FP8/FP16, to preserve accuracy, but the bottleneck to DRAM is the coefficient bandwidth and the large model sizes. What we've discovered is that if you use all the available bandwidth, we match any public result using the same amount of bandwidth, because the workload is completely bandwidth-limited, not resource-limited. That's not the case for stable diffusion, which is compute-limited.

So, to summarize (and I ran a few minutes over time): transformers are the baseline for these deep learning models, developed initially in the field of natural language processing.
In the early 2020s we discovered that, within six months, transformers were able to achieve state-of-the-art results in vision and other domains, and at that time, in 2020-2021, we started the design of the NPX6 generation and took a bet that transformers were here to stay. That bet has so far proven right: not only have transformers remained, they're the building block for the latest generation of Gen AI, like stable diffusion, Llama 2, ChatGPT, and the very latest mixture-of-experts-based approaches like DeepSeek, which go beyond that and make more compact models by having a dynamic model size. We can support that as well; we've already done preliminary benchmarking for DeepSeek. So this space is moving quickly. These models were initially in the cloud on high-performance, high-cost, high-power GPUs, and they're moving quickly into the embedded space, and we believe we're prepared for this. Thank you. I'm happy to take one or two questions.
Thank you, Pierre. Just a reminder, you can post your questions in the chat and they will be addressed by the speakers. Since there are no questions in the chat yet, I'll ask one myself, Pierre: how do you keep up with the rapidly evolving AI models to ensure the hardware architecture remains compatible and efficient?

Yeah, so far so good. It's been five years since we built the spec. We took fundamental, basic primitives as the building blocks and we made everything programmable. That being said, it's programmable and efficient around a certain class of applications, which today are transformer-dominated, with a lot of flexibility and complexity, but still CNNs and transformers. Hopefully that's still true for the next couple of years, and then we have a market that's valid for us. If there's a completely new invention, we'll discover it along with the rest of the world. But for the moment, we don't see that threatening the choices we made: flexibility around a class of transformer-based and Gen AI-based applications. So far so good. But I don't have a crystal ball: will there be a new invention in two, three, four years? This happens in AI every five years.
I have a question from the chat: are you planning on targeting the bigger-parameter models, or is there interest in smaller, more specialized models as well?

That's really a question about our customers, and the answer from our customers is the latter. We're in the embedded space, not the cloud. We have one or two customers kicking the tires, but most of our committed customers are half automotive and half high-end and low-end consumer, and all of those really want these smaller, more specialized models, because the workload is completely bandwidth-limited. It's not realistic to use the large models that are used in the cloud.
Okay. Thank you, Pierre. Next up is Davis Sawyer from NXP Semiconductors, a Canadian tech entrepreneur and AI products marketing manager. Davis also chairs the EdgeAI Foundation's industry working group and brings a unique blend of business and technical insights. Today he'll talk about secure, fine-tuned LLMs for Gen AI at the edge. Welcome, Davis.

Awesome, thank you, and thanks, Pierre. A great way to kick things off, with a lot of great background and context on how we've come to this place
and I'm looking forward to diving in. It's definitely true that one key insight is that CNNs were compute-bound, while transformers are now more memory-bound. At the edge, edge semiconductor players like us inherit what happens in the cloud and some of the innovations there, and then look for markets and opportunities and build silicon that can support them. Interestingly for this talk, despite NXP obviously being a semiconductor company, I'm actually going to spend more time talking about software and some of the tools we've built on top of our SoCs and products that make it easier to deploy whole applications. We definitely need benchmarks and these testimonies to performance, both to attract customers and to back up whether this is viable for practical use cases. In this talk, I'm going to show some of the software pipelines we've built that are now available, which is exciting. I'll point to some links and GitHub repos that the audience can access as of today to start seeing for themselves some of the value we think we've created here. But I'll dive in, assuming you can see my slides and hear me. Everyone's gone quiet, so I'm going to jump in; I know we have limited time, so I'll try to be as effective as possible.

(Yes, okay, all good. Thank you.)

Excellent. So, here's the high-level overview for today. First, what we call the intelligent edge, which I'll define a little more specifically; "the edge" can be a nebulous term, so I'm going to try to be precise about what we target. I'll also give a high-level overview of our AI software stack and the Neutron NPU. Like Synopsys and others, we have a portfolio of in-house but also licensed NPUs that give a good range of flexibility to our products to meet different workloads. It's really about rising to meet what the application demands and having the support for that in throughput, memory, CPU usage, price-performance, power-performance, all that kind of stuff. I'll then do a deeper dive on our GenAI Flow and RAG database generator; these are two distinct software tools that help create the fine-tuned, secure LLMs we referenced in the title. Then I'll give what I think is a sneak peek at the future of where we see the edge market, which is enabling multimodal Gen AI, and some recent strategic moves NXP has made to support that from our product portfolio. I'll wrap up with a summary, hopefully some questions, and I'm looking forward to the panel as well. So
when we talk about the edge and the opportunity we focus on: there are some companies, named earlier, in the GPU space specifically, that dominate what we think of as the training opportunity. There is some training shifting to the edge; we definitely see that in factories, maybe locally for smart home hubs, maybe automotive as well, where you have some kind of connectivity shaping how these models are updated, which is the training part. But we focus on the prediction part, and one of my favorite sayings in this space is that training happens once and inference runs forever, for any specific model. So I think there's a big opportunity when you focus on even just the inference piece and what that means for AI/ML software and the supporting
devices.

NXP looks at the intelligent edge through a few lenses, and there's definitely not enough time to cover all of our enablement, options, and portfolio today, so I'll focus on a few key messages. One key message is that we scale up from our MCX N MCUs, which are the lowest footprint in physical size and power consumption, enabled by Neutron as of today; we have a lot of time-series and sensor-data use cases driven by that, some interesting ones already built over the last few years. We've also more recently introduced a crossover brand of MCUs, under the i.MX RT flag, that again support the wearable category and other power-sensitive use cases, but also bring some ML capabilities into that market. Then our i.MX application processors are really our bread and butter for computer vision, voice, and time-series data as well, scaling up to the kind of stuff Pierre was alluding to with transformer-based workloads, which of course dramatically improve the accuracy you can achieve for computer vision or perception use cases in general. But there is definitely a new class of applications enabled by the reasoning, or cognitive, abilities of LLMs and vision LLMs, which I'll actually demo in a second.

This is all built on top of our eIQ software stack, which gives our customers an easier and faster path to market by demystifying and simplifying a lot of the work that has to happen. You may not need to optimize every model exhaustively, and you don't need to do retraining in many cases; but across the whole spectrum from off-the-shelf to heavily optimized, eIQ rises to meet those demands. We have a few newer components that I won't cover in depth today, but I want to mention Time Series Studio, initially focused on MCUs but now running on specific i.MX SoCs as well, which helps with AutoML for time-series models. I'll focus today on GenAI Flow, because one of the themes is Gen AI, and I think we actually cover almost every mega-theme mentioned at the start today, except AI in harsh environments, which would be cool to get to: we cover model optimization, software tools, hardware design, and of course Gen AI at the edge. Again, this is meant to be flexible, with productivity enhancement but also energy efficiency, and of course the performance needed for the
task.

It wouldn't be a good AI talk if we didn't mention the eIQ Neutron NPU, which again fits that scalability story: we scale up from the MCX N you see on the lowest end here, up to products that exist today with external NPUs as well. That's how NXP has approached the market and may continue to, with a flexible product portfolio. Our i.MX 93 uses the Arm Ethos-U65 microNPU, a pretty capable engine with a software pipeline and eIQ components that really help get the most performance possible. Our i.MX 8M Plus has been in the market for a few years. And then what I'd call our lead puncher, our flagship for Gen AI applications, is the i.MX 95, also available today, which uses our in-house NPU. On paper we list it as a 2-TOPS engine, but I actually think when you see the performance, it punches above its weight. That's certainly true of CNNs, your classic classification, object detection, and segmentation networks, some of the stuff covered earlier, through to the newer generation of transformer-based models and workloads like vision transformers.

Actually, I'll focus a bit more on the LLM side. We've built a really capable voice UI, a voice AI pipeline, where when you drop in an LLM at the edge, for both privacy and real-time response reasons, and you have the silicon to power it, you can create really interesting HMI (human-machine interface) and other application spaces that weren't possible even a few years ago, until we had this transformer breakthrough and then the edge silicon to power
it. An underrated part, because one classic perspective of AI/ML practitioners is that it's all about the AI; but NXP approaches this with security in mind, really best-in-class security, even, as I've recently learned, post-quantum cryptography, which is super important for financial, automotive, and regulated domains. We deliver that today. So when you combine security plus intelligence at the edge, I think it's a very compelling offering, and that's what NXP embodies.
Now I'll go a little deeper into GenAI Flow, which is something I'm happy to say I own and drive here at NXP, and there's a lot more innovation coming in the back half of this year. It actually exists today, and I'll point to a couple of resources, but to give you a sense of why we built it and what problems it solves, I'll first look at our Gen AI ambitions. What are we trying to do here? We already play in a wealth of markets: automotive being one, consumer to some degree (mostly smart home), industrial, smart building, power management; a wide breadth. Because of this wide breadth, we have a big opportunity to bring Gen AI into these domains. Some already have some AI, for others AI is a newer technology, but all the same, we have these new capabilities that we want to be able to leverage. This is not an exhaustive list, but certainly a good glimpse of places where we're making a difference
already. The challenge, though, is that the Gen AI landscape, as we all know, is changing rapidly. The way I've tried to parse and present this for audiences like today's is: you have a core stack in the middle, the libraries we're familiar with, like the AI frameworks. Then you have a lot of necessary components that you need to either support or interface with somehow; communication protocols are another big one for us. For the edge AI stack, this is the classic "sandwich" diagram we see in a lot of places, and I've tried to bifurcate it from the moving parts of AI that are really dynamic. You get lots of innovation there, but do you need to focus on it as your core expertise, and if you don't, how do you leverage it? The takeaway I've tried to distill, for us at NXP and for the edge, is to solve problems by creating optionality to leverage the best of what changes. As we saw in the previous talk, leveraging transformers was a huge deal, and what comes next will probably change frequently; we don't know what it is, but having the optionality to leverage it will be important. We also want to optimize what doesn't change as often. We have this paradox of models changing weekly or monthly, while silicon and products need to live at the edge, in devices, for 10 to 20 years in some cases. Because of this paradox, we need to find a way to balance both performance, getting the best of what's possible, and longevity; and when it comes to i.MX in particular, longevity is a big part of what we do and stand for. Now, the
challenges we have with Gen AI as an industry should be familiar to anyone who has played with ChatGPT or other off-the-shelf or API-based LLMs. They're limited in context understanding. They're compute-expensive; that's a given. They have hallucinations, they just make stuff up, and combating that goes to a pretty deep part of how they work; it's not something readily solved unless you have some powerful software contributions, and RAG (retrieval-augmented generation) is one that's talked about a lot and I think is here to stay; it's what we focus on at NXP. And then there are errors in reasoning. All this to say: if you want to use this for a use case where there's material or people involved, you can't have any of these issues; you have to address them somehow, and in most cases that happens in software. The sweet spot we found, by assessing what's out there and the approaches we have: on the bottom is how most of us use LLMs for day-to-day tasks that are not mission-critical; on the top are approaches limited to a very small audience in the world with the skills and compute to pursue them; but here in the middle is a kind of Goldilocks scenario with an ideal performance-overhead trade-off. What we like specifically about RAG is that it protects users' IP from the model creation, even from ourselves, actually. We want an environment where you have secure and fine-tuned LLMs without compromising things you can't compromise. This also lowers time to market, because you're not retraining; you're creating a database, something that can be stored on the edge device, by parsing domain knowledge and specific know-how in different forms to be interfaced with an LLM, and actually a voice UI, as we've done in GenAI Flow.
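To make the RAG pattern concrete, here is a toy, dependency-free sketch. It is not NXP's implementation (the talk doesn't show GenAI Flow's internals), and the bag-of-words "embedding" merely stands in for a real embedding model; the pattern is the same: embed domain chunks into a small local database, retrieve the best match, and prepend it to the prompt so the LLM answers from grounded text:

```python
import math
from collections import Counter

def embed(text):
    # stand-in embedding: bag-of-words term counts
    # (real systems use a trained embedding model)
    return Counter(text.lower().split())

def cosine(a, b):
    # cosine similarity between two sparse count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "The pump must be serviced every 500 operating hours.",
    "Error code e42 means the coolant level is low.",
    "The device ships with a 24 month limited warranty.",
]
db = [(c, embed(c)) for c in chunks]  # the on-device "RAG database"

def retrieve(query, k=1):
    # return the k chunks most similar to the query
    q = embed(query)
    return [c for c, _ in sorted(db, key=lambda e: -cosine(q, e[1]))[:k]]

context = retrieve("what does error e42 mean?")[0]
prompt = f"Answer using only this context:\n{context}\nQ: what does error e42 mean?"
print(prompt)
```

The key property the talk highlights falls out of this structure: the domain documents live only in the local database, never in the model weights, so the model creator (or anyone else) never sees the user's IP.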
The other problem, of course, is model size. I won't belabor this point, but one interesting rule of thumb is that for every billion parameters, we need about a gigabyte of memory; not bandwidth, but actual memory to store these weights. In int8 precision, a billion parameters is around a gigabyte. What this means, as Pierre alluded to, is that these models tend to become memory-bound. So, because of these two precursors,
we want fine-tuning and we want optimization, and we've baked both into a software tool we call GenAI Flow. This program is available for free to deploy off-the-shelf models, and we also have a commercial version we provide to customers, today on the i.MX 95 flagship, possibly other families in the future. That's where we bring both capabilities together: fine-tuning to adapt these LLMs to your domain knowledge without compromising IP, while also eliminating those errors in reasoning, those hallucinations, which you must do for industrial use; and on the other side, optimization to get the performance you need. I would say the best performance is always wanted, but acceptable performance actually gets the job done.
Going a little deeper, GenAI Flow is made of modules. These modules are the building blocks of the Gen AI use cases we found to be quite common. For voice, you have speech-in and speech-out components, and wake events, think "Hey NXP", "hey XYZ", that wake the pipeline up; in the future that trigger could also be a visual event. The current version is focused on conversation, as I mentioned. I'm trying to go fast here, I apologize, since time is obviously limited, but I'm happy to go deeper on any of these topics later.

Here's a quick demo of what we're doing with this RAG engine and why it's so valuable, especially for contexts like medical: it can give answers tailored to data that is grounded in factuality, grounded in truth, and relevant to the domain. This demo uses an older, bigger model; we've since made the response time a lot faster. You can see the text being generated. We've introduced a streaming mode, so you don't have to wait for all the tokens to be produced; you can start using the earlier tokens faster, which is more conversational. And then you get the TTS at the end. That's the full use case: Gen AI plus a voice UI powered by LLMs, which we didn't have before. It brings a lot of capability.
A closer look at the numbers here. Where we were before, using only the CPU, it doesn't make sense: people read at around 5-10 tokens per second, but that 1.5-second delay just isn't natural, isn't suitable for conversational AI. Of course, we improve that dramatically when we offload to the Neutron NPU, which greatly accelerates time to first token while achieving a meaningful token speed. Just for reference, this is six Cortex-A CPUs versus a single NPU; we're getting the performance of about six high-performance Cortex-A cores out of our Neutron NPU on the i.MX 95. Pretty cool.
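One way to see why time-to-first-token dominates the feel of a streaming voice UI: perceived delay is roughly TTFT plus the first few tokens needed before TTS can start, while the sustained rate only has to outpace speech. In this sketch, the 1.5 s delay and 5-10 tokens/s figures come from the talk, but the NPU TTFT and the tokens-before-speaking count are assumptions for illustration:

```python
# Perceived latency of a streaming voice assistant: roughly time-to-first-
# token (TTFT) plus the first few tokens needed before TTS can begin.
# The 1.5 s delay and 5-10 tokens/s come from the talk; the NPU TTFT and
# tokens_before_speaking are assumed values, not measurements.
def perceived_delay(ttft_s, tokens_per_s, tokens_before_speaking=3):
    return ttft_s + tokens_before_speaking / tokens_per_s

cpu_only = perceived_delay(ttft_s=1.5, tokens_per_s=5)    # ~2.1 s: feels laggy
with_npu = perceived_delay(ttft_s=0.3, tokens_per_s=10)   # ~0.6 s: conversational
print(f"CPU only: {cpu_only:.1f} s   with NPU: {with_npu:.1f} s")
```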
As I mentioned at the start, this is available today. We'd love for you to start playing with the voice UI that can be built with GenAI Flow today, with future engines coming later this year. The RAG database generator is also a unique tool: PDFs in, database out. I think that's super effective for getting a sense of the quality, of how effective this process is for fine-tuning answers, but also of the efficiency, because relative to the size of an LLM, these databases are designed for the edge and quite small.
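A hedged sizing sketch of why such a database stays small: it stores one embedding vector plus the source text per chunk, versus roughly a gigabyte per billion parameters for int8 weights. The chunk count, embedding dimension, and chunk length below are assumptions, not tool parameters:

```python
# Why a RAG database fits on an edge device: it stores one embedding vector
# plus the text per chunk, not billions of weights. Chunk count, embedding
# dimension, and chunk length are assumptions for illustration.
def rag_db_mb(n_chunks, embed_dim=384, bytes_per_val=4, avg_chunk_chars=500):
    vectors = n_chunks * embed_dim * bytes_per_val  # float32 embeddings
    text = n_chunks * avg_chunk_chars               # the source passages
    return (vectors + text) / 1e6

def llm_gb(n_params, bits=8):
    return n_params * bits / 8 / 1e9  # ~1 GB per billion params at int8

print(f"{rag_db_mb(2000):.1f} MB database vs {llm_gb(1e9):.0f} GB model")
```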
I'll wrap up with a quick glimpse of, like I said, the future, before hopefully leaving time for a couple of questions. We've made a big move in intending to acquire Kinara, whose NPUs are dedicated to the latest and greatest workloads with quite efficient power, which fits the edge while expanding nicely on the host SoCs that are of course NXP's bread and butter. We think this is a really great union of technologies, teams, and commercial focus as well. And to give you a sense of that, one space I focus on might be an unsexy industry, but it's industrial, and I think industrial is ripe for Gen AI innovation, for reasons like we see here: using a vision transformer plus a multimodal LLM to understand what's happening in a series of images. I won't go through all of this here, but I think it's a great example of the kind of visual intelligence that could be layered with an agent, for example, to then notify emergency services, notify a supervisor, and actually take an action. Going from perception to perception-plus-action is a big theme for us with Gen AI at the edge.

So, a quick summary: we can bring domain-specific intelligence to the edge. These LLMs can be optimized and deployed; they can also be fine-tuned; and then we leverage the efficient acceleration we have in the i.MX family and integrated or discrete NPUs, plus GenAI Flow, which really serves as a one-stop shop. That's how we bring Gen AI to life at NXP today. Thanks again for your time, everyone, and hopefully we have a few questions.
Thank you, Davis, for this great presentation; there's a lot of material to digest today. I have a question from my colleague at CMC, James Miller: is there anything in the stack that helps real-time system developers ensure timely, bounded results? For example, guaranteeing critical responses within 50 milliseconds for safety in applications
like robotics and automotive. Yeah. So
for automotive uh I would call it so EIQ
is what we is AI space qualified. Um so
I think that there are some components
of that that help with these
deterministic requirements. One problem
of course with LM and AI in general is
that it has a stoastic or probabistic
nature. So when it comes to just LLM
throughput or these models uh having a hard cut off on their responses might be
hard cut off on their responses might be a little trickier. But for things like
a little trickier. But for things like the infotainment system automotive,
the infotainment system automotive, we've already deployed LMS at at
we've already deployed LMS at at reasonable conversation speeds. So I
reasonable conversation speeds. So I think for that kind of stuff, we can
think for that kind of stuff, we can already see a lot of innovation
already see a lot of innovation happening with the kinds of applications
happening with the kinds of applications that have these harsh requirements. You
that have these harsh requirements. You need a provider automotive grade uh AI
need a provider automotive grade uh AI and hardware stack to meet those needs.
and hardware stack to meet those needs. Yeah, thank you Davis. See you in the
Yeah, thank you Davis. See you in the panel. Thanks. Our next speaker, our
panel. Thanks. Our next speaker, our next speaker is Professor Warren Gross,
next speaker is Professor Warren Gross, a James Miguel professor and chair of
a James Miguel professor and chair of the department of electrical and
the department of electrical and computer engineering at Miguel
computer engineering at Miguel University. Warren's research bridges
University. Warren's research bridges algorithm and hardware with a focus on
algorithm and hardware with a focus on efficient deep learning models and
efficient deep learning models and hardware for machine learning. Today
hardware for machine learning. Today he'll present on parameter efficient
he'll present on parameter efficient fine-tuning of transformer-based
fine-tuning of transformer-based language model using data set pruning.
language model using data set pruning. Please join me in welcoming Warren. Why
Please join me in welcoming Warren. Why don't the stage is yours? Thank you very
don't the stage is yours? Thank you very much. Uh and it's a pleasure to be able
much. Uh and it's a pleasure to be able to speak again at this accelerating AI
to speak again at this accelerating AI workshop. This is
workshop. This is uh not the first time I've been here and
uh not the first time I've been here and I always enjoy the interactions and the
I always enjoy the interactions and the talks at this workshop. Uh can you all
talks at this workshop. Uh can you all see my slides? Okay. Yes. And we hear
see my slides? Okay. Yes. And we hear fine. Thank you. Okay. Great. So what
fine. Thank you. Okay. Great. So what I'm going to do today is talk about um
I'm going to do today is talk about um fine-tuning language models. following
fine-tuning language models. following on what Davis was talking about and I'm
on what Davis was talking about and I'm going to talk about some things that you
going to talk about some things that you can do in the U training process to make
can do in the U training process to make the um
the um uh the finetuning more
efficient. So we're we're talking about LLMs uh language models. Uh there was
LLMs uh language models. Uh there was great introductions to this area and
great introductions to this area and transformers in in the last two talks.
transformers in in the last two talks. Uh, I just wanted to show you um
Uh, I just wanted to show you um something that I found online uh as of
something that I found online uh as of late last year, some of the
late last year, some of the state-of-the-art LLMs. And what we're
state-of-the-art LLMs. And what we're seeing is that now in terms of the
seeing is that now in terms of the number of parameters in these LLMs,
number of parameters in these LLMs, we're talking about trillion plus
we're talking about trillion plus parameters. Uh these are absolutely
parameters. Uh these are absolutely enormous. And you know, looking around
enormous. And you know, looking around trying to find information about the
trying to find information about the training cost of these models, you see
training cost of these models, you see that you really need upwards of $100
that you really need upwards of $100 million to train uh one of these large
million to train uh one of these large language models from scratch. Now,
language models from scratch. Now, things changed very uh dramatically uh
things changed very uh dramatically uh at the end of the year when we saw Deep
at the end of the year when we saw Deep Seek uh which has um uh a training cost
Seek uh which has um uh a training cost uh dramatically lower of about $6
uh dramatically lower of about $6 million. uh and this has you know also
million. uh and this has you know also though it does have a lot of parameters
though it does have a lot of parameters but it's bit but smaller at 671 billion.
but it's bit but smaller at 671 billion. So there there are things that we can do
So there there are things that we can do in model design and we and also clever
in model design and we and also clever training uh to reduce this training cost
training uh to reduce this training cost but there there needs to be still uh
but there there needs to be still uh more attention paid to the efficiency of
more attention paid to the efficiency of training to bring it down even further.
Looking at the trends in transformer size in terms of the number of parameters, what we're seeing is an exponential increase in model size, at a rate of about 10 times per year. This is a very significant increase, and what we find is that as models get bigger and bigger, you also need more and more data to accurately train them. So there are really two pieces to this: there's how you train efficiently, considering the complexity of the hardware you're training on, but there's also a data set piece, how you choose the data sets. We're going to talk about both parts of this in today's talk.
So the way the training challenge is addressed in large language models really breaks down into a two-stage process. The first stage is pre-training, and that's the expensive stage. This is when you train from scratch on a huge pre-training data set. This is what takes the millions and millions of dollars, and it is really only something that can be done by large companies that have access both to the large data set and to the computational resources needed to do it. But once you have this general, large pre-trained model, you can then fine-tune it in a second stage of training on a specific data set and for a specific task. This fine-tuning data set is usually smaller, and the fine-tuned model is then adapted to solving a particular task.

In this talk, we're going to focus on the fine-tuning stage of this process, given an existing pre-trained
model. And so I wanted to talk about two key hardware metrics in the training process. One is the training time, the amount of time it takes to train from start to end of the process, and the other is the peak memory usage.

Why is training time important? Because it directly influences metrics of interest to everyone, for example energy usage. The longer it takes to train, the more energy the process consumes, and the energy usage directly impacts the electricity bill or the battery life of the device, as well as the carbon footprint. The peak memory usage, on the other hand, determines the minimum amount of memory you need to allocate on your device, such as a GPU or an NPU. More memory means more expensive devices, and since you need many devices, this can be a very significant cost as well as a training difficulty: if your model doesn't fit in the memory of your GPU, for example, you may need to partition the training process and move data in and out, which complicates training. The other key aspect of memory is the number of memory accesses, and this actually has a very strong influence on energy usage. So these two are not completely independent, but they're both key metrics we want to decrease: we want faster training using less memory.
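To make these two metrics concrete, here is a back-of-the-envelope sketch of peak training memory. It assumes fp32 storage and an Adam-style optimizer keeping two moment tensors per trainable parameter, and it ignores activations, so it is a rough lower bound for illustration only, not the speaker's accounting.

```python
def finetune_memory_gb(n_params, trainable_frac=1.0, bytes_per_value=4):
    """Rough lower bound on training memory: weights + grads + Adam moments.

    Assumes fp32 (4 bytes per value) and an Adam-style optimizer with two
    moment tensors per trainable parameter. Activations are ignored, so
    real usage is higher. Illustrative sketch, not the speaker's numbers.
    """
    weights = n_params * bytes_per_value                         # all weights stay resident
    grads = n_params * trainable_frac * bytes_per_value          # gradients only for trainable part
    optimizer = n_params * trainable_frac * 2 * bytes_per_value  # Adam: m and v moments
    return (weights + grads + optimizer) / 1e9

full = finetune_memory_gb(350e6)          # full fine-tuning of a ~350M-parameter model
frozen = finetune_memory_gb(350e6, 0.01)  # only ~1% of the weights trainable
print(f"full: {full:.1f} GB, mostly-frozen: {frozen:.1f} GB")
```

Note how freezing most of the model shrinks the gradient and optimizer-state terms, which is exactly the memory saving discussed next.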
So I'm going to introduce the picture on the right here, which we'll use throughout the talk to show the effect of the different innovations that can be applied to model training. On the vertical axis we'll plot peak memory usage, and training time goes on the horizontal axis. When you compare pre-training and fine-tuning, you see that fine-tuning has a much smaller training time than pre-training, but you'll notice that they both use the same amount of peak memory, because we haven't changed the model. What we want to do is find techniques that move us towards the bottom left of this graph, to have low peak memory usage and lower training time. And we want to do all of this without negatively affecting the model
accuracy. So the first thing we can do is use a technique called parameter-efficient fine-tuning. The basic idea there is to avoid updating every single model parameter when you're fine-tuning. You can freeze large parts of the model weights, not update them at all, and focus only on a subset of the weights during fine-tuning. This has a couple of benefits. If you look on the left, this is the normal operation of training: first you do a forward pass, then you compute gradients, and then you update the weights using those gradients. The gradients and the weights have to be moved in and out of memory, which sets the amount of memory you need and the time it takes to perform the training. Stored in the memory are the weights, the gradients, and other optimizer states.

Now, in a frozen model you still have to perform the forward pass, but for a large portion of the parameters in the model, the frozen parameters, you don't actually have to compute the gradients or update the weights. That means I reduce the amount of data that has to go back and forth between the compute and the memory, and I don't have to store all the gradients and optimizer states, so this reduces the peak memory quite substantially. Will it reduce the training time? A little bit, but since you still have to do the forward pass, the reduction in time is not really significant. It's really the memory savings that you're achieving
here. And the most popular way to do this is a technique called LoRA, low-rank adaptation. There you freeze most of the model. It's easiest to think of freezing in terms of freezing layers, and some of the layers, for example the attention layers in a transformer, will be the ones that are fine-tuned. When you look at a layer that is not going to be frozen, what you do is take the linear layer, which is an m-by-n matrix, freeze that as well, and add additional parameters in parallel to that linear layer. It does involve adding a few extra parameters, but not too many, because this parallel path is an m-by-r matrix multiplied by an r-by-n matrix, where r is a very small number. So you have a small number of additional parameters, but in terms of training it makes things much, much easier. In inference, you're going to use both of these layers. This will result in fewer memory accesses and smaller peak memory. And what is the effect of doing
parameter-efficient fine-tuning like LoRA? Well, the peak memory is considerably reduced. You can see on this graph that, compared to standard fine-tuning, fine-tuning with LoRA reduces the peak memory. The training time is also slightly reduced, but not by a lot. So now we're asking the question: how do we further decrease the training time?

To do that, we need to look at the other part of the equation, which is the data you're training on. When you look at a data set, which is a collection of data samples, you realize that not all the samples in the data set are equally helpful. Some of them are not helpful at all. For example, some data points may be mislabeled, and they can actually be misleading to the model; it may actually hurt your model to train on them. Some of them are very, very easy, what we call easy data points, which don't add any information the pre-trained model doesn't already have. On the other hand, some are very, very difficult, and training on them can damage your model. So we really would like not to train on those kinds of data points; they're unwanted. If we can identify them and prune them away, we can get a fine-tuned model that is more accurate. But also, if you think about it, if we don't train on a certain number of data points, then training is faster. So it has the dual benefit of potentially improving the accuracy while also reducing the training time, and the training time is really the aspect we want to look at here.

So in data set pruning, you want to find which data points you don't want to train on by evaluating some score function. And so we've looked at the design of score functions to look at each data point in the data set and see if we can prune it
away. Our score function, which we call the H-score, works like this. You do some fine-tuning for a few epochs, and you look at whether the classification for each data point was correct or not. So we look at the ground truth and see whether the model classified the point correctly. If the model classifies a data point correctly across all epochs consistently, we give that data point a score of one. Then we repeat this for six different seeds and add up the score over the seeds. So data points that are consistently giving the correct answer end up with a score of six; on the other hand, data points that are always consistently giving the wrong classification get a score of zero. So scores range between zero and six.

Data points with score six are really, really easy: they're always getting the right answer, and the model probably already knew it, so you don't need them and can prune them away. Data points with score zero are very difficult; you don't want those either, so you prune them away too. In the middle are the ambiguous scores, and those are the data points we keep and train on.

So the training time is now reduced in proportion to how much of the data set is pruned away. In our experiments, we're pruning away 70 to 80% of the data set, so we're left with maybe 20 or 30% of the original data set. This can give significant decreases in training time.
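The scoring scheme just described can be sketched in a few lines. The function below is a toy reconstruction from the talk's description, with names and array shapes of my own choosing, not the authors' code: a point earns one per seed only if it is classified correctly at every epoch of that run, the indicator is summed over seeds, and only the ambiguous middle band is kept.

```python
import numpy as np

def h_scores(correct):
    """Toy H-score from recorded per-epoch correctness.

    `correct`: bool array of shape (n_seeds, n_epochs, n_points); True when
    the model classified that point correctly at that epoch of that run.
    A point scores 1 for a seed only if it is correct at *every* epoch of
    that run; scores are summed over seeds, giving 0 .. n_seeds.
    """
    correct = np.asarray(correct, dtype=bool)
    n_seeds = correct.shape[0]
    per_run = correct.all(axis=1)             # consistently correct within one run
    scores = per_run.sum(axis=0)              # sum over seeds
    keep = (scores > 0) & (scores < n_seeds)  # prune the too-easy and too-hard extremes
    return scores, keep

# Toy data: 6 seeds, 3 epochs, 3 points (always right, always wrong, mixed).
always_right = np.ones((6, 3, 1), dtype=bool)
always_wrong = np.zeros((6, 3, 1), dtype=bool)
mixed = np.ones((6, 3, 1), dtype=bool)
mixed[3:, 0, 0] = False                       # fails one epoch in runs 3..5
correct = np.concatenate([always_right, always_wrong, mixed], axis=2)
scores, keep = h_scores(correct)
print(scores.tolist(), keep.tolist())  # [6, 0, 3] [False, False, True]
```

Only the third, ambiguous point survives the pruning, matching the keep-the-middle rule described above.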
So you can see that we've now talked about two methods: LoRA, which can reduce the peak memory usage, and data set pruning using the H-score, which can reduce the training time. What we'd like to do now is see whether we can combine both of these techniques to drive us closer to the bottom left, with low memory usage and low training time. The proposed method does both. You take the pre-trained model and apply low-rank adaptation to come up with a parameter-efficient model. That model is then used with the fine-tuning data set to compute an H-score. With the H-score, I can do data set pruning, apply the pruned data set to the LoRA model, fine-tune to come up with my fine-tuned model, and then evaluate it.

These are the results of evaluating the two techniques, individually and combined. Compared to the baseline, which we'll say has a normalized speedup of one and a peak memory of about 10 gigabytes, LoRA by itself has a limited speedup of 1.2 times but a significant reduction in peak memory usage. Data set pruning using the H-score by itself has a significant speedup of over four times, but of course doesn't reduce the peak memory. So, as we hypothesized, the experiments showed that when you combine both of these techniques, on average you can get over five times speedup and also enjoy the significant compression of memory.
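As a concrete illustration of the LoRA half of this combination, here is a minimal NumPy sketch of a frozen linear layer with a trainable low-rank bypass. The class name, shapes, and initialization are illustrative assumptions, not the speaker's implementation; real LoRA code typically also scales the bypass by a factor alpha over r.

```python
import numpy as np

class LoRALinear:
    """A frozen m-by-n linear layer with a trainable rank-r bypass.

    Forward pass: y = x @ W + x @ (A @ B), where W (m x n) is frozen and
    only A (m x r) and B (r x n) are updated during fine-tuning. Sketch
    only; a real implementation also applies an alpha / r scaling.
    """
    def __init__(self, W, r=4, seed=0):
        rng = np.random.default_rng(seed)
        m, n = W.shape
        self.W = W                                 # frozen pre-trained weights
        self.A = rng.normal(0, 0.01, size=(m, r))  # trainable, small random init
        self.B = np.zeros((r, n))                  # trainable, zero init: bypass starts as a no-op

    def forward(self, x):
        return x @ self.W + x @ self.A @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size           # r*(m+n), versus m*n for full fine-tuning

layer = LoRALinear(np.ones((1024, 1024)), r=8)
print(layer.trainable_params(), layer.W.size)  # 16384 trainable vs 1048576 frozen
```

With r = 8, the trainable parameters are well under 2% of the frozen matrix, which is where the gradient and optimizer-state memory savings come from.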
Now looking at accuracy, what we've done here is examine the two techniques individually and combined, and we've also added a comparison with random pruning, which doesn't use the H-score. The reason we include random pruning is that in the regime of significant data set pruning, 80% or more, random pruning is actually state-of-the-art: it's better than the other scoring functions that have been proposed. When you're pruning less aggressively, there are other techniques that can be used, but in this highly aggressive regime, random is very good. So the question is: does the H-score do better? It does, and in fact it's necessary to get excellent performance. What we can see here is that the accuracy overall is actually improved slightly by using H-score pruning, and LoRA helps as well. So the combination of the two is very effective, and because of regularization effects it either doesn't hurt or slightly improves the accuracy on a model like RoBERTa-large, which has 350 or so million
parameters. Finally, I just wanted to introduce one additional set of experiments, on continual learning. What is continual learning? It's the scenario where I fine-tune on two different tasks consecutively. I have a pre-trained model, I fine-tune it on task one, and I end up with a model. Then I take that fine-tuned model and fine-tune it on a different task, task number two. Now I have a model that's been fine-tuned twice, and what tends to happen is that when you fine-tune the second time, the model forgets how to do the first task; it can be damaged. So we want to evaluate whether these techniques of data set pruning and LoRA help mitigate the forgetting of the first task when fine-tuning more than once, and the answer is yes, they do. What we'll do is take the final fine-tuned model, the one that's been trained on tasks one and two, evaluate it on both tasks, and see the effect of applying these techniques.
and we'll see the effect of of uh applying these techniques.
applying these techniques. So you can see in the first line where
So you can see in the first line where there's there's uh no modification just
there's there's uh no modification just there's two scenarios one where we task
there's two scenarios one where we task number one is called MNLI and then we
number one is called MNLI and then we train it on QNLI and there's also this
train it on QNLI and there's also this other scenarios C2MDB
other scenarios C2MDB uh you can see that the the first model
uh you can see that the the first model the first task MNLI in this case is
the first task MNLI in this case is dramatically damaged by fine-tuning on
dramatically damaged by fine-tuning on QNLI so what we can see then is by
QNLI so what we can see then is by applying Laura you can get back some of
applying Laura you can get back some of the for forgetting. So you can mitigate
the for forgetting. So you can mitigate some of the damage but not a lot.
some of the damage but not a lot. Applying data set pruning then is key to
Applying data set pruning then is key to actually rebuild the performance on that
actually rebuild the performance on that first task. Uh and compared to random uh
first task. Uh and compared to random uh we find the age score actually does u
we find the age score actually does u does a little bit better. And so it it
does a little bit better. And so it it really uh shows us that the combination
really uh shows us that the combination of these two techniques not only gives
of these two techniques not only gives you um uh more efficient fine-tuning in
you um uh more efficient fine-tuning in terms of memory reduction and training
terms of memory reduction and training time but also can help in continual
time but also can help in continual learning scenarios by mitigating damage
learning scenarios by mitigating damage to the the first uh task.
So in conclusion, we've looked at ways to reduce the peak memory usage and training time of fine-tuning large language models. The first technique, LoRA, which is not ours but which we've evaluated, is very effective mainly at reducing memory usage, and our proposed data set pruning using the H-score greatly reduces training time. What we've done is combine both of these together and shown that the combination is very effective: we achieve an over five times speedup and a 40% peak memory reduction. I've also shown that these two techniques in combination are very effective at mitigating the forgetting of the first task in a continual learning setup. So that's my presentation, and I want to thank you very much for your attention.
Thank you, Warren, for this great presentation and great research. I think this is a much-needed capability going forward, to reduce the training time and the memory usage. It will also reduce power in the data center when we train models. So, and I think this is not a technical question: what are the key tools and hardware required to enable research in parameter-efficient fine-tuning and data set pruning, and what are the main pain points you face in advancing this field? Right, that's a good
field? Right. So the that's a good question. Thank you very much. So the
question. Thank you very much. So the main um bottleneck in terms of tools for
main um bottleneck in terms of tools for this kind of research is available
this kind of research is available computational resources. So getting
computational resources. So getting enough GPUs or NPUs we using GPUs to do
enough GPUs or NPUs we using GPUs to do the training is um is very significant
the training is um is very significant uh bottleneck especially because in data
uh bottleneck especially because in data set pruning in order to compute the um h
set pruning in order to compute the um h score we need to do multiple training.
score we need to do multiple training. So this is this is um uh quite a pain
So this is this is um uh quite a pain point. And um the second uh challenge we
point. And um the second uh challenge we have is partly related to the resources
have is partly related to the resources we have and also partly related just to
we have and also partly related just to coming up with good techniques is how to
coming up with good techniques is how to scale this to very large language models
scale this to very large language models which we have not done yet although in
which we have not done yet although in this current ongoing work we're looking
this current ongoing work we're looking at how to apply this to much larger
at how to apply this to much larger class LLM. Yeah, this was my next
question, thank you. Okay, thank you, Warren. Due to time limitations, we're going to go to our next speaker, but please do not hesitate to ask the speakers questions directly in the chat; they will be pleased to answer them. Up next is Borak Kmak, CTO of Edge Signal. With over 18 years of experience leading global product development in edge computing and cybersecurity, Borak brings both technical and strategic insight. He'll be speaking on implementing generative AI in edge environments: challenges and solutions. Let's hear from Borak. Borak, the stage is yours.

Thank you so much, Yasin. So, let me share my screen. Is that visible now?

It's visible, and we hear you fine. Thank you.

Okay.
So, being the last speaker of this first section, some of this will be a little repetitive, because some of the things I want to discuss were already well presented by the prior speakers. I will go through those parts a little faster and get to the real challenges we are currently facing in customer environments. So here, the reason we wanted to take advantage of edge LLMs was pretty obvious. Of course,