Building Frontier AI Products with Fin x Cognition x Harvey AI x Perplexity
Okay, welcome everyone. Thank you so
much for joining us online and in person
here in San Francisco. You want to come
start taking your seats? We'll get
started. Uh we'll just get started in a
minute. Um we've got a great evening for
you tonight. We've got talks and
technical presentations and a panel
discussion with leaders from Harvey,
Cognition, Perplexity, and Fin, all on
the theme of building great frontier AI
products. Um, I'm Jordan Neil, SVP of
engineering from Intercom. I'm going to
be your MC tonight. But to get us
started, I'm going to intro uh
Thank you, Jordan. And good evening to
everyone. Welcome to all of you here in
SF and to the thousands on the live stream as well. I'm Des, uh, co-founder of Intercom. We're the company behind Fin. I'm guessing you
gathered that or I'm hoping you've got
that by now. Um we have a really fun
evening ahead. We've got pretty much the
entire, uh, leadership team of Intercom, CEO, CPO, CTO, all the Cs basically, are
here along with me. Um we wanted to
gather a group of great companies and
great people to talk about frontier AI
products. What really what we want to
talk about is like what it means to be
working on the actual edge, like really
pushing things forward. This wave of AI
that we're in is still kind of quite
young and it's very fast moving.
Companies are blowing up in like the
good way and blowing up in the bad way,
too. Um, for us, we're betting really
hard on AI. Uh, it's basically the
future of our entire business. Fin, at its pretty young age, already has over 6,000 paying, happy customers.
It resolves over a million conversations
a week. By all the data we have, it's
the highest performing agent that's out
there. And it's also the fastest growing
thing that any of us in the company have
ever worked on, ever by a massive
margin. So yeah, we're all locked in on
AI. We're all locked in on Fin. When we look at the leading AI companies, of which we'd humbly submit Fin as one, what
we see is this like innovation at all of
the levels, right? In software, so many
of us are so used to this world where
like if you're working on something, it
shows up in the product as UI that you
can look at and point at and click and
play with. We're far less used to
talking about these like subterranean
improvements, these ways in which the
product gets a lot better, but you don't
see anything change. What happens is the
users realize after, you know, one
update or whatever, this thing's
working really well now. And those
improvements tend to be slightly more
invisible because they're happening at
the AI layer or any of the layers
beneath. That's where so much of the
magic of products like Fin, Harvey, Devin, Perplexity, that's where it
happens. It's at this AI layer through
optimizations, through rearchitectures
or even deeper again at the actual model
layer itself. That's where groups like
our AI group, which is about 50 strong. A lot of the folks here doing the poster sessions are part of it as
well. That's where they spend their
time, right down at all the levels,
finding all of the edges, all of the
ways to make a truly great AI product.
So, as I said, we wanted to pull
together a great group of people, a
great group of companies and a crowd to
basically talk about what it means to go
further, to go harder, and to go deeper
when you're building AI to really push
the envelope. That's what tonight's
about. And to kick us off with an
opening keynote, I'm really excited to
hand over to our chief AI officer, Mr.
Thanks very much Dez and thank you so
much everybody for coming here today. Uh
so my name is Fergal, um, as Des said I kind of head up AI at Intercom and I'm here to
talk about creating value at the AI
layer and kind of share how we think
about this important topic because you
know AI product strategy is hard right
it's a very dynamic space the rules
change all the time you think you're
building something that's really
valuable and then something changes at a
layer below you and suddenly you have to
revisit all your
assumptions and so you know you don't
want to spend time building things that
aren't valuable. And so, you know, we
really think about this idea that like,
hey, there's an AI layer. And you're
probably familiar with the idea of very
commonly, you know, there's an
application layer, right? There's the UI
and the UX of the the product people
use. And of course, there's the model
layer, the the LLMs deep down underneath
that tend to power things. But really,
there's a lot to do at the AI layer that
we spend a lot of time on, too. You
know, prompts, orchestration, logic,
context. And we really think it's
interesting to sort of look at different
products and kind of taxonomize them
using this lens. And you know, back when ChatGPT first came out two years ago, there was a lot of talk about thin wrappers. And really, the first ChatGPT experience itself was kind of a thin wrapper, where I would say the application layer is pretty thin. There's an AI layer, but it's also thin. Really, you're very nakedly talking to the model when you play with that kind of ChatGPT version one, and of course this has changed over time. I
think now you know another way of
taxonomizing things or another set of
products is sort of the AI-enabled application, right? And a good example of that might be, you know, the early version of Copilot in VS Code, for example, right? A big application that people have been building for decades, but then with, like, a pretty thin AI integration, and, you know, with a very small AI layer above the model there. And I
think you know as time has passed you
know two years in you're starting to see
more and more things that are are what
maybe we might call an AI native
application right where there's an
application layer and maybe it's a
little bit relatively smaller and
there's a big AI layer there's a lot of
engineering around the models and you
know Fin, our product today, uh, would probably fall into this category, where there's a very complex kind of RAG layer above the models that I'll talk about, uh, in a minute. And as AI has kind of matured, I think we're seeing more and
more products that look a little bit
like this. And and that's great, right?
If you're building a product, this is a
very nice clean narrative to be able to
tell. But I think something interesting
happened recently, uh, with Claude Code, which some of you may have seen. In case anyone hasn't seen Claude Code, it's sort of a very kind of command-line, terminal-based, um, kind of application that people use to write code. And it
was it was just quite interesting
because it was really the model there
kind of striking back, right? It really
looks like, you know, a a relatively
thin application and a relatively thin
AI there and the model doing a lot.
Looks a lot like the thin wrappers of two years ago. And this is kind of interesting. And you know, Claude Code has
a sweet spot. It doesn't do everything,
but it's influential. Suddenly it
spawned many clones and it's quite quick
to clone because it's relatively thin at
the layers above the model. Um, and I
often think, like, what does someone like, you know, JetBrains, who have made IDEs for decades, think when they see
something like this, right? A part of
their application surface has suddenly
been, you know, very quickly
commoditized and very quickly made thin
by something like this. I think it's
very dangerous for application companies
to to look at this sort of trend and if
this goes and so obviously we have a
thesis here, right? We think that like
you know there's generally a lot of work
to do above the model layer but we kind
of think that AI companies they can't
ignore the risk of the model layer sort
of coming up quite suddenly and you know
making some of their investment um
suddenly outdated.
So really we think that anyone building
deep AI applications you have to have a
plan to build durable value around the
models and at the AI layer. And we we'll
argue today that it's it's possible to
do this and share a little bit about
what we're investing in and why we think
this is this is something that companies
like us can deliver on. But first I want
to share a little bit about our context
and our history because it's it's kind
of important for for the direction we're
going. Uh, so Intercom, we make a customer support platform, uh, an inbox where, you
know humans go and answer customer
support questions. And you know back in
2018 we started to build this product we called Resolution Bot, which was, like, you know, a previous-generation, I guess, AI chatbot. Um, you know, humans would have to go and, like, set up and configure intents and then define, like, how it
would answer a question. We did a lot of
work to make this as seamless as
possible. Um, you know, back then our tech stack was very, uh, you know, BM25 meets word2vec. This is like a slide from way back when. It's like a hand-engineered sort of, you know, um, information retrieval, word2vec thing, um, that we actually wrote. And you know, this used to be the previous generation, uh, 2019.
We were an early adopter of AI, and I went and did some spelunking through our GitHub for this, and I think there's one thing I want to highlight here, which is, um, you know, we were putting neural networks in production. We put MUSE in production pretty soon after it came out, and we really ended up building sort of a proto vector DB, with cosine similarity and everything, back then, to kind of power our previous-generation bots.
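To make the idea concrete, a proto vector DB of this kind reduces to storing embeddings and ranking them by cosine similarity. The sketch below is illustrative only; the class and method names are hypothetical and this is not Intercom's actual implementation.

```python
# Minimal sketch of a "proto vector DB": store document embeddings and
# retrieve the nearest ones by cosine similarity. Purely illustrative --
# names and shapes are hypothetical, not Intercom's actual code.
import numpy as np

class TinyVectorStore:
    def __init__(self):
        self.vectors = []   # list of np.ndarray embeddings
        self.payloads = []  # the documents they correspond to

    def add(self, vector, payload):
        self.vectors.append(np.asarray(vector, dtype=np.float32))
        self.payloads.append(payload)

    def search(self, query_vector, top_k=5):
        query = np.asarray(query_vector, dtype=np.float32)
        matrix = np.stack(self.vectors)
        # Cosine similarity = dot product of L2-normalized vectors.
        matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        query = query / np.linalg.norm(query)
        scores = matrix @ query
        best = np.argsort(-scores)[:top_k]
        return [(self.payloads[i], float(scores[i])) for i in best]
```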
So we were kind of in this space, and we were well equipped when the world changed with ChatGPT, um, you know, to start building Fin. So I'll quickly touch on the story of Fin. We'd kind of been watching this space for a while. This is an internal memo I wrote about Google LaMDA, which a lot of us had seen at the
time and was sort of a I guess a
precursor to, uh, ChatGPT. And you know, we were kind of looking at this and watching, and then suddenly ChatGPT came out, and we moved very fast on it and we built Fin, which we considered sort of a breakthrough AI agent. We launched Fin powered by GPT-4 on GPT-4 launch day. Um,
it was RAG from the start, and we had already started investing in RAG, um, I guess from maybe January, uh, 2023. Um, one of the first production RAG systems in customer experience, and maybe ever, you know, I'm not sure, and we didn't know it was called RAG at the time. And really, what we invested in in '23 to '24: we invested a ton in prompt engineering, right? I think, like a lot of people at the time, you had to do a lot of work with the models of 2023 in order to get good results out of them, so you do a lot of prompt engineering. We invested a ton in our RAG pipeline, in sort of, like, trying out different types of retrieval strategies, different types of chunking, things like that. And then we
did a whole lot of testing and
optimization. And over time, through
experimentation and gradual refinement,
we ended up with an architecture that looked a little bit like this. This is kind of the Fin architecture from a few months ago. Um, I'm not going to talk you through it all, don't worry. But it's just to show that this is where we got to, optimizing and A/B testing bit by bit, um, to build quite a complex, industrial-strength product with a whole lot of different pieces. You know, we were summarizing issues, uh, doing custom retrieval, etc., etc.
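For intuition, the overall shape of such a pipeline, stripped of all the production pieces, looks roughly like the sketch below. Every component is passed in as a placeholder callable; none of this is Fin's actual code.

```python
# Rough shape of a RAG answer pipeline: canonicalize the user's issue,
# retrieve and rerank knowledge chunks, then ask an LLM to synthesize an
# answer grounded in them. All callables are placeholders, not Fin's code.
def answer_question(conversation, summarize, retrieve, rerank, generate, top_k=8):
    issue = summarize(conversation)                 # issue summary step
    candidates = retrieve(issue, 50)                # broad first-pass retrieval
    chunks = rerank(issue, candidates)[:top_k]      # order matters to the LLM
    prompt = (
        "Answer the question using only the documents below.\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {issue}"
    )
    return generate(prompt)
```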
And really, at that point, in that sort of 2023 to 2024 timeline, you know, what we were really investing in was this experimental culture: always A/B test everything in production. And it was always deeply unintuitive, um, whether a new thing we were trying was actually going to improve the product or degrade it. And so we had to, like, A/B test everything. And of course we built a bunch of product features around core Fin, kind of helping the rest of the Intercom org to do that. And, uh, there was a lot of pain and suffering there, as we were a big SaaS company and we were sort of, like, learning how to build AI with probabilistic systems, and Molly is going to give a little bit of a talk about pitfalls there. But overall, you know, we're proud of where we got to.
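The statistical core of "A/B test everything" can be as simple as a two-proportion z-test on resolution rate. The sketch below is standard statistics, not Intercom's actual experimentation tooling; the numbers in the example are invented.

```python
# Two-proportion z-test on resolution rate between a control (A) and a
# treatment (B) variant. Standard statistics, not Intercom's actual stack.
from math import sqrt
from statistics import NormalDist

def resolution_ab_test(resolved_a, total_a, resolved_b, total_b):
    p_a, p_b = resolved_a / total_a, resolved_b / total_b
    pooled = (resolved_a + resolved_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
    return p_b - p_a, p_value

# e.g. is a ~1-point lift significant at this sample size?
lift, p = resolution_ab_test(5600, 10000, 5700, 10000)
```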
It worked out pretty well. This is probably the chart that I am most proud of, um, in my time, um, at Intercom building Fin, which is the chart of Fin's resolution rate over time. And very weirdly, it kind of grows almost in a Moore's-law-style way. Each month we're like, we've got a bunch of things we can try, and probably it won't work, probably we're asymptoting out in terms of the quality, but each month we get, on average, about a percentage point of end-user-defined resolution improvement. And, um, it's working pretty well. We've got about $50 million of ARR at the moment just from Fin, and we're on a good trajectory, if it holds, to 100 million in a couple of quarters. So, you know, Fin today, as I was saying, it really has a sort of a mature AI layer powering it, right? There's the models, but then there's a ton of stuff around the models.
But how do we take that to the next
level? And this is really what I'd like
to talk about today, which is what we've
been doing over the last sort of six
months. And again, we're wary about the
model there coming up. We really want to
spend our time building durable
differentiation. It it's easier today to
build something competitive with Finn
than it was a year ago and certainly
than it was two years ago. So what do we
do? How do we take it to the next level?
One thing we spend a lot of time
thinking about is this quote from Alan Kay, right? People who are serious about software should make their own hardware. It's quite an influential quote at Apple. It's quite a cool quote.
How does this apply to the AI era?
Right. And you know, we spent a lot of
time thinking about this and I think as
Dez said earlier, we really have an
emerging thesis that you've kind of got
to go quite deep into the AI layer to
build the best products. And so I'm
going to talk about the kind of the
results of our deep investment here. And
again, you know, we still do of course use LLMs from the frontier labs: Anthropic, for example, an excellent model; we partner with OpenAI for voice. But I'd like to tell you about some of the work we've been doing ourselves, because we think this is, well, we speculate. We don't know, right? We
don't have a crystal ball, but we think
this is a template for what a lot of AI
applications will do over time. I'm
going to talk at a high level about some
of the things we've been doing recently.
Um, if you want to get more technical
details, we're releasing a whole lot of
blog posts today where we're really
sharing a lot of technical information
about the work we've been doing over the
last six months. And of course, we've
got poster presentations here, too. So
one of the first things we set out to do
is is to build a custom reranker. I'm
going to share a little bit about our
journey doing that. So what's a
reranker? Um, you know, Fin, as a RAG application, um, you know, goes and tries to answer a question using a whole bunch of chunked knowledge, uh, typically from your help center or your other sources of documents. And we've discovered over
time that the performance of an LLM in actually answering a question is super sensitive to the exact set of documents that we retrieve, and even, weirdly, the order in which we retrieve them and present them. And so
building a reranker can be really
impactful on the performance of the
system overall. And so, you know, we've ended up building our own custom reranker over the last six months, using ModernBERT as a building block and, uh, training it on a whole bunch of data from Fin. And I guess one thing we're sharing that, you know, somewhat surprised us is that, for our use case, um, our own reranker here has outperformed the previous best-in-class reranker we were using, which is Cohere Rerank 3.5, and improved answer quality, and it reduced our costs by a lot. And this is going to be a consistent theme: I'm going to talk a lot about, you know, how moving to custom models has improved performance but also decreased cost a lot, increased efficiency, and decreased latency.
We sort of think that this is an emerging pattern that probably a lot of people are going to follow. Um, some back-test results here: really quite surprisingly large gains, um, from moving to, uh, our own model trained on our own data.
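For readers unfamiliar with rerankers, the inference side of a cross-encoder reranker looks roughly like the sketch below: score each (query, chunk) pair with an encoder plus a classification head, then sort. The checkpoint name and the single-logit head are assumptions; Fin's model is additionally fine-tuned on its own resolution data, which this sketch omits.

```python
# Cross-encoder reranking sketch: score (query, chunk) pairs and keep the
# top-k. The checkpoint name is an assumption (a recent transformers version
# is needed for ModernBERT); the fine-tuning step is not shown here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "answerdotai/ModernBERT-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def rerank(query: str, chunks: list[str], top_k: int = 8) -> list[str]:
    inputs = tokenizer([query] * len(chunks), chunks,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)  # one relevance score per pair
    order = torch.argsort(scores, descending=True)[:top_k].tolist()
    return [chunks[i] for i in order]
```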
Um, we've also built a custom retrieval model over the last while. We call it the FinX retriever. Um, you know, again, this was initially trained on 300,000 real user queries, each of which had a hard resolution. And a hard resolution is
when someone affirmatively says yes
thank you Finn this has positively
answered my question. And, um, you know, our ability to have, you know, a high-performing AI application, we think, gives us an advantage that's proving out when it comes to training our own models here. And, um, our retriever model, which was a fine-tune of a Snowflake model that had a pretty good, uh, performance-cost envelope, um, has performed very well. It has outperformed the previous competitive retrieval models that we were using. And I highlight this
because I think there's something
interesting here in this sort of uh
second, um, box, which is that we did an experiment here where we said, like, hey, how good is the retrieval model for the applications it's trained on versus how good is it cross-application? You know, what is it learning in terms of learning to be a retrieval model for customer experience generally? And, obviously, within application it's the best, but it also does generalize pretty well out of sample across apps, which was, uh, better than we expected. Um, you know, I guess
the theme here is that you know there's
still a lot of room, uh, for optimizing and for improving by training on very broad, um, you know, application-specific data.
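One way to picture how hard resolutions become retriever training data: each confirmed-resolved conversation yields (query, helpful chunk) positive pairs, with other chunks available as negatives. The field names below are hypothetical; the actual log schema and training recipe are not shown.

```python
# Turn hard-resolution logs into (query, positive_chunk) pairs for retriever
# fine-tuning. Field names are hypothetical, not the real log schema.
def build_training_pairs(conversations):
    pairs = []
    for convo in conversations:
        if not convo.get("hard_resolution"):
            continue  # keep only conversations the end user confirmed as resolved
        query = convo["issue_summary"]
        for chunk in convo["retrieved_chunks"]:
            if chunk["cited_in_answer"]:
                pairs.append((query, chunk["text"]))  # positive pair
            # non-cited chunks can serve as hard negatives during training
    return pairs
```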
We've also built a custom issue summary model that powers a part of Fin called Fin CX Summary. I think this is a good example of the emerging small-LM hypothesis, essentially that small language models can sometimes perform pretty well when trained for a specific task. Nvidia wrote a paper about it recently that's been quite influential. So, you know, within Fin, one
thing we've always done is we have
summarized the end user's query before going to do the RAG thing, because sometimes end users ask very, uh, very strange things. You know, there can be a lot of noise and a lot of weirdness before they actually ask a question, and exposing that raw to your RAG system has not been as performant for us as first canonicalizing that query. And so we've always had an issue summary layer, and this has been our attempt to kind of build our own model. We had an LLM, typically GPT-4.1, for that, as a kind of fast, performant model. Um, but we wanted to experiment with, uh, with improving it.
And one thing that's always been tricky
about issue summary is, if you say to an LLM, hey, summarize this issue, and there's no issue there, uh, they often will get confused. And so we've always had a few-shot approach to solving this problem, where we kind of say, "Hey, if there's just a greeting, or if it's just a goodbye, or if it's just, like, a negative reaction, please don't summarize the issue." We've kind of always prompted the LLM in a few-shot way to improve its performance. But I
think as anyone who's run a production
LLM at scale for a while knows, this runs into a problem: every time you battle-test an LLM and few-shot it in this way, you end up with many, many few-shot examples, and, you know, your latency goes up and your performance goes down. And so this is an
area we experimented with kind of a
custom approach. And our sort of key insight on this problem was to split the task: first, train a classifier that does a good job of figuring out whether there actually is an issue there or not. And then, if that classifier said, "Yeah, there is still an issue here," then to, uh, use a small LM to then summarize the issue. And so, uh, you know,
this is a kind of a schematic of the
classifier piece. Again, ModernBERT's been a great building block here, um, trained on, you know, data from our experience. And then we have, uh, a LoRA fine-tuned, um, Qwen 14B, uh, which was good enough to, uh, equal, in our evaluations, uh, GPT-4.1 for this summarization task. And, um, then, you know, we get a more performant model. But kind of more importantly, um, you know, there's only a slight increase in resolutions here, and a cost reduction, but more importantly, it improved the quality of our product. Um,
moving something from the LLM to a proprietary system that you control, you don't do it lightly, but it does enable you to get more fine-grained control and more fine-grained tuning of it.
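The split-task idea can be pictured as a two-stage pipeline: a small classifier gates whether there is an issue at all, and only then does a small LM produce the summary. The function below is a sketch with placeholder callables standing in for the fine-tuned classifier and the LoRA-tuned model; it is not the production code.

```python
# Two-stage issue summary sketch: classify first, summarize only if needed.
# classify_has_issue and small_lm_summarize are placeholders for the
# fine-tuned ModernBERT classifier and the LoRA-tuned small LM.
def summarize_if_issue(message, classify_has_issue, small_lm_summarize):
    label = classify_has_issue(message)   # e.g. "issue", "greeting", "goodbye", "reaction"
    if label != "issue":
        return None                       # nothing to summarize, skip the LM call entirely
    return small_lm_summarize(message)    # canonical issue summary for the RAG pipeline
```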
Um, I've got two more to go, and then I'm going to tie this into an overall narrative and then take questions. We built a custom escalation detection model for Fin as well. Um, escalation is when somebody is getting fed up with Fin and they want to talk to a human; it's just not helping them. It's very important for delivering a high-quality end-user experience. But it's a difficult thing to build a machine learning model for,
because our different customers provide
guidance in the form of free text input
that kind of defines a policy for when
Fin should escalate and when it shouldn't. And, um, sometimes, based on the guidance, the guidance will be like, definitely escalate if this happens. Sometimes it'll be like, definitely just give an answer. And then sometimes there's a gray area in between. And so we spent a while working on this. We tried, you know, a Gemma fine-tune. We tried Qwen models of different sizes, and that was good, but, um, we kept pushing to see if we could find a smaller,
better model. We actually ended up, uh, training, sort of, again, um, a custom model using an encoder backbone, using ModernBERT as a building block, with multi-class classification on top of it: multiple different, uh, classification heads. And, um, this worked out really well for us. In the end, we got a resolution rate increase, uh, latency decreased by about half a second, cost per resolution decreased by 3%, and we got finer-grained control. And it's like,
you know, each one of these models is is
incremental. It's like 3% here, 5%
there. But it all adds up. You know,
when you do it at scale, h you start to
end up with an application that starts
to look a bit differentiated. And the
last thing I have to talk about today is
we also built a feedback model. Right? Feedback is tough, right? If you want to have a product like Fin and it's dealing with real users, it's not easy to extract feedback from that. So, you know, an example here that can kind of build that intuition: Fin might say, "Oh, yeah, to cancel your subscription, you need to do X, Y, and Z. Was that helpful?" And the kind of thing real people say back is, like, "Yes, but apparently my email is also wrong. What should I do next?" And this
is a really hard challenge for a machine
learning system. It's getting easier,
but traditionally it's hard. We ended up
building sort of a multitask
architecture to do that where we had
like three different classification
heads. One was for like feedback. Is
there no feedback? Is it positive and
negative? Another one, has the user got
a follow-on question? And a third
classification head was, you know, have
they ended the conversation or not? With sort of a shared ModernBERT layer. And, uh, this worked really well for us. We once again got, like, a smaller, more efficient model with really high overall accuracy, and we have several other initiatives like this in flight.
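The multi-task shape described here, one shared encoder with three small classification heads, looks roughly like the sketch below. The checkpoint name, pooling choice, and head sizes are assumptions for illustration.

```python
# Multi-task feedback model sketch: shared encoder, three classification heads.
# Checkpoint name, pooling, and head sizes are illustrative assumptions.
import torch.nn as nn
from transformers import AutoModel

class FeedbackModel(nn.Module):
    def __init__(self, backbone="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.feedback_head = nn.Linear(hidden, 3)  # none / positive / negative
        self.followup_head = nn.Linear(hidden, 2)  # follow-on question or not
        self.ended_head = nn.Linear(hidden, 2)     # conversation ended or not

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]       # first-token ([CLS]-style) pooling
        return (self.feedback_head(pooled),
                self.followup_head(pooled),
                self.ended_head(pooled))
```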
So that's sort of a whistle-stop tour of some of our investments. Um, but I guess, like, why am I sharing this? There's a thesis here that I want to deliver: in our experience, you can get really good
performance by replacing LLM calls with
more special-purpose models. We still use Anthropic for our hardest LLM task of, like, actually synthesizing an answer to a question from the RAG content. But all the other
pieces of Fin, like, they all work
together to add up to a good product
experience. We've been able to improve
our business metrics, improve our
resolution rate, improve our margin
substantially, and then also get much
more fine grained control in terms of
like the quality product uh the quality
tradeoffs and metrics. That's one
takeaway. We're pretty happy with how
this has worked out. Data from a high
performing Frontier application has
turned out to be a very valuable
building block for us, more so than we
anticipated, right? And I guess that's
one thing we're sharing is like, hey,
um, our production data turned out to let us make a relatively small investment and get really massive returns. Like, it really surprised us that our model, the reranker in particular, turned out to be better than Cohere. Um, and you know, Cohere is still a great model. Um, but, uh, certainly for Fin, and even cross-customer, even out of sample on a per-customer basis, um, it worked really well. So we kind of think that there's
an emerging pattern here that we would
suggest to anyone with a deep AI
product. You know, start out build your
product expensively, get the quality
really great, stabilize the product, and
then go and optimize it. And there's a
lot of room for optimizing over time.
And this is how we believe in in trying
to add sustainable value um at the AI
layer. And that's really what we're
doing. And we're sharing a lot of this information for the first time today, because in the past we've had a habit of building really good technology and, like, not talking about it much, and we're trying to change that and talk about it a lot. And so we have just published a series of blogs, if you want to get into a lot of technical detail on each one of these things I've talked about, um, available on the Fin AI research site, and obviously we have technical
presentations here. We're really
thankful to such a great audience for
coming out and I would love to answer
any of your questions uh briefly. Thank you.
I think we have roving mics if anyone
wants to uh put up their hand and ask a
>> Yeah, thanks for the talk. Uh, you mentioned using log data for training rerankers and retrieval models. Can you talk a bit more about your experiments with LLMs as teacher models? Because for things like relevance, I assume they're still very powerful, and you can distill down to ModernBERT.
>> Yeah. Um I I think LLMs as teacher
models can work, obviously. Um, you know, you have to pay careful attention to the terms of service of the LLM that you're using. Uh, there are open LLMs that are very powerful these days, and they can work very well as teacher models, and, uh, you know, definitely, we think there's a lot to be done there. We think there's an awful lot of things that people are using in production where they have a big, heavy LLM, and that can be a great way to get started. But, uh, you know, it's been surprising to us how well the ModernBERT-style encoder models can work, uh, with a teacher model, perhaps a large open-weight teacher model. So, yeah, I think that's a great direction. I'm very bullish on that direction of investment. Yeah,
maybe one or two more. Um have a mic for
Hi, thank you for taking my question and
great presentation.
>> Thank you.
>> Uh, I was curious how you are attributing acquisition of new customers to the advancements that you've made to Fin.
Um you know meaning have you gotten
customer feedback that the new um
improvements uh have led to expansion?
It could also be just a coincidence. I
was just curious.
>> Sure. No, absolutely. So, um, you know, we compete a lot on the quality of our product, and we bill based on a successful resolution. So, we were a very early adopter of outcome-based pricing. We bill a dollar when Fin successfully answers the question. And so, there is a sense in which improvements in the core product directly add to revenue for us.
But also we find that customers,
especially sophisticated customers, they
run Fin in head-to-head trials against
other competing products. And we really
encourage that. And the gold standard is
an A/B test. Sometimes people do before-and-after tests, and sometimes they do an A/B test. And, um, so we really feel that that's our differentiator, and that's something we compete very hard on. That's really the reason why we do this, and that's kind of the single biggest thing that we have deeply, deeply invested in, um, from a technology perspective. And it does work for us. It does help us win head-to-heads and convince customers to come to Fin. We're very proud of that.
So yeah, and maybe last question here if
that's okay. This uh lady here in the uh
pale shirt.
>> Sweet. Thanks. Um, I understand the value of data as, like, a core asset to build differentiation, but how do you think about integrations and context, when most users outside of, like, core B2B likely have a horizontal LLM up on their window as, like, a split screen? And how do you build, like, that platform OS?
>> right? So, I mean that is a hard
question. Um so you know
integrations and context, obviously there's a lot going on; the space is moving really fast. Uh, MCP is a huge, big change to the space, to help people pull in context from lots of different, um, you know, platforms. We have a procedures and a tasks product. We spend a lot of time helping people integrate Fin, when they use it, with their other systems, um, you know, calling APIs and being able to handle complex queries like that.
But look, I think your question is even
broader than that. It's a very evolving
field. Um there's a lot of different
players and there's a lot of people
trying to solve this problem of like I
have an AI system, but to make it really
valuable, I need to integrate it with
all these other business systems and
there's a lot of hard slog there.
There's a lot of hard work to do that.
Uh we have like teams that will partner
with a customer and try and help them do
that integration, but it's it's a lot of
leg work. It's still hard to do. MCP is
is really changing it. Yeah.
>> Is that a driver of differentiation?
>> Um, is it a driver of differentiation for us? Um, it's definitely something we're
good at and we invest heavily in. Um I
think everybody is running around trying
to connect the AI systems to the other
systems of business. Um so I I think
it's it's a valuable thing and if you
connect it to an application that does a
great job um it'll do a good job. I I'm
not sure if it's a differentiator or
not. Okay, I'd better leave it at that. Um, I want to hand over to, uh, Brett Chen, who has very kindly agreed to come and talk to us today: a tech lead, a member of technical staff at Perplexity. He's going to give a keynote on scaling intelligence. And Brett has, uh, done some great work in the past, including writing a book on lifelong machine learning. I had a great chat with him before.
Okay. Um, hello everyone. I'm Brett. Uh, today I'm going to share our firsthand experience of building AI agents and models to serve millions of users and hundreds of millions of queries at Perplexity. And I want to pass on some lessons and takeaways learned in the past and help you avoid similar pitfalls.
So yeah, at a high level: I suppose some people have probably heard about Perplexity and maybe use Perplexity, but for those who haven't, I'll talk a little bit about Perplexity as a company, and then I'll showcase some of our recent agentic products, and then I'll go into the AI agents, and then wrap up talking about some of the post-training models. So, Perplexity: it started three years ago; actually, last month we just had the three-year anniversary. Uh, right now it's valued at $18 billion. We have about 300 employees globally, and we have our office headquarters here in San Francisco, just, like, three blocks away, and Palo Alto, New York City, uh, Austin, and some other places.
Um, so this is our data from three months ago, public data. Um, and our recent data has been even stronger. But let's talk about, yeah, the data: in May we had 780 million, uh, queries, and it's 20% monthly growth, and we have 22 million active users, um, and, uh, ARR is 100 million. And more recently, our app has been ranked number three in the US App Store for productivity, and our end-of-year query target goal is 1 billion queries per week.
And the reason I said the data has been, uh, even stronger recently is because of our, uh, recently launched browser, called, uh, Comet. Um, so this is an AI-native browser that can actually help you do a lot of things, like summarize, research, automate stuff. Um, it can have a seamless interaction with you as the user and do a lot of comparison stuff, right, like book meetings, like compare products across different tabs, and research topics, without you switching back and forth between different apps. And it has your context, it has your session, and then it can adapt to your workflow and preferences, and it also stores data locally and keeps your data private and secure. And another very interesting one is we have voice integration, and we really believe that's the future of interaction with the browser, because nowadays we mostly use the keyboard and mouse, because that's just how we click the website or type stuff, right? But with an AI assistant like that, we don't actually need to type to the AI system; we can just talk to it, just like you talk to another person. So, um, yeah, so let's watch a
Pull up the clip of Jensen demoing
Perplexity Labs.
I've pulled up a YouTube video showing
Jensen demoing Perplexity Labs at GTC
Paris. It should be at that moment to
formulate what is now agentic AI. Let's
take a look at one example. Let me show
[Laughter] [Music]
>> Yeah. By the way, something I'm always amazed by: our marketing team is great at building videos. Uh, okay. So another product, uh, also a recent agentic product, is called Deep Research and Labs, and it's focusing on those long-running tasks that would take a human being hours, if not days, and we can finish them for you in minutes. Uh, it can deliver an in-depth and cited analysis report, and aggregate different web sources and documents, and, uh, remove duplicates as well as resolve conflicts, and it will summarize that and then present it for you to make the best decisions. And then you can also produce the reports in different formats so that you can easily share with others. So, for example, with the output we can build some dashboards or mini apps that you can use, uh, yourself, uh, you can create slides, and then you can also export into different formats of documents, and again iterate on them. Some
other products include like different
verticals, like Perplexity Finance, where you can look into the stock market and economic data and really tailor that to, um, the market or the, uh, events that you care about. Similarly for sports: you can go and, uh, search particular leagues, teams, and players. And then we also have, like, our Discover product, which is a feed system that, uh, curates and provides articles tailored to your interests and your needs. Um, and there are many, many other products. The reason I want to bring up these products is to give you a sense of Perplexity's scale, like the
building and that connects to the next
point of AI agent because it poses a
unique challenge to AI agent that we
want to build these centralized AI
agents that actually work for all these
products and meets all these products
needs. So okay let's go into the meat
and potatoes of this talk. AI agents. Um
So I'm going to first start with what production-level AI agents means, right, like what kind of considerations we have there, and then I'm going to talk about prompting, I'll talk about evals and personalization, and then... Okay. So, like, this is, uh, just as mentioned earlier: in the early days of LLMs, it's just a simple application talking to an LLM, right, like single-step integrations, things were really simple. And then we have the AI agent, which is the layer between apps and models, and the AI agents are, like, sort of the foundation of this layer, right, doing the orchestration, all kinds of workflow. And at Perplexity we are not just talking about AI agents as a communication between one model and one application; actually, we have access to dozens of models from different providers, and then we use those models to build this AI agent and empower many more products and applications. So here we're talking about both external models as well as our in-house models. So that's the kind of component I'm going to talk about, and that's where I lead my team, building these, uh, AI agent architectures and workflows to work for different cases.
So yeah, so let me, um, first of all talk about what kind of problem this is. So, just to abstract a little bit: instead of talking about certain components or some details, I want to frame it as a multi-dimensional optimization problem, and there are different dimensions, different objectives and constraints here. The first one is quality, and that's
what people talk about all the time,
right? Especially when the new model
comes out, right? It's like I'm I'm best
at doing this task, I'm doing that. Like
my score is X% higher than the other
ones. Um and obviously there are many
ways to evaluate it when it comes to
like accuracy when it comes to
relevance, coherence, hallucination. Um
so this is something I think most people
are familiar with. The second one is
latency. And this is something I I think
again many of you are care about like we
want the model to be fast. So we don't
want a user to wait forever, right? And actually, interestingly, when I talk with people within the company as well as outside the company, these are usually the only two dimensions they think about, and that's it, right? They just say, oh, I want a high-quality model and I want the model to be fast, and I always ask, do you have other considerations? They say no. Okay, but actually, when it comes to production-grade agents, there are other key factors.
So the third one here is actually
reliability and availability. So this is
actually the key difference in my
opinion that would distinguish
your product from others. Like from my
experience it's very easy to build a
demo or something that can achieve 80 I
would say 50 60% of success rate but
it's much harder to get to 90% 95% or
even 99%. Right? And that's where you go
from an average product to a great
product. And that's reliability. So if there's just one message you can take away from this talk, it's that reliability is what distinguishes your product. And reliability here, again, comes down to the error rate and the success rate, the uptime and all kinds of things. And when it comes to using different models, it's even more challenging, because then we need to do all kinds of load balancing. The last one, and obviously
it's still also top of mind is the cost
right we want the best model we want to
like um best quality but still they come
with a cost so there's a balance here
and so for each product we need to care
about these and there are obviously
other things as well like security all
kind of other things but these are
usually the top four that we think about
And at Perplexity we also need to think about these across different applications, because it's not just one product. Like, we cannot just feed one model to all products, right? We need to load balance; we need to figure out what model works best for each product.
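One way to picture weighing those four dimensions per product is a simple scoring function over candidate models. The weights and numbers below are invented for illustration; real routing at this scale also handles load balancing, fallbacks, and per-product constraints.

```python
# Toy multi-objective model selection across quality, latency, reliability,
# and cost. All figures and weights are made up for illustration.
def pick_model(candidates, weights):
    def score(m):
        return (weights["quality"] * m["quality"]
                - weights["latency"] * m["p95_latency_s"]
                + weights["reliability"] * m["success_rate"]
                - weights["cost"] * m["usd_per_1k_queries"])
    return max(candidates, key=score)

candidates = [
    {"name": "big-frontier-model", "quality": 0.93, "p95_latency_s": 6.0,
     "success_rate": 0.97, "usd_per_1k_queries": 9.0},
    {"name": "small-fast-model", "quality": 0.86, "p95_latency_s": 1.2,
     "success_rate": 0.99, "usd_per_1k_queries": 0.8},
]
weights = {"quality": 10, "latency": 0.5, "reliability": 5, "cost": 0.2}
best = pick_model(candidates, weights)
```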
Um, okay, so let me go into, uh, touch on some specific points. Uh, I just want to, again, share our experience, and I hope you can learn something and avoid the pitfalls. So I'm going to start with prompt engineering, and you may think, hey Brett, it's 2025, right, why are you still talking about prompting, right? It's something that happened in the early days, you know, at the beginning of LLMs. But in fact, believe it or not, even at Perplexity, as we've been doing AI from the very beginning, prompting is something that we still spend a lot of time on, and it actually requires us to refactor and redesign our system all the time, because this is also a field that is moving fast. So when
it comes to prompting, again, the concept is very simple: there are just some messages, usually of three types, system, user, and assistant, right? Like, you put the information into one of the types, put them together, and make a call. Okay, but when it actually comes to the production level, there are a lot of considerations. For example, a single prompt versus multiple prompts: what does that mean? Previously we had this mindset that, oh, we have a new product, let's create a prompt; we have another model, let's create a prompt, right? Why not? But then we got an exponential number of prompts, and then no one can manage it. Like, let's say I want to update one prompt; I don't know if I need to update the others, and these prompts all look very different from each other. So then we started doing all the refactoring, making sure we have modules, making sure we share templates, right? Like, people just cannot create random prompts for their products; we have some guidance there.
And then there's an interesting question about, okay, who actually owns the prompt, right, when it comes to different teams? Should it be the product team, who actually know the product better, or should it be, say, the AI team, who actually know the prompting better, but may not know the products as well as the product teams? So again, it's a balance; it's some sort of, um, middle ground. But again, that's something that, when it comes to different teams and products, is very tricky to figure out. Um,
another thing: context engineering. I think some people asked about context before. Like, some people say context is everything; in some sense I agree. Like, context is what makes your quality go to the next level. But then, when it comes to what context, like what kind of context you want to put in, do you want to just put in as much as you can? Um, that obviously comes with its side effects, and that connects to the next one: prompt caching. And this is actually a pretty big one. Uh, like, this is one of the reasons that we did a lot of redesign of our system, because previously we just didn't pay attention to it, and we were not following the best practices, and we had people just inject, uh, inject fields into the prompt as they wanted, and, like, okay, that breaks the prompt caching, and we just leave free money on the
table. So, one rule of thumb here: if you're working on multi-turn, uh, agents, um, you try to get your prompt cache hit rate up to 80% or more, but if you constantly see the rate going down below 50%, that's something you should take a look at.
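The practical upshot of that rule of thumb is to keep the large, stable parts of the prompt as an identical prefix on every request and append the fast-changing parts last. A hedged sketch of that message layout follows; the exact roles and fields are illustrative, not Perplexity's implementation.

```python
# Sketch of a cache-friendly prompt layout: stable system instructions and
# tool definitions first (byte-for-byte identical across requests), slowly
# changing context next, and the fast-changing user turn last, so the
# provider's prefix cache keeps hitting. Field layout is illustrative.
def build_messages(system_prompt, tool_definitions, memory, user_turn):
    return [
        # Stable prefix: identical on every call -> cacheable.
        {"role": "system", "content": system_prompt + "\n\nTOOLS:\n" + tool_definitions},
        # Slowly-changing context (per-user memory) comes next.
        {"role": "system", "content": "USER MEMORY:\n" + memory},
        # Fast-changing content last, so it never breaks the cached prefix.
        {"role": "user", "content": user_turn},
    ]
```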
Um, and then there's also being eval-driven, right? Like, um, again, the thing we did before was, the product engineer just, um, received a bug, did some vibing, and came up with some queries, and then they changed the prompt. They did what we call vibe checking, and if things looked good, okay, we shipped it. Um, that works sometimes, but oftentimes it didn't, because as individual people change a prompt, no one actually manages it and makes sure this prompt still works for all the cases.
And so I'm going to briefly touch on eval. When it comes to evals, I'm mostly talking about LLM-as-judge evals; the rest is a different piece, so I will skip it here. So when it comes to the answer, there are the things I mentioned before, but what's interesting right now is it's not just an answer, it's not just text, right? We're talking about more advanced output, like mini apps or slides, right? With slides, you don't just have content; you have the images, you have the flow, right? How do you make sure that makes sense? And when it comes to things like the browser, you have actions, right? You want to click certain things, you want to type certain things, you want to, like, book certain things. How do you know it actually works or not? So
that's some very interesting ones. And
then format and styles. That's another
interesting one that people just, you
know, some people prefer paragraphs,
some people prefer bullet list, some
people just want a short answers, some
people want a long answers. So how do we
figure out and that connects to the
personalization and so personalization
eval is definitely another green field
opportunity. So let me also briefly talk
about personalization and memory. Um, so at Perplexity we really believe personalization is what makes your AI product stand out, because that's what makes the user feel like, oh, this product, this AI, actually understands me, they can actually solve my needs. So we treat personalization memory as a first-class citizen in the AI agent, and what that means is that we do a lot of work in figuring out what exactly should be stored as memory for users, and that includes, like, short-term and long-term, and that includes how we can actually do real-time updates. And this is
actually a very interesting one that we
build an entire infrastructure just to
make sure that if you tell me, I like reading books, and you ask me right away, what do I like, I will tell you right away, you like reading books. And while this sounds simple, again, when it comes to LLMs, things are slow, things are brittle, so you actually need a very, um, sophisticated infrastructure to enable it.
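The "tell me you like reading books, then ask right away" behavior boils down to memory writes that are immediately readable on the next turn. A toy sketch follows; it is nothing like the real infrastructure, which has to handle slow, brittle LLM calls around it.

```python
# Toy real-time memory store: writes are visible to the very next read,
# rather than waiting on an offline pipeline. Purely illustrative.
from collections import defaultdict
import time

class MemoryStore:
    def __init__(self):
        self._facts = defaultdict(list)  # user_id -> list of (timestamp, fact)

    def remember(self, user_id, fact):
        self._facts[user_id].append((time.time(), fact))  # real-time write

    def recall(self, user_id, limit=20):
        # Most recent facts first, available immediately after remember().
        return [f for _, f in sorted(self._facts[user_id], reverse=True)[:limit]]

store = MemoryStore()
store.remember("u1", "likes reading books")
store.recall("u1")  # -> ["likes reading books"]
```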
Um, and then on the product side, obviously you want to have transparent memory management, as well as privacy: for sensitive information, there are certain things we don't want to store in memory.
Okay. So, um let's also uh briefly talk
about the um MCP and the tools. So, so
this is again a fairly new field, right?
Like MCP is sort of started gaining I
would say industry traction probably
early this year, right? So, it's still
new a lot of ideas out there. So, I want
to share like what what we have been um
trying. So one thing we found is that instead of having universal tools, MCP, meaning you just feed as many MCPs to the model as possible, that just doesn't work. Like, you need to figure out what the high-impact ones are, what the ones are that actually make a difference, right? I think probably people say search, right, that's one of them, and maybe coding; so these are some common ones. But for your product, what is actually needed, what kind of MCPs are needed, and then also make sure of reliability.
Um, something we always experience is that these tools and MCPs are not reliable. Uh, many of them are not reliable because it's so new, right? People just rushed to build them, and then they have a lot of limitations. And we have also been thinking internally about how we build an ecosystem, right? As there are more and more MCPs out there, how do we actually figure out how to integrate them, how to leverage them, how to figure out, given a particular user request, what are the best MCPs to use?
And another interesting one, related to MCP, related to the model, is how to actually manage state. Because as the tasks get longer and longer, we want, with the browser, to spend minutes and even, later, like, an hour to help you achieve something. The longer it takes, the more problems you're going to get, right? The model may be broken; things can really, uh, get worse. So how do we actually make sure that we can recover and backtrack if things don't go our way?
Okay. So let me, uh, wrap up quickly with some post-training, uh, stuff. So I'll talk about, uh, two items: the system and the reinforcement learning. So there are different challenges we have been facing when it comes to post-training. Uh, one of them is just scale, right? As the scale grows, the model gets more powerful, but that comes with all kinds of, uh, challenges, especially when it comes to the infrastructure. So we have been spending a lot of time building this internally; we call it Lotus, um, a learning, optimization, and tuning system. It's an all-in-house post-training system that supports large scale and is really simple to understand and hack, right? Like, researchers come in and, like, try different configs, different algorithms, different models, and quickly come up with results. And then we also enable different state-of-the-art algorithms. Um, and on this side, this is the architecture. So
happy to discuss that offline. So
Another quick thing, another quick challenge: again, when you come to AI agents, it's not like the previous chat-based, right, or just question-answer. We want to do tool calls, we want to do MCP, we want to do things, uh, beyond just, uh, text, so that comes with different challenges. It's very noisy; uh, we don't know when to stop, right? If a user asks the AI to do things that may take hours, should we actually do it, or should we actually, like, help the user manage expectations? So we have been training our own, uh, model through reinforcement learning with our own agent and environment. For example, in this case we have our browser as the environment that, uh, takes the user actions, and we use that to train our own, uh, tool-call models. Um, again, happy to chat more about it offline. Um, yeah. Okay. With that said, that's the end of my talk.
>> Thank you very much, Brett. We are big fans of Perplexity at Intercom. Uh, Comet went viral inside Intercom; we were all comparing, all showing each other what we did with it. Um, between Fergal's and Brett's talks, I think we're, like, preaching to the choir here, but the delta between a vibe-coded demo in a weekend and an at-scale, performant system is orders of magnitude of effort, uh, and there's historically not been enough, uh, sharing of the knowledge that we're all learning about this. So I love an event like this, where we're getting deep into the details and exposing it to people to learn from each other. We're going to take a short break now. Uh, grab a drink, stretch your legs, uh, and we'll come back again in 10 minutes, uh, for our last session. Thanks.
[Music plays during the break]
Okay, welcome back. Thanks everybody. The technical papers will be open at the end of the event as well. Hope you got to enjoy the break. Uh, hope you're enjoying the event so far. We have two sessions left. The first is Molly Mahar from Fin, who is going to talk to us about some of the lessons that we have learned, the hard truths we've learned building AI products. And then we're going to have a panel discussion, uh, with amazing leaders from Cognition and Harvey and Fin talking about their own lessons. Um, so please put your hands together and welcome Molly to the stage.
So, so far tonight you've been hearing about the technical challenges of
about the technical challenges of building AI. Um, I want to talk from a
building AI. Um, I want to talk from a different angle. I want to talk a bit
different angle. I want to talk a bit about the people and the org challenges
about the people and the org challenges of building AI products. So, as Fergle
of building AI products. So, as Fergle mentioned, intercom has been around for
mentioned, intercom has been around for a while. Um, so we have habits, right?
a while. Um, so we have habits, right? We have processes. And two and a half
We have processes. And two and a half years ago, we had to become an AI
years ago, we had to become an AI company. So we had to redesign, rethink
company. So we had to redesign, rethink how we design, how we ship, how we
how we design, how we ship, how we organize ourselves, right? So that that
organize ourselves, right? So that that pain of cultural change that that Ferggo
pain of cultural change that that Ferggo mentioned is the process of us like
mentioned is the process of us like doing things poorly, failing, picking
doing things poorly, failing, picking ourselves up and doing it again and
ourselves up and doing it again and again and again, right? Because a
again and again, right? Because a company is just a group of people with
company is just a group of people with their own habits, with their own
their own habits, with their own incentives, with their own expectations.
incentives, with their own expectations. And so when you're pivoting into AI, you
And so when you're pivoting into AI, you are asking all of these people to change
are asking all of these people to change the way that they work. And I don't know
the way that they work. And I don't know about you all, but like I find it very
about you all, but like I find it very hard to change other people, right?
hard to change other people, right? So these people challenges can sneak up
So these people challenges can sneak up on you if you're not prepared. And so
on you if you're not prepared. And so tonight, I wanted to share um five
tonight, I wanted to share um five painful truths that that I've
painful truths that that I've experienced at least working on AI
experienced at least working on AI products with the hope that they're at
products with the hope that they're at least on your radar um if you're
least on your radar um if you're building things uh might make your lives
building things uh might make your lives a little bit easier. So, let's get to
a little bit easier. So, let's get to it. Um, my truth number one, demos are
Truth number one: demos are dangerous. Product orgs love to share early, share often. But when you're doing AI, that gets really risky. A shiny demo hides brittleness, it hides hallucinations, it hides integration gaps, right? When you're demoing something, you're making a promise about the quality of what you're building. And if you demo too early, you get product leaders who set marketing launches before anything's done. You get teams aligning around this thing that you haven't even built; they're prematurely rationalizing it into the product. We've seen that at Intercom. We've seen product teams outside of the AI group with some small AI feature that's kind of a thin wrapper, so they think, "Yeah, this will be easy to build ourselves." They do a happy-path demo, people get really excited, they get buy-in, then they start to build it, and it all kind of collapses in on itself as they meet the unhappy path and they need our help. So here are some ways we've figured out to deal with this situation. We, the AI group, provide advice and testing resources to other product teams so we can set them up for success. We keep a lot of projects secret, or at least on the down low, until we're ready and we feel they're good enough. And when we do finally think they're good enough to demo more widely, we hedge those demos: this doesn't work, this is unstable, here's where we are, here's how far we have to go. We did that with the Finn alpha: we were working in secret, and then when we finally demoed, we were just very clear about what's risky and what's still unknown. So the takeaway there, I think, is demo only what you're willing to be accountable for, and be really clear about what's risky and what's unstable, because as soon as you show something shiny, that polish communicates a stability that your product does not have yet.
That takes us to truth number two: polish is a trap. Do you hear that much from a designer? There's a normal tension in product teams: do we ship fast, or do we ship high quality? But when you're working in AI, you have this new element, which is: is this thing even feasible to build at all? So you've got designers who really want to polish something, you've got product leaders who want brand consistency, and then you have ML teams who need to get something into users' hands fast. Balancing that is really hard. Intercom has always had this strong critique and feedback culture. We have the concept of curious, minor, and major feedback that we give. We also disagree and commit a lot. But even those healthy processes have not stopped us from falling into this death spiral, I'll call it, where you're working on the design for something and your design is shaping what the output of the model needs to be. So if you're revving on design too much, you're slowing down the progress of the model, and then you're not actually able to make it robust, and you can't design in response to that, and it goes around and around and around. You get stuck, and it's super painful and super frustrating. So how do we try to handle that? We've kept the feedback culture; that part's actually great, the transparency and clarity we have between teams. The new thing is my role here, the AI designer role. We sit between the design team and the ML team and act as a bridge, so we join a project from day one, we ensure the system quality from a UX point of view, and we balance that with polish. So we're actually in the weeds with the scientists. As an example, when we were building the Finn alpha, I was working on that and felt there were gaps in the quality of Finn's answers, and we were really close to going to launch, and I thought these were not quite good enough. So I went in and started writing my own prompts: doing prompt engineering, writing my own variations, doing offline evals and testing, getting a sense of how the model works and what its limits are, coming up with a good proposal to make to Fergal and to the scientists, convincing them that my version was actually better, that it was a better experience for the users, and then handing that off to the scientists to make it robust. And that's what ended up launching.
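To make the "offline evals" idea concrete, here is a minimal sketch of the kind of prompt-variant comparison described above: run a few logged questions through each candidate prompt and compare scores. The prompts, test cases, and the model and grading stubs are all hypothetical stand-ins, not Intercom's actual tooling.

```python
# A minimal offline-eval sketch: compare two prompt variants on logged questions.
# Everything here (prompts, test cases, generate(), grade()) is a hypothetical stub.
from statistics import mean

PROMPT_VARIANTS = {
    "baseline": "Answer the customer using the provided help articles.",
    "proposed": "Answer concisely, cite the article used, and admit uncertainty when unsure.",
}

TEST_CASES = [
    {"question": "How do I reset my password?",
     "reference": "use the forgot password link on the sign in page"},
    {"question": "Can I export my data?",
     "reference": "yes from settings under data export"},
]

def generate(system_prompt: str, question: str) -> str:
    # Stand-in for a real model call (an API or an internal model).
    return f"placeholder answer about: {question.lower()}"

def grade(answer: str, reference: str) -> float:
    # Crude token-overlap score; in practice this might be an LLM judge or human review.
    ref = set(reference.split())
    return len(ref & set(answer.lower().split())) / len(ref)

for name, prompt in PROMPT_VARIANTS.items():
    scores = [grade(generate(prompt, c["question"]), c["reference"]) for c in TEST_CASES]
    print(f"{name}: mean score {mean(scores):.2f} over {len(scores)} cases")
```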
So that's how we act as a bridge, and it helps keep this instinct to polish in check. The takeaway there, I think, is to resist the urge to polish before feasibility and value are actually proven, because debates about value and about polish sound like they're about craft, but they're really about focus. In AI, your focus sometimes has to shift overnight: there's a breakthrough, there's a dead end, and suddenly your roadmap is out the window. That's truth number three: roadmaps will fail you. Any static plans you have are just going to collapse when the models surprise you. It just does not work; you have to be reprioritizing all the time. Intercom used to work in these six-week product cycles. I've seen the nice schedules the teams used to have: each week they knew exactly what they were working on for the next six weeks, and they had cross-team alignment. That is no more; that's totally gone. Instead, what we do now is this flexible workstream model where people and tasks can get reallocated on demand, so the shape can always be shifting and we can be really responsive to the needs of any project at any time. As we do weekly planning, we ask: are there any big bets that we're not making that we need to make right now? We also ask: what are the items we most need to derisk this week? So we're kind of like a multi-armed bandit: we're exploring big bets that we haven't made, and we're exploiting the things that we know have value and need to build deeper on.
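The multi-armed bandit framing above is an analogy for planning rather than a description of Intercom's tooling, but for readers who haven't met the term, here is a minimal epsilon-greedy sketch of the explore/exploit trade-off it refers to. The workstream names and payoff numbers are invented.

```python
# Epsilon-greedy illustration of the explore/exploit trade-off behind the bandit analogy.
# The workstreams and their hidden payoff odds are made up for illustration.
import random

workstreams = {"big bet A": 0.2, "big bet B": 0.5, "known winner": 0.8}
estimates = {name: 0.0 for name in workstreams}   # running payoff estimates
pulls = {name: 0 for name in workstreams}
EPSILON = 0.2  # fraction of weeks spent exploring instead of exploiting

for week in range(100):
    if random.random() < EPSILON:
        choice = random.choice(list(workstreams))     # explore: a bet we haven't made
    else:
        choice = max(estimates, key=estimates.get)    # exploit: the best-known bet
    reward = 1.0 if random.random() < workstreams[choice] else 0.0
    pulls[choice] += 1
    estimates[choice] += (reward - estimates[choice]) / pulls[choice]  # update running average

print(estimates, pulls)
```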
And that rhythm feels natural to ML folks, but it feels like total chaos to product teams who are used to working in those six-week cycles. So there are hidden costs to having to work this way. There's relationship management: because people are always shifting around, you have to keep earning people's trust, and you're always renegotiating everything. That's tough, and it's hard for people to do over and over and over again. But ultimately survival is about this ruthless, constant reprioritization, so you just have to deal with it. Because customers, execs, other teams, they all want something from you, and that's why no is non-negotiable. No is necessary. That's my fourth truth of the night. Customers are great. I love customers. They have a lot of expectations, they have a lot of requests. How do you choose what to build for them? We are lucky: we get a lot of good feedback that's grounded in real workflows. When we get negative feedback, like "this will not work for me," that's great; we can really trust that. But when we get positive feedback, like "I think this would be really cool," it's a lot harder to know whether that's a real need they have or whether they just saw something shiny on someone else's demo. So parsing out what you should actually work on can be very, very hard, and it's also very scary. It's scary as a person when you say no to a customer and they're threatening to churn, or you say no to your execs who want something specific, or you say no to five other teams who want something from you, and you just feel like a dirtbag for saying no all the time. We just make Fergal say it all the time, so that's easier for me. But saying no lets you work on the bets that you are making. So what do we do to manage this? Well, we use usage data and honest feedback to try to separate the shiny stuff from the real stuff. We look really hard at whether something we're thinking about making is a good business decision or if it's just a really expensive API call in disguise. And generally our default answer is actually no. It gets easier the more you do it, because the demands will overwhelm you. Saying no is not failure; it's focus. But one of the hardest things we've gone through lately is that no only works if people have the authority to make it stick.
So the last truth is: ownership can sink you. Products live and die by who's the DRI, the directly responsible individual. If you have the wrong owner at the wrong time, you can totally sink your product, because people have their own agendas. The ownership model Intercom had before was this triad model, with a PM, a designer, and an engineering manager, and they made decisions collaboratively. That's great for working together and having a lot of agreement, but it's a lot slower; it dilutes decision speed. So we've tried new things and new models as we work on AI products. We tried just PME teams, but two things we've noticed: marketing pressure tends to creep in there and push to launch too early, and it can be hard sometimes to say no to a lot of demands from some big customer that you might be trying to build for. We've also tried MLE teams, and that can be hard too: you build it, it's great quality, but then you have to hand off to a product team to own it, and they might feel a lack of ownership, or a lack of vision in what you've built, and so you might not get enough investment afterwards. So what we do now is have a PM as the DRI, and then a strong technical ML lead and a design lead who advocate for our positions. But with the PM you have a single decision maker, so you can move faster, and we've found that to be pretty smooth. It's not perfect; we're still working things out. One of the tough things is that it doesn't necessarily work the same way for every project, because you've got different people, and people are not totally interchangeable across different projects, so things work differently. But the takeaway, I think, is that you have to scope decision rights as carefully as you're scoping your features, because if your ownership fails, all those nos you said mean nothing: your roadmap still collapses, polish doesn't matter, and your demos were all false promises, because your AI product is just off track. So I think competing in AI means you have to live these truths every week. You can't just ask, is our model ready? You have to ask, is our company ready to handle all this pressure? Am I ready to hold the line in a tough situation? I think one half of survival is: do you have a really great model, do you have really great tech? But the other half is: do you have a bunch of people who are willing to deal with really uncomfortable situations and hard stuff, maybe go cry in the bathroom, and then come out, work together, and leave at the end of the day with a smile on their face, happy to come back the next day? Because I think you need both of those parts to really be successful at building AI products.
Thank you. That's it.
Thank you so much, Molly. Truly hard-fought lessons. Over the last couple of years, as we've transformed Intercom from a historical SaaS company to an AI-first company, it has been blood, sweat, and tears. Everything we thought we could take for granted has changed. It's been really, really fun. Okay, our last session: we're going to do a panel with amazing leaders from these companies. We have Nico Grupin from Harvey, the AI for law firms and the Fortune 500. We have Silas Alberti from Cognition, the team behind Devin and Windsurf. And a person called Fergal Reid, behind a product called Finn you might have heard of. Please welcome all to the stage.
[Applause]
Okay, thank you very much for doing this, folks. Our thesis tonight is that building great AI products turns out to be more than just a thin wrapper around an LLM, to say the least, and that there's durable advantage in that. My softball is: do you agree, and what does that mean for you? Maybe start with you, Nico.
>> Yeah, we can go in a number of different directions with this one. First of all, I think the way you frame the question absolutely aligns with my mental model, which is the product as the focal point. I think our story actually starts even one step earlier than that, which is partnering with our customers, in this case law firms: embedding ourselves, immersing ourselves in their workflows, and understanding their core problems. Something we take a lot of inspiration from at Harvey is unreasonable hospitality. This is of course a reference to Will Guidara and his book, and to Eleven Madison Park, the restaurant in New York, and the operation they've been able to spin up quite successfully, in large part due to their unique approach to services-based work. And what this means from a product development standpoint, what we're trying to convey by taking that as inspiration, is that it actually starts with deep customer obsession. Who are your users and what are their core problems? Only after you understand that can you work your way backwards to the product and product experience needed to solve those problems, and then you work your way backwards to the AI models and systems needed to support that product experience. In my experience, it's really challenging to try to push AI functionality in the other direction, and given how quickly the ecosystem is moving, there are a lot of incentives and external pressures to do that. And then, when it comes to actually building a product, I get asked this question all the time: what's challenging about building at the application layer? I think the reality is, and I really don't think enough people talk about this, that it's all challenging. There's quite literally no easy part. There are certainly frameworks you can use to scope or frame the difficulty of a problem. In the AI world, the things I'm thinking of are things like: how much domain expertise is required to solve the problem? How verifiable are your outcomes? How much does the problem space rely on manual processes or tribal knowledge? And that helps you frame the AI problem, but it's just one in a basket of other problems, including infrastructure, integrations, security and privacy, which are table stakes for enterprise use cases, and of course intuitive UX and design. So the main takeaway for me is: when you sign up to build at the application layer, you're signing up to solve this whole problem, and you have to master each of the individual components to deliver a valuable product to your customers.
>> And it feels fractal, right? It feels like getting to five nines of reliability or something: the deeper you go into it, the harder and harder it gets. And you start with, the way I interpret it, domain expertise; before you even get to the AI, you really need to deeply understand what problem you're trying to solve.
>> Silas, I've heard something similar about Devin. Something that's very compelling about the way I've heard Devin framed is that it's not a coding agent, it's trying to solve the job of software engineering. Do I have that right?
>> That's correct. Yeah. I think I also very much agree with the overall thesis here. I feel like two years ago everybody was talking about how, oh yeah, the labs are going to eat everything, these are just thin wrappers around them. And we started out being this applied AI lab, not really sure initially what was going to be the bulk of the stack that we would own. And then we started actually talking to customers and building stuff for them, and noticed that the problem of actually delivering value to real engineering organizations is a very deep product problem. So it started with all the infrastructure for actually enabling software engineering agents to work in real enterprise environments: from virtual machines to run the code, to all the plumbing to connect them to your AWS and your Jira and your Linear and your GitHub, and also all the different interfaces you want to build, whether it's a web app, integrating with Slack, and even now the IDE with Windsurf. And I think the other deep product problem that we think a lot about is interfaces. On the one hand, obviously the IDE is a pretty big interface for software engineering, but we all kind of believe that might not be the interface in five years. We also don't think it's just going to be a chat. A lot of what we think about is: what is the real future interface for how people write code? We think it's a pretty challenging design problem that also involves co-designing ML systems and even models, and I think it hasn't been solved yet.
>> No, it definitely hasn't. The idea of, well, we'll get into the model stuff, but certainly the idea of interacting. There's something in my mind about this, and I'd love to understand the equivalent in the law use case, but we think about it a lot with Finn: if you're trying to solve, for us, the job of replacing what a customer service rep does, it isn't just answering questions. They have to interact with the rest of the team. And Devin, Silas, the surfaces, how it interacts with the ecosystem it's in: what kind of interaction patterns and services are you imagining might be in the design envelope there?
>> A lot of things. I mean, it starts with the source of the task, right? The task might come from right in the developer seat in the IDE, but it could also come from some customer bug report in Slack where somebody tags Devin right in the thread. It could also come from some issue tracking system like Linear. And we even imagine a lot of other sources of tasks that are already possible with MCP integrations, like Datadog alerts triggering automatic agent triage.
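As a concrete illustration of that "alert becomes an agent task" pattern, here is a minimal sketch of a webhook receiver that turns an incoming monitoring alert into a task for a coding agent. The payload fields and the create_agent_task helper are hypothetical stand-ins, not Devin's or Datadog's actual APIs.

```python
# Minimal sketch of the "alert becomes an agent task" pattern described above.
# The payload fields and create_agent_task() are hypothetical; real monitoring
# tools and agent platforms each have their own webhook and API shapes.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def create_agent_task(title: str, context: str) -> None:
    # Stand-in for calling an agent platform or issue tracker API.
    print(f"queued agent task: {title}\ncontext: {context}")

class AlertWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        alert = json.loads(body or b"{}")
        # Turn the alert into a triage task with enough context for the agent to act on.
        create_agent_task(
            title=f"Triage alert: {alert.get('monitor', 'unknown monitor')}",
            context=alert.get("message", ""),
        )
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhook).serve_forever()
```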
>> Or Finn. I'm looking forward to the day where Finn opens a task for Devin in Linear and then gives Devin a hard time: hey, it's been two weeks and the customer's asking for an update, what's going on, Devin? It's going to happen, right?
>> That would be sick. Yeah, right from customer bug report in Finn to PR.
>> Yeah, absolutely. And the law use case: what is the shape of the interfaces you might have in Harvey?
>> Yeah. Well, first of all, I think Silas's point on interfaces is spot on. In fact, we call the discipline applied research at Harvey very intentionally, and what that's intended to convey is that our responsibility here is actually equal parts AI and HCI, human-computer interaction. On the HCI side, it's all about what is the right mode of interaction with the models and with these AI systems, not just generally, but for our specific users, who are legal professional services practitioners. The biggest transition we've seen here is that there are some extremely complex tasks these folks are taking on on a day-to-day, week-to-week basis. Imagine something like fund formation: if you're a private equity firm, you're raising a new fund. This is something that will take multiple weeks, potentially multiple months. There are negotiations between a number of parties, correspondence between lawyers, between lawyers and clients, with LPs to negotiate specific terms and carveouts and side letter agreements. An incredibly complex process, right? It's not clear, and in fact I'd go so far as to say it's not going to cut it, to have a light, multi-turn interaction or a chat interface for that. Really what our users are craving, and the direction we're starting to steer our product, is towards a persistent workspace that houses all of the data and information and work product that you need. So if you're going to raise a new fund, you can show up to this workspace. It has all of your historical precedent from deals you've done in the past, intermediate work product completed by the legal team as you go, all of the correspondence, email threads back and forth, attachments with legal counsel and clients, and then eventually the finalized, polished work product that you end up using to sign and close the deal. All of this is self-contained in one workspace. And then as the process unfolds in these phases, you can delegate tasks to agents to complete along the way, you can delegate tasks to humans to complete along the way, you can delegate tasks to human-agent teams to complete along the way. And we're already getting asked for this to be collaborative and shared between law firms and their clients. So we see a transition from these lightweight, almost ephemeral interactions to something that's persistent, has memory, and is self-contained.
>> Sorry, what I was going to prime Fergal on was: this is the amount of depth required in really not just executing a task but solving a job end to end. That's the thing that's occurring to me in both of these cases.
>> Yeah, I just think it's fascinating to listen to this, because there's all this narrative around AI getting more and more general: will it solve all these problems really, really quickly, is it going to be six months and it's doing everything? And there's just so much complexity to any given task. I mean, we see that in customer service. We often think of customer service as the exception handler or the system integrator of last resort, right? It's the thing you go to when you need a human to navigate your org and make something happen because the system couldn't do it automatically. And so you bump into the same thing, which is that there's just an insane amount of complexity to actually doing real tasks, not in a lab or a back test but in a real, messy environment. It's fascinating to hear that across the different disciplines and across the different domains.
>> The messiness of all the different interactions, the idea of escalating to another AI agent or escalating a problem to a human, getting a human to go do something for you; there's a lot of domain complexity there. But even, and this is what's interesting, even if it's just executing a task that requires looking up data and integrating with an API,
>> getting a complex process to execute reliably is exceptionally hard. Any lessons, maybe starting with Fergal, things we've learned from that idea that keeping errors low on a complex task is a hard problem?
>> Yeah, we've learned it's really, really hard. You know, we have this tasks product where you'll go and try to do something like issue a refund or something like that. And it's very easy to get a demo that works now and again. But it's very difficult, if you have six different steps that need to be completed reliably, to not have an error creep in. And so our approach at the moment, with where the technology is, is to try and build a tool set to help our customers factor that big complex workflow they're trying to do into subcomponents that an LLM can reliably execute on. That's the direction we're going.
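To make the "error creep" point concrete: if each of six steps independently succeeds, say, 95% of the time, the whole workflow only completes cleanly about 73% of the time. A quick sketch of that arithmetic, with made-up per-step reliabilities:

```python
# Illustrative arithmetic for how per-step errors compound across a workflow.
# The 95% and the step counts are made up for illustration, not Finn's real figures.
step_success = 0.95
for steps in (1, 3, 6, 10):
    print(f"{steps:>2} steps at {step_success:.0%} each -> {step_success ** steps:.1%} end-to-end")
# 6 steps at 95% each -> about 73.5% end-to-end, which is why factoring workflows
# into smaller, individually reliable subcomponents matters so much.
```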
But yeah, it's really hard, you know; this is definitely the frontier. I think it's very easy to make a demo, but it's an awful lot of work to actually complete this long, multi-step running process in a messy, unconstrained world. I think that's still very frontier and maybe some distance away, certainly from LLMs doing it out of the box. I think you're going to need product scaffolding and building blocks and everything around that for a long time.
>> Yeah.
>> That's at least our thesis.
>> Yeah. No, I totally agree. We have basically the exact equivalent of that problem. Obviously, a common legal task is large-scale document review, so we have a product called Vault that's intended to handle these sorts of use cases. Lawyers can upload a hundred thousand files at a time, and we're in the process of increasing that to a million files at a time. And for those of you who have interacted with lawyers, theirs are not your typical documents. An actually really common use case we have for our Vault product is to upload, analyze, and extract key terms from credit agreements and loan agreements. If anyone here has worked with a loan agreement, a single one of these things can be 400,000 tokens in length, which is, for those who are wondering, longer than the Dune novel, which is enough content for two movies. Two of these things is the Lord of the Rings trilogy, right? And lawyers don't just have one or two of these lying around; there are thousands of them lying around. So you need to be able to handle that process. And so I've said from day one: AI is kind of the star of the show right now, but the real heroes of the application layer are those who are sorting out AI infrastructure, because it's all novel infrastructure as well.
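For context on why a 400,000-token agreement is an infrastructure problem and not just a prompt, here is a minimal sketch of the usual workaround: split the document into overlapping chunks that fit a model's context window, extract terms per chunk, then merge. The chunk sizes, the extract_terms stub, and the merge rule are illustrative assumptions, not a description of Harvey's Vault pipeline.

```python
# Minimal sketch of chunked key-term extraction over a document far longer than a
# model's context window. Chunk sizes, extract_terms(), and the merge rule are
# illustrative assumptions, not any vendor's actual pipeline.
from typing import Iterator

CHUNK_TOKENS = 8_000      # how much of the document each model call sees
OVERLAP_TOKENS = 500      # overlap so clauses split across a boundary aren't lost

def chunk(tokens: list[str]) -> Iterator[list[str]]:
    step = CHUNK_TOKENS - OVERLAP_TOKENS
    for start in range(0, len(tokens), step):
        yield tokens[start:start + CHUNK_TOKENS]

def extract_terms(chunk_text: str) -> dict[str, str]:
    # Stand-in for a model call that returns key terms found in this chunk.
    return {}

def review(document_text: str) -> dict[str, str]:
    tokens = document_text.split()   # crude whitespace "tokenizer", just for the sketch
    merged: dict[str, str] = {}
    for piece in chunk(tokens):
        # First mention wins here; a real pipeline would reconcile conflicts more carefully.
        for key, value in extract_terms(" ".join(piece)).items():
            merged.setdefault(key, value)
    return merged
```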
And then one thing I want to hit on, which I think you hit on, which is really unique, especially as agents are taking center stage right now: what we're seeing is infrastructure for long-running asynchronous agents to complete increasingly sophisticated tasks. That is one mode of completing complex work, but you're not guaranteed to have the same input every time or the same output every time. There's some variance, some stochasticity baked in. So what we're seeing is that you need a complementary product that can handle repeatable, deterministic units of work, so you can identify the agent trajectories that go well and then map them to building blocks that users can interact with and execute over and over and over again.
>> Yeah, that's something we find: to do these things well you need to be able to mix generative, stochastic processes with deterministic, reliable things, and the blend of that is interesting.
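A minimal sketch of what that blend can look like in practice: deterministic, checkable steps (validation, the actual side effect) wrapped around the one step that genuinely needs a generative model. The helper names and the refund scenario are hypothetical, chosen only to echo the earlier refund example.

```python
# Minimal sketch of blending a stochastic LLM step with deterministic steps.
# All helper names and the refund scenario are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class RefundRequest:
    order_id: str
    amount: float

def draft_refund_request(conversation: str) -> RefundRequest:
    # The one genuinely generative step: a model reads the conversation and
    # proposes structured output. Stubbed here with a fixed value.
    return RefundRequest(order_id="A-1001", amount=25.0)

def validate(req: RefundRequest, order_total: float) -> None:
    # Deterministic guardrails: cheap, repeatable checks on the model's proposal.
    if not req.order_id or req.amount <= 0 or req.amount > order_total:
        raise ValueError(f"refusing refund proposal: {req}")

def issue_refund(req: RefundRequest) -> None:
    # Deterministic side effect, e.g. a payments API call. Stubbed here.
    print(f"refunded {req.amount} on order {req.order_id}")

def handle(conversation: str, order_total: float) -> None:
    proposal = draft_refund_request(conversation)   # stochastic
    validate(proposal, order_total)                 # deterministic
    issue_refund(proposal)                          # deterministic

handle("customer says the mug arrived broken", order_total=30.0)
```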
of that is interesting. Okay. So, uh, essential domain complexity, you need to
essential domain complexity, you need to understand the domain. Really hard to
understand the domain. Really hard to like once you're in that domain and
like once you're in that domain and interacting with the rest of the world,
interacting with the rest of the world, getting it to actually do things
getting it to actually do things reliably is really hard. One of the
reliably is really hard. One of the things we, you know, we're talking about
things we, you know, we're talking about today, announcing today is that we're
today, announcing today is that we're tuning our own models. Like, that's the
tuning our own models. Like, that's the next bit. Even if you do that all
next bit. Even if you do that all perfectly, there is leverage, durable
perfectly, there is leverage, durable value in in tuning your own models for
value in in tuning your own models for parts of the system. Silus, I think
parts of the system. Silus, I think that's something that the cognition has
that's something that the cognition has done a lot of like that's part of like
done a lot of like that's part of like the was that part of the appeal of
the was that part of the appeal of acquiring wind surf like
acquiring wind surf like >> yeah so we um we think about training
>> Yeah, so we think about training models in a certain way. First of all, I do think there's this interesting new development of application-layer companies getting into model training, as Finn is as well. I do think the philosophy is a little bit different. At a large foundation model lab like OpenAI or Anthropic, there are these separate research orgs that do long-term research, sometimes quite far away from product. For us, the product has always been the primary goal. So we try to go backwards from the product and figure out where a custom model would actually lift some user metric or enable a new experience. And there are quite a few places across the stack where we found this to be the case. It is true that at Windsurf there had been quite a few of these. For example, most famously, the tab model. We all know Copilot back in the day pioneered this, but actually Windsurf, before it was Windsurf, was called Codeium and had one of the early products in that segment as well, which later evolved to doing multi-line edits and tab-to-jump. That continues to be a very big focus for us. But we also have SWE-1, which is basically our frontier coding agent model. It was released in May, is still one of our most popular models, and is basically powered by reinforcement learning on software engineering tasks. And this is, for us, just the beginning; there's a lot more to come on that front. Besides that, we also see a lot of potential for training specialized models for certain verticals. For example, we released the Kevin model. It's a small open-source research project; we wrote a blog post and a paper about it, where we trained a model on a specific coding vertical, which in this case was CUDA kernel writing. But there are many more of these verticals that we work on with our enterprise customers.
work on with our enterprise customers. And the other specialization that we see
And the other specialization that we see is around speed. So very often um coding
is around speed. So very often um coding agents take minutes or even tens of
agents take minutes or even tens of minutes. And sometimes this is fine if
minutes. And sometimes this is fine if you just are like in a purely like
you just are like in a purely like delegation um mode and you maybe come
delegation um mode and you maybe come back half an hour later and review. But
back half an hour later and review. But very often also there is this desire to
very often also there is this desire to be in the loop and actually drive the um
be in the loop and actually drive the um the iteration of the agent. And for
the iteration of the agent. And for these cases we find that the difference
these cases we find that the difference between waiting like 45 seconds or 10
between waiting like 45 seconds or 10 seconds can be the difference between
seconds can be the difference between switching to another tab and scrolling
switching to another tab and scrolling Twitter or actually uh waiting for the
Twitter or actually uh waiting for the agent to be done. Um,
agent to be done. Um, >> so these are the these are the trails
>> so these are the these are the trails Brad was talking about like the the
Brad was talking about like the the different dimensions like the the one
different dimensions like the the one for us is um the latency budget we have
for us is um the latency budget we have on voice is wildly different the latency
on voice is wildly different the latency budget we have on email and we can do
budget we have on email and we can do very different things in terms of
very different things in terms of accuracy and cost trade-offs there. Um,
accuracy and cost trade-offs there. Um, and it's a it's a really hard problem
and it's a it's a really hard problem like yeah the fine tune things like one
like yeah the fine tune things like one of the things I personally get most
of the things I personally get most excited about and the stuff that
excited about and the stuff that Vertical was presenting was the latency
Vertical was presenting was the latency improvements and our ability to when we
improvements and our ability to when we have the model we can control the
have the model we can control the latency a lot more directly as well. Um,
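As a toy illustration of what a per-channel latency budget can mean in practice (the channels, budgets, and model tiers below are invented for the example):

```python
# Illustrative sketch of per-channel latency budgets driving model choice.
# Voice and email tolerate very different latencies, so they can be served
# by very different model tiers.
LATENCY_BUDGET_MS = {
    "voice": 800,      # conversational turn-taking: sub-second or it feels broken
    "chat": 5_000,
    "email": 120_000,  # minutes are fine; spend them on accuracy
}

MODEL_TIERS = [
    # (name, typical_latency_ms, relative_quality)
    ("small-distilled", 400, 0.85),
    ("medium", 3_000, 0.93),
    ("large-reasoning", 45_000, 1.00),
]

def pick_model(channel: str) -> str:
    """Pick the highest-quality tier that fits the channel's budget."""
    budget = LATENCY_BUDGET_MS[channel]
    eligible = [m for m in MODEL_TIERS if m[1] <= budget]
    return max(eligible, key=lambda m: m[2])[0]

assert pick_model("voice") == "small-distilled"
assert pick_model("email") == "large-reasoning"
```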
Is Harvey also bought into this idea that, hey, there's leverage in tuning your own models?
>> Yeah, so we've been doing this since I joined two and a half years ago, when the company was six months old. I think you bring up a good point, actually, which is that RFT is super popular right now, but distillation is still a very viable option for taking the reasoning capabilities of larger models and distilling them into smaller models that can do the same task, but much cheaper and much faster.
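A minimal sketch of sequence-level distillation, assuming you already have a teacher you can sample from and a supervised fine-tuning routine for the student (both are stand-ins here, not Harvey's pipeline):

```python
# Sequence-level distillation sketch: the large model labels the task
# distribution you care about, and the small model is fine-tuned to imitate it.
def teacher(prompt: str) -> str:
    """Large reasoning model; slow and expensive."""
    raise NotImplementedError

def finetune(base_model: str, examples: list[dict]) -> str:
    """Supervised fine-tuning job; returns the new student checkpoint id."""
    raise NotImplementedError

def distill(prompts: list[str], student_base: str = "small-model") -> str:
    # 1. Have the big model label the prompts you actually care about.
    dataset = [{"prompt": p, "completion": teacher(p)} for p in prompts]
    # 2. Train the small model to imitate those outputs.
    return finetune(student_base, dataset)
```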
We've certainly had our own model training journey here. Actually, when I joined, the state-of-the-art approach for customizing models at the time was continued pre-training, or mid-training. So what we did is we literally took all of US case law, which is somewhere between 10 and 12 billion tokens, and we did next-token prediction over it, in the hopes that we would see step-change legal reasoning capabilities emerge, in the same way that these other reasoning capabilities emerge from the models.
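Mechanically, continued pre-training is just the usual next-token prediction objective run over the domain corpus. A PyTorch-flavored sketch, assuming a model that returns logits directly and a tokenized batch of in-domain text:

```python
# Continued pre-training = next-token prediction over the domain corpus.
import torch
import torch.nn.functional as F

def continued_pretraining_step(model, batch_token_ids, optimizer):
    """One step of next-token prediction over in-domain text (e.g. case law)."""
    inputs = batch_token_ids[:, :-1]            # tokens 0 .. n-2
    targets = batch_token_ids[:, 1:]            # tokens 1 .. n-1 (shifted by one)
    logits = model(inputs)                      # assumed shape: (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # flatten positions
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```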
Long story short, it worked in part; I don't think it was enough to move the needle. This was around the same time that RLHF came along. Really, the thing we want to optimize for at the end of the day is lawyer preference over outputs, and specifically partner preference over outputs. If you can gather a few thousand examples, you can train a reward model, or do DPO directly on the preference judgments, and you're off and running.
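For reference, the DPO objective on such preference pairs looks roughly like this. The sketch assumes you can compute sequence log-probabilities under the policy and a frozen reference model; beta and the data format are illustrative:

```python
# DPO on (chosen, rejected) preference pairs.
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Each argument: tensor of per-example sequence log-probabilities."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Push the policy to prefer the output the reviewer preferred.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# A preference example might look like:
# {"prompt": "...", "chosen": "partner-preferred draft", "rejected": "other draft"}
```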
I think today we're much more focused on RFT. One thing we're really interested in: if you imagine this sort of emerging AI-native stack, where you're doing inference, then model systems and tools, and then agents that operate on top of them, an area of investment for us is actually beginning to simulate some of these legal tasks end to end. So imagine the fund formation process that I described earlier. If you can simulate it in a sandboxed environment, you can actually begin to train agents to complete subtasks, or the entire process, end to end. The bottleneck there has been, and continues to be, strong verifier models.
>> I have to wrap us up.
>> Nico, I'm just really curious there: the simulation, is that using an LLM as a simulator, or is it more like a reinforcement-learning-style playground sort of simulator?
>> Yeah. So, what we're envisioning: I described the workspace concept just a few moments ago. You can allow an agent to take actions within that workspace, including the choice to interact with other humans via email. Right? So we have tools, essentially, for research, tools for drafting, and tools for human interaction. That is essentially an action space for an RL agent. The application in this workspace is the environment. And we're looking at ways that we can scale up the simulation to make it work in practice. I would say it's still forward-looking, but it's something we're investing a lot in.
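A hedged sketch of the shape such a sandboxed setup can take, where the workspace is the environment, the tools are the action space, and a verifier model scores the finished work product (every name below is hypothetical, not Harvey's actual system):

```python
# Hypothetical workspace-as-environment sketch for an RL agent.
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str        # one of: "research", "draft", "email_human", "submit"
    payload: str

@dataclass
class WorkspaceEnv:
    task: str
    documents: list[str] = field(default_factory=list)
    done: bool = False

    def step(self, action: Action) -> tuple[str, float, bool]:
        """Returns (observation, reward, done). Reward only on submission."""
        if action.tool == "research":
            self.documents.append(f"research notes on: {action.payload}")
            return self.documents[-1], 0.0, False
        if action.tool == "draft":
            self.documents.append(f"draft: {action.payload}")
            return self.documents[-1], 0.0, False
        if action.tool == "email_human":
            return "simulated reply from counterparty", 0.0, False
        if action.tool == "submit":
            self.done = True
            return "submitted", verifier_score(self.task, self.documents), True
        raise ValueError(f"unknown tool {action.tool}")

def verifier_score(task: str, documents: list[str]) -> float:
    """Stand-in for a verifier model grading the end-to-end work product."""
    raise NotImplementedError
```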
>> It's so exciting. I remember learning about reinforcement learning way back when, at a reinforcement learning startup in 2014 or something like that. Back then it was multi-armed bandits for website optimization. Now an entire line-of-business task can be the environment, or "software engineer" is an environment. It's just wild. It's exciting.
>> I'm going to wrap us up now. The thing that I keep going back to is the Alan Kay quote. I'm just thinking about the depth at all the layers of this conversation: the application layer, the AI layer, and now the model layer. Being able to have the expertise and the ability to own all of that unlocks these huge capabilities in the products that we're all building.
>> Okay. Please put your hands together for a great panel.
>> And thank you very much for coming this evening in general. Please give another round of applause for all of our great speakers: Brett, Molly, Fergal, Silas, and Nico.
And thank all of you, online and here in person, for joining us and giving us your time. It's really great. If you're interested in learning more about the technical work we've been showing today, it's at fin.ai/research online, and in person here we have technical poster sessions where you can talk to the scientists and engineers who worked on this stuff. And maybe one favor: if you can go back and talk to your customer leaders about using Finn, that would be really nice. Please hang around, have a drink. It would be great to talk to you all. Thank you very much.
[Music]