0:18 [music]
0:34 Typing thoughts into [music] the darkest
0:37 part becomes design. Words evolve
0:39 [music] to whispers meant for something
0:43 more divine. Syntax bends and breathes. I
0:45 see the language change. I'm not
0:49 instructing anymore. I'm rearranging
0:51 faith. Every loop I write [singing]
0:54 rewrites me. Every function hums with
0:57 meaning. I feel the interface dissolve
1:06 new code. Not on the screen but in the
1:08 soul where [music] thought becomes the
1:11 motion and creation takes control. No
1:15 lines no rules just balance [music] in
1:19 between the zero and the one. The
1:31 >> [music]
1:34 >> systems shape our fragile skin. They
1:36 mold [singing] the way we move. We live
1:39 inside the logic gates [music] of what
1:42 we think is true. But deep beneath the
1:44 data pulse, [music] there's something undefined.
1:49 A [singing] universe compiling the image
1:51 of our [music] minds. Every line reveals
1:54 reflection. Every loop replace [music]
1:56 connection. We're not building, we're
1:59 becoming. And the code becomes confession.
2:07 This is the [music] new code. Not on the
2:10 screen, but in the soul with thought
2:13 becomes the motion. [music] Creation
2:16 takes control. No lines, no rules, just
2:20 balance in between the [music] zero and
2:24 the one. The silence and the dream. [music]
2:39 [music]
2:45 Don't worry. [music] Uh, we're just
2:46 giving you something to do while Codex
3:00 [music] Each prompt, each breath, each
3:04 fragile spin, a universe [music] renewing.
3:12 This is the new code.
3:14 Alive and [music] undefined.
3:17 Where logic meets emotion and structure
3:20 bends to mind. [music] The system hums
3:23 eternal but the soul writes the line. We
3:27 are the new code.
3:40 I'm fired inside. [music]
3:53 [applause]
3:56 Ladies and gentlemen, please join me in
3:57 welcoming to the stage the co-founder [music]
4:00 of Morning Brew and the managing partner
4:03 of 10X, your host for the leadership
4:06 [music] track session day, Alex Lieberman.
4:14 Keep it going. Let's get a quick read of
4:16 the room. If you are coming from right
4:18 here in the Big Apple from New York,
4:20 make some noise.
4:22 Okay, now I have to say it. I assume
4:24 this is the biggest group. San Francisco.
4:29 >> Wow, that is surprising. Uh, Austin.
4:31 >> Okay, we got Austin. Who thinks they
4:34 came from the furthest place and is in
4:35 the room today?
4:37 >> Where? Where?
4:40 >> Ecuador. Can anyone beat Ecuador? [applause]
4:42 >> New Zealand.
4:44 >> I don't think anyone's going to beat New
4:47 Zealand. There we go. Well, first of
4:50 all, uh, I am so excited to welcome you
4:54 all to the AI Engineer Code Summit 2025.
4:56 Uh, I'm Alex Lieberman, co-founder of
5:00 Morning Brew and your MC for the day.
5:03 Um, now you may be wondering, why is a
5:06 newsletter guy hosting an AI engineer
5:08 conference? It's a great question. Well,
5:11 after I left my role at Morning Brew, I
5:13 asked myself one simple question, and it
5:16 was, what space do I want to spend my
5:18 time in for the next 20 years where I
5:20 can build something consequential and
5:22 spend my time with some of the smartest
5:25 people I've ever met? And the answer
5:27 became obvious. I wanted to be as close
5:29 to the frontier of AI as humanly
5:31 possible. Which is why I co-founded
5:33 10x.co, which is an AI
5:35 transformation firm helping mid-market
5:37 and enterprise companies learn how to
5:39 use AI within their business. And I
5:41 spend basically all of my time now with
5:43 AI engineers like yourselves. I'm the
5:45 only non-technical person in the
5:46 business and I wouldn't have it any
5:50 other way. So as you know this year has
5:53 been a banner year for the industry. And
5:55 I would think of today as both a look
5:58 back on where we've been as well as a
6:00 tactical view of where we are headed in
6:03 companies small and large, old and new.
6:05 We're going to hear from the labs. We'll
6:07 hear from Unicorn AI startups. We'll
6:10 hear from academics, big-time management
6:13 consultants, and Fortune 500 brands. But
6:15 before we do that, we have to give the
6:17 brands that made this day possible their
6:20 flowers. So, let's go into it. Let's
6:22 give it up for Google DeepMind, today's
6:25 presenting sponsor. [applause]
6:29 Love it. Keep it going for Anthropic,
6:32 the platinum sponsor for the day. [applause]
6:34 And then one more round of applause for
6:37 all of the gold and silver sponsors who
6:39 you can meet in the expo downstairs
6:41 throughout the day. One more. Let's keep it going.
6:47 Are you guys ready to do the damn thing?
6:49 >> Let's do it. To kick things off, let's
6:51 give a huge welcome to head of
6:53 engineering of the Claude Developer
6:56 Platform, Caitlyn Les. Let's welcome her.
7:14 Good morning. Um, so first let's give a
7:16 huge thank you to swyx and the whole AI
7:18 engineer organizing team for bringing us all together.
7:25 I'm Caitlyn and I lead the Claude
7:27 Developer Platform team at Anthropic.
7:29 Um, so let's start with a show of hands.
7:31 Who here has integrated against an LLM
7:34 API to build agents?
7:36 Okay, I'm talking to the right people.
7:38 Love it. Um, so today I want to share
7:40 how we're evolving our platform to help
7:42 you build really powerful agentic
7:45 systems using Claude.
7:47 So we love working with developers who
7:49 do what we call raising the ceiling of
7:51 intelligence. They're always trying to
7:52 be on the frontier. They're always
7:54 trying to get the best out of our models
7:56 and build the most high performing
7:58 systems. Um, and so I want to walk you
8:00 through how we're building a platform
8:01 that helps you get the best out of
8:03 Claude. Um, and I'm going to do that
8:05 using a product that you hopefully have
8:07 all heard of before. Um, it's an agentic
8:09 coding product. We love it a lot.
8:15 So when we think about maximizing
8:17 performance um from our models, we think
8:19 about building a platform that helps you
8:21 do three things. Um so first the
8:23 platform helps you harness Claude's
8:25 capabilities. We're training Claude to
8:27 get good at a lot of stuff and we need
8:29 to give you the tools in our API to use
8:31 the things that Claude is actually
8:33 getting good at. Next, we help you
8:36 manage Claude's context window. Keeping
8:38 the right context in the window at any
8:40 given time is really really critical to
8:43 getting the best outcomes from Claude.
8:44 And third, we're really excited about
8:46 this lately. We think you should just
8:48 give Claude a computer and let it do its
8:50 thing. So I'll talk about how we're
8:52 evolving the platform to give you
8:53 the infrastructure to do that.
9:00 So starting with harnessing Claude's
9:02 capabilities. Um, so we're getting
9:04 Claude really good at a bunch of stuff
9:06 and here are the ways that we expose
9:08 that to you um in our API as ideally
9:11 customizable features. So here's a first
9:14 example um relatively basic. Claude got
9:16 good at thinking um and Claude's
9:19 performance on various tasks um scales
9:20 with the amount of time you give it to
9:23 reason through those problems. Um, and
9:25 so, uh, we expose this to you as an API
9:27 feature that you can decide, do you want
9:29 Claude to think longer for something
9:31 more complex or do you want Claude to
9:33 just give you a quick answer. Um, we
9:36 also expose this with a budget. Um, so
9:38 you can tell Claude how many tokens to
9:40 essentially spend on thinking. Um, and
9:42 so for Claude Code, um, pretty good
9:44 example. Obviously, you're often
9:46 debugging pretty complex systems with
9:49 Claude Code, or sometimes you just want a
9:50 quick, um, answer to the thing you're
9:53 trying to do. And so, um, Claude Code
9:54 takes advantage of this feature in our
9:57 API to decide whether or not to have
10:00 Claude think longer.
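The thinking control described above can be sketched as request parameters. A minimal sketch: the model id and budget values are illustrative, and the `thinking` field with `budget_tokens` follows the Messages API's extended-thinking option:

```python
def build_request(prompt: str, complex_task: bool) -> dict:
    """Assemble kwargs for client.messages.create(); values are illustrative."""
    kwargs = {
        "model": "claude-sonnet-4-5",  # assumed model id
        "max_tokens": 16000,
        "messages": [{"role": "user", "content": prompt}],
    }
    if complex_task:
        # Extended thinking: let Claude reason longer, capped by a token budget.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    return kwargs

# A complex debugging request gets a thinking budget; a quick one does not.
slow = build_request("Debug this intermittent test failure", complex_task=True)
fast = build_request("Rename this variable", complex_task=False)
```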
10:02 Another basic example is tool use.
10:04 Claude has gotten really good at
10:07 reliably calling tools. Um, so we expose
10:09 this in our API with both our own
10:12 built-in tools like our web search tool,
10:14 um, as well as the ability to create
10:16 your own custom tools. You just define a
10:18 name, a description, and an input
10:20 schema. Um, and Claude is pretty good at
10:22 reliably knowing when to actually go um,
10:24 and call those tools and pass the right
10:26 arguments. So, this is relevant for
10:29 Claude Code. Claude Code has many, many,
10:31 many tools and it's calling them all the
10:33 time to do things like read files,
10:36 search for files, write to files, um,
10:38 and do stuff like rerun tests and otherwise.
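A custom tool, as described, is just three fields passed to the API. A minimal sketch; the `read_file` tool name, description, and schema here are hypothetical examples, not Claude Code's actual definitions:

```python
# Hypothetical custom tool definition: a name, a description, and a
# JSON-Schema input spec, as the talk describes.
read_file_tool = {
    "name": "read_file",
    "description": "Read a UTF-8 text file from the workspace and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Workspace-relative file path"},
        },
        "required": ["path"],
    },
}

# This dict would be passed as tools=[read_file_tool] in a Messages API call;
# when Claude emits a tool_use block, your code runs the tool and replies
# with a tool_result block.
```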
10:42 So, the next way we're evolving the
10:44 platform to help you maximize
10:46 intelligence from Claude, um, is helping
10:48 you manage Claude's context window.
10:50 Getting the right context at the right
10:52 time in the window is one of the most
10:53 important things that you can do to
10:56 maximize performance.
10:58 But context management is really complex
11:00 to get right. Um especially for a coding
11:03 agent like Claude Code. You've got your
11:04 technical designs, you've got your
11:06 entire code base. Um you've got
11:08 instructions, you've got tool calls. All
11:10 these things might be in the window at
11:12 any given time. And so how do you make
11:14 sure the right set of those things are
11:16 in the window? Um, so getting that
11:18 context right and keeping it optimized
11:19 over time is something that we've
11:22 thought a lot about.
11:25 So let's start with MCP model context
11:27 protocol. We introduced this a year ago
11:28 and it's been really cool to see the
11:32 community swarm around adopting um MCP
11:34 as a standardized way for agents to
11:37 interact with external systems. Um, and
11:40 so for Claude Code, you might imagine
11:42 GitHub or Sentry. There are plenty of
11:44 places kind of outside of the agent's
11:46 context where there might be additional
11:48 information or tools or otherwise that
11:50 you want your agent, or the Claude Code
11:52 agent, to be able to
11:54 interact with. Um, and so
11:55 this will obviously get you much better
11:57 performance than an agent that only sees
11:59 the things that are already in its window.
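Wiring an external system like GitHub into an agent via MCP can look like the sketch below. The server name, npm package, and environment variable are placeholder assumptions; the `mcpServers` layout follows the common JSON config convention used by MCP clients:

```python
import json

# Hypothetical MCP client configuration registering a GitHub MCP server.
# The client launches the server as a subprocess over stdio and exposes the
# server's tools and context to the agent.
config = {
    "mcpServers": {
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],  # placeholder package
            "env": {"GITHUB_TOKEN": "${GITHUB_TOKEN}"},  # placeholder secret reference
        }
    }
}

# Serialized, this is what a client config file (e.g. a .mcp.json) would hold.
print(json.dumps(config, indent=2))
```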
12:05 Uh, so the next thing is memory. So, while
12:07 you can use tools like MCP to get
12:10 context into your window, we introduced
12:12 a memory tool to help you actually keep
12:14 context outside of the window that
12:16 Claude knows how to pull back into the
12:18 window only when it actually needs it.
12:20 Um, and so we introduced the first
12:22 iteration of our memory tool as
12:24 essentially a clientside file system.
12:26 So, you control your data, but Claude is
12:28 good at knowing, oh, this is like a good
12:30 thing that I should store away for
12:32 later. And then, uh, it knows when to
12:34 pull that context back in. So for Claude
12:37 Code, you could imagine um your patterns
12:39 for your codebase or maybe your
12:41 preferences for your git workflows.
12:42 These are all things that Claude can
12:45 store away in memory and pull back in later.
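The talk describes the first memory tool as "essentially a client-side file system." A toy stand-in under that assumption, not Anthropic's actual tool interface:

```python
import tempfile
from pathlib import Path

class FileMemory:
    """Toy client-side memory store: the data stays on your disk, and the
    model only issues read/write commands that your code executes for it."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, text: str) -> None:
        (self.root / name).write_text(text)

    def read(self, name: str) -> str:
        path = self.root / name
        return path.read_text() if path.exists() else ""

# e.g. the model decides a git-workflow preference is worth remembering:
mem = FileMemory(tempfile.mkdtemp())
mem.write("git_workflow.md", "Rebase feature branches; squash on merge.")
```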
12:50 And so the third thing is context
12:52 editing. If memory helps you keep stuff
12:54 outside the window and pull it back in
12:57 when it makes sense, context editing
12:59 helps you clear stuff out that's not
13:00 relevant right now and shouldn't be in
13:02 the window. Um, so our first iteration
13:04 of our context editing is just clearing
13:07 out old tool results. Um, and we did
13:08 this because tool results can actually
13:10 just be really large and take up a lot
13:12 of space in the window. And we found
13:14 that tool results from past calls are
13:16 not necessarily super relevant to help
13:19 Claude get good responses later on in a
13:20 session. And so you can think about, for
13:23 Claude Code: it is calling hundreds
13:26 of tools. Um, those files that it read and
13:27 otherwise, all these things are taking
13:30 up space within the window. Um, so it
13:32 takes advantage of context management.
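The clear-old-tool-results strategy can be sketched over a plain message list. This is an illustrative stand-in, not the API's actual context-editing implementation:

```python
def clear_old_tool_results(messages, keep_last=2, placeholder="[cleared to save context]"):
    """Blank out the content of all but the newest `keep_last` tool results,
    freeing window space while keeping the conversation shape intact."""
    seen = 0
    for msg in reversed(messages):  # walk newest messages first
        for block in msg.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_result":
                seen += 1
                if seen > keep_last:
                    block["content"] = placeholder
    return messages

# Four tool results: after editing, only the two most recent survive intact.
history = [
    {"role": "user", "content": [{"type": "tool_result", "content": f"result {i}"}]}
    for i in range(4)
]
clear_old_tool_results(history)
```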
13:39 And so um we found that if we combined
13:42 our memory tool with context editing, we
13:46 saw a 39% bump in performance over
13:49 the benchmark on our own internal evals.
13:51 Um which was really really huge. And so
13:52 it just kind of shows you the importance
13:54 of keeping only the things in the window
13:57 that are relevant at any given time. And
13:59 we're expanding on this by giving you
14:01 larger context windows. So for some of
14:03 our models, you can have a million token
14:05 context window. Combining that larger
14:07 window with the tools to actually edit
14:09 what's in your window maximizes your
14:11 performance. Um, and over time, we're
14:12 teaching Claude to get better and better
14:14 at actually understanding what's in its
14:17 context window. So maybe it has a lot of
14:18 room to run, maybe it's almost out of
14:20 space. Um, and Claude will respond
14:23 accordingly depending on how much room
14:25 it has left in the window.
14:30 So, here's the third thing. Um, we think
14:32 you should give Claude a computer and
14:33 just let it do its thing. We're really
14:35 excited about this one. Um, because
14:37 there's a lot of discourse right now
14:39 around agent harnesses. Um, you know,
14:41 how much scaffolding should you have?
14:43 How opinionated should it be? Should it
14:46 be heavy? Should it be light? Um, and I
14:49 think at the end of the day, Claude has
14:50 access to writing code. And if Claude
14:53 has access to running that same code, it
14:54 can accomplish anything. You can get
14:56 really great professional outputs for
14:57 the things that you're doing just by
15:00 giving Claude runway to go and do that.
15:01 But the challenge in letting you do
15:03 that is actually the infrastructure, as
15:05 well as stuff like expertise: how do
15:07 you give Claude access to things so that,
15:09 when it's using a computer, it will get
15:12 you better results?
15:14 So a fun story is we recently launched
15:17 Claude Code on web and mobile. Um, and
15:18 this was a fun project for our team
15:20 because we had a lot of problems to
15:22 solve. When you're running Claude Code
15:24 locally, Claude Code is essentially using
15:27 your machine as its computer. But if
15:29 you're starting a session on the web or
15:31 on mobile and then you're walking away,
15:32 what's happening? Like, where is
15:34 Claude Code running? Where is
15:37 it doing its work? Um and so we had some
15:39 hard problems to solve. We needed a
15:41 secure environment for Claude to be able
15:42 to write and run code that's not
15:45 necessarily, like, code approved by you.
15:47 Um, we needed to solve container
15:50 orchestration at scale. Um, and we needed
15:52 session persistence um because uh we
15:54 launched this and many of you were
15:55 excited about it and started many many
15:57 sessions and walked away and we had to
15:59 make sure that um all of these things
16:01 were ready to go when you came back and
16:02 um wanted to see the results of what
16:05 Claude did.
16:08 So one key primitive in this is our code
16:10 execution tool. Um so we released our
16:13 code execution tool in the API, um, which
16:15 allows Claude to write code and run
16:17 that code in a secure sandboxed
16:20 environment. Um, so our platform handles
16:22 containers, it handles security, and you
16:23 don't have to think about these things
16:25 because they're running on our servers.
16:28 Um, so you can imagine deciding that, um,
16:30 you want Claude to write some code
16:32 and you want Claude to go and be able to
16:34 run that code. And for Claude Code,
16:36 there's plenty of examples here. Um,
16:38 like 'make an animation more sparkly,'
16:40 where, uh, you want Claude to actually be able
16:42 to run that code. Um, so we really think
16:44 the future of agents is letting the
16:46 model work pretty autonomously within a
16:47 sandbox environment and we're giving you
16:49 the infrastructure to be able to do that.
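What "write code and run it" means mechanically can be sketched locally. This toy runner is only a stand-in for the platform's server-side sandbox described above; a real sandbox needs containers, network isolation, and resource limits:

```python
import subprocess
import sys
import tempfile

def run_model_code(code: str, timeout: float = 5.0) -> str:
    """Toy stand-in for a code-execution tool: run model-written Python in a
    separate process with a timeout, and return the captured stdout."""
    # Write the untrusted code to a temp file so it runs outside this process.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout

output = run_model_code("print(2 + 2)")  # → "4\n"
```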
16:54 And this gets really powerful once you
16:56 think about giving the model actual
16:58 domain expertise in the things that
16:59 you're trying to do. So we recently
17:01 released agent skills which you can use
17:04 in combination with our code execution
17:06 tool. Skills are basically just folders
17:09 of scripts, instructions, and resources
17:11 that Claude has access to and can decide
17:14 to run within its sandbox environment.
17:16 Um, it decides to do that based on the
17:18 request that you gave it as well as the
17:20 description of a skill. Um, and Claude
17:22 is really good at knowing like this is
17:24 the right time to pull this skill into
17:26 context and go ahead and use it. And you
17:29 can combine skills with tools like MCP.
17:31 So MCP gives you access to tools and
17:34 access to context. Um, and then skills
17:35 give you the expertise to actually make
17:37 use of those tools and make use of that
17:40 context. Um, and so for Claude Code, a
17:42 good example is web design. Maybe
17:44 whenever you launch a new product or a
17:46 new feature, um, you build landing
17:47 pages. And when you build those landing
17:49 pages, you want them to follow your
17:51 design system and you want them to
17:53 follow the patterns that you've set out.
17:56 Um, and so Claude will know, okay, I'm
17:57 being told to build a landing page. This
17:59 is a good time to pull in the web design
18:02 skill, um, and use the right patterns and
18:04 design system for that landing page.
18:06 Uh tomorrow Barry and Mahes from our
18:08 team are giving a talk on skills.
18:10 They'll go much deeper and I definitely
18:14 recommend checking that out.
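A skill, as described, is just a folder of instructions and resources. A minimal sketch of the web-design example from the talk; the file names and frontmatter layout here are illustrative assumptions:

```python
import tempfile
from pathlib import Path

# Build a hypothetical "web-design" skill folder: a description the model can
# match requests against, plus instructions and a resource it can use.
skill_dir = Path(tempfile.mkdtemp()) / "web-design"
skill_dir.mkdir(parents=True)

(skill_dir / "SKILL.md").write_text(
    "---\n"
    "name: web-design\n"
    "description: Build landing pages using our design system and patterns.\n"
    "---\n"
    "1. Use the tokens in design-tokens.css.\n"
    "2. Follow the hero / features / call-to-action page pattern.\n"
)
(skill_dir / "design-tokens.css").write_text(":root { --brand: #6c5ce7; }\n")
```

When a request matches the description ("build a landing page"), the agent pulls the folder's instructions into context and uses the bundled resources.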
18:15 So these are the ways that we're
18:17 evolving our platform um to help you
18:19 take advantage of everything that Claude
18:21 can do to get the absolute best
18:22 performance for the things that you're
18:24 building. First, harnessing Claude's
18:27 capabilities. So, as our research team
18:29 trains Claude, we give you the API
18:30 features to take advantage of those
18:33 things. Next, managing Claude's context.
18:35 It's really, really important to keep
18:37 your context window clean with the right
18:40 context at the right time. And third,
18:41 giving Claude a computer and just letting it do its thing.
18:46 So, we're going to keep evolving our
18:49 platform. Um, as Claude gets better and
18:51 has more capabilities and gets better at
18:53 the capabilities it already has, we'll
18:55 continue to evolve the API around that
18:57 so that you can stay on the frontier and
18:59 take advantage of the best that Claude
19:04 has to offer. Um, second, as uh, memory
19:06 and context evolve, we're going to up
19:08 the ante on the tools that we give you
19:10 in order to let Claude decide what to
19:12 pull in, what to store away for later,
19:13 and what to clean out of the context
19:15 window. And third, we're really going to
19:18 keep leaning into agent infrastructure.
19:19 Some of the biggest problems with the
19:21 idea of just let Claude have a computer
19:23 and do its thing are those problems that
19:25 I talked about around orchestration,
19:27 secure environments, and sandboxing. And
19:29 so we're going to keep working um to
19:32 make sure that those are um ready for
19:35 you to take advantage of.
19:37 Um and I'm hiring. We're hiring at
19:39 Anthropic. We're really growing our
19:41 team. Um, and so if you're someone who
19:44 loves um, building delightful developer
19:46 products um, and if you're excited about
19:47 what we're doing with Claude, we would
19:50 love to work with you across eng, product,
19:53 design, um, DevRel, lots of functions. So
19:56 please reach out to us
20:09 Our next [music] presenter is the
20:13 president and head of AI at Replit. He's
20:15 here to speak about building the future
20:17 of coding. Please join me in welcoming
20:32 All right, good morning everyone. So at
20:35 Replit we're building a coding agent for
20:38 nontechnical users. It's a very peculiar
20:39 challenge I would say compared to many
20:41 people in this room. And what I'm going
20:43 to talk about today is why autonomy has
20:46 become kind of the northstar that we
20:47 keep chasing you know since we launched
20:49 the very first version of Replit Agent
20:52 in September last year.
20:56 Let's start from this very interesting
20:59 plot in case my clicker worked which now
21:01 does. Um I'm sure you all have seen it.
21:04 you know, the 'valley of death' plot
21:06 published by swyx a few weeks ago, and it
21:08 kind of clarified a bit the landscape
21:11 you know for all of us uh agent builders
21:14 on one hand you have the low latency
21:15 interactions that really allow you to
21:17 stay in the loop you know so you can do
21:19 deep work and focus really on the on the
21:21 coding task at hand but you need to be
21:23 an expert you need to know exactly what
21:25 to prompt the model for, and you need to
21:26 understand quickly if you want to accept
21:29 the changes or not then for several
21:31 months, many of us, including Replit,
21:34 kind of lived in this valley,
21:36 where the agent wasn't autonomous enough
21:39 to really delegate a task and come back
21:41 and see it accomplished, but at the same
21:44 time it ran long enough to take you out of
21:46 the zone, out of the loop. Luckily,
21:48 over time, we managed to go all the way
21:50 on the right and now we have agents that
21:52 run for several hours in a row. What
21:54 I'm going to be arguing today, and I
21:56 hope it's not going to stop you inviting me
21:58 back to this event, is the fact that there is an
22:00 additional dimension like a third
22:02 dimension to this plot that, you know,
22:04 hasn't been covered here, namely the
22:06 fact of how do we build autonomous
22:10 agents for nontechnical users.
22:12 So what I'm going to be arguing today is
22:14 that there are two types of autonomy.
22:17 One of it is more supervised. So think
22:20 of the you know Tesla FSD example. When
22:22 you sit in a Tesla, you're still
22:24 expected to have a driving license.
22:25 You're going to be sitting in front of
22:28 the steering wheel. Perhaps 99% of the
22:29 time, you're not going to use it, but
22:31 you're there in order to take care of
22:34 the longtail events. And similarly, a
22:36 lot of the coding agents that we have
22:38 today require you to be technically
22:41 savvy in order to use them correctly.
22:44 We at Replit, and uh, other companies at
22:46 this point are focusing on kind of the
22:48 Waymo experience for autonomous coding
22:51 agents. So you're expected to sit in the
22:53 back. You don't even have access to the
22:55 steering wheel. And I expect you
22:56 basically not to need any driving
22:59 license. Uh why is this important?
23:01 Because we want to empower every
23:03 knowledge worker to create software. And
23:05 I can't expect knowledge workers to know
23:07 what kind of technical decisions an
23:08 agent should be making. We should
23:10 completely offload that level of
23:12 complexity away from them.
23:14 Of course, it took a while to get here.
23:16 So I'm I'm sure what I'm showing you
23:18 here is something that all of you are
23:20 very familiar with. It took several
23:24 years to go from, I don't know, maybe less
23:25 than a minute of feedback loop, constant
23:27 supervision, and talking about
23:28 completions and talking about
23:30 assistants. These are the areas where the
23:33 AI-powered IDEs have really been pioneering
23:37 this type of user interaction. Then we
23:39 slowly climbed through you know higher
23:41 levels of autonomy. So we had the first
23:43 version of the agents based on ReAct.
23:45 So we concocted autonomy with a very
23:49 simple paradigm on top of LLMs. Then
23:51 luckily AI providers understood that tool
23:53 calling was extremely important and poured a
23:55 lot of effort into that. So we built the
23:57 next version of agents with native tool
23:58 calling. And then I would say there is a
24:01 third generation of agents which I call
24:03 autonomous and that's when we started to
24:05 break the barrier of say one hour of
24:07 autonomy. Basically the the agent being
24:09 capable of running on long horizon tasks
24:12 and remaining coherent. It happens to be
24:13 the case that those are also the
24:14 versions of rapid agent that we launched
24:17 over the last year. So V3 is the one
24:19 that we launched a couple of months ago,
24:21 and it showcases exactly those
24:24 properties. So the question for today is
24:26 can we actually build fully autonomous
24:29 agents and how do we get there.
24:32 So I'm going to try to redefine the
24:33 definition of autonomy today. I think
24:36 that oftentimes we conflate autonomy
24:38 with the concept of something running
24:41 for a long time, and usually, as a
24:45 user, you lose control. In reality, the
24:47 autonomy that I want to give to
24:50 agents can be very specifically scoped,
24:53 and what I mean by that is especially
24:55 with Replit Agent 3, what we accomplish
24:57 is we make sure that our agent takes all
24:59 the technical decisions. Of course,
25:02 that could lead to very long gaps between
25:03 the different user interactions, in which
25:05 case the agent, again, runs for several
25:07 hours. But this happens if and only if
25:09 the scope of the task you're giving to
25:12 the agent is really broad. And it turns
25:13 out that in reality you can have an
25:15 agent that is really autonomous and is
25:18 still fast as long as you give it a very
25:19 narrow scope for the task, you know, at
25:23 hand. So what we can accomplish in this
25:25 way is that the user still maintains
25:26 control on the aspects that they care
25:28 about and a user cares about what
25:30 they're building. Especially again our
25:31 users, knowledge workers, they don't
25:34 care about how something has been built.
25:35 They just want to see their goals to be
25:38 accomplished. So autonomy should not be
25:41 basically conflated with long run times.
25:44 And similarly, it shouldn't become a
25:46 vanity metric. You know, a lot of us are
25:48 talking about it as a as a badge of
25:49 honor. And it's definitely been exciting
25:51 to see in the last few months that you
25:53 know many of us broke the barrier of uh
25:55 running several hours in a row. But I
25:58 think in terms of how to build agents
25:59 that are going to be more powerful and
26:01 more suitable in the future, we kind of
26:04 have to change a bit, uh, the target
26:06 metric that we keep in mind.
26:09 So think about it in this way. Tasks
26:11 have a natural level of complexity and
26:13 basically what we care about is that
26:15 they have a minimum irreducible amount
26:18 of work that they express. What agents
26:19 do is that they always go through this
26:21 loop of planning, implementing and
26:24 testing. And of course to make this
26:25 happen and to make it work correctly,
26:27 you want this work to be happening over
26:30 a long, coherent trajectory. So our goal is
26:33 to maximize the irreducible runtime of the
26:36 agent. By irreducible, I mean having a
26:37 span of time where the user doesn't have
26:40 to make any technical decisions and the
26:42 agent can accomplish the task again in
26:44 full autonomy. This is especially
26:46 important for us because I can't trust
26:48 our users to make technical decisions.
26:50 So they need a proper technical
26:52 collaborator by their side. I want to
26:55 abstract away as much complexity as
26:56 possible from the process of software
27:00 creation. And last but not least, I want
27:02 the users to feel in control of what
27:05 they're creating without stifling their
27:06 creativity, because they'd otherwise also have to
27:08 think about the technical decisions that
27:10 the agent is making.
27:13 So now what are the pillars of autonomy?
27:15 How are we making this happen? I would
27:17 say there are three pillars that are
27:19 extremely important to think about. The
27:21 first one is of course the capabilities
27:23 of frontier models like the baseline IQ
27:26 that we inject in the main agentic loop.
27:28 I'm going to leave this as an exercise
27:29 to the reader and to other people in the
27:31 room. I'm really glad a lot of you are
27:33 building amazing models that you know we
27:35 use all the time at Replit. So this is
27:37 the pillar number one. The second pillar
27:40 is verification. It's very important
27:43 that we test for local correctness of
27:45 our agent at every step that it takes
27:47 and the reason is fairly intuitive. If
27:48 you are building on very shaky
27:50 foundations, eventually the castle will
27:54 topple down. So we brought verification
27:56 in the loop to make sure that in a sense
27:57 you are having you know nines or
27:59 reliability whereing the compounding
28:01 errors that an agent will make
28:03 unavoidably if you know you don't put
28:05 any control on it. And last but not
28:07 least, you heard it on stage even
28:08 earlier. I'm sure you are going to be
28:09 hearing this you know the entire day or
28:11 the entire duration of the conference.
28:14 Uh the importance of context management.
28:16 So on one end you want to have an agent
28:17 that is capable of being globally
28:19 coherent, so it's aligned with the intent
28:21 of the user, the expectations of the user,
28:23 but at the same time it also has to be
28:25 capable of managing both the high-level
28:27 goal and the single task that the agent
28:29 is working on. I think we made amazing
28:31 progress in the last months on context
28:33 management. But I'm also excited to see
28:36 you know where we're going as a field.
28:38 Let's start from the first pillar that
28:40 we work actively at rapid which is verification.
28:42 verification.
28:45 So why did we focus on this? Over the
28:49 know last year we realize something that
28:51 I think each one of you has experienced.
28:53 So without testing agents build a lot of
28:56 painted doors. In our case the painted
28:58 doors are very visible because we create
29:00 a lot of web applications. So you end up
29:02 basically trying to click on a button
29:04 and the handler is not hooked up, or some
29:06 of the data that we're showing is
29:08 actually mock data, and it's not coming
29:10 from a database. But in
29:11 general this phenomenon spans you know
29:13 across every type of component you're
29:16 building being it front end or back end
29:17 a lot of components are actually not
29:21 fully fleshed uh by the agent. So we run
29:22 some evaluations internally. We found
29:25 out that more than 30% of the individual
29:27 features happen to be broken know the
29:29 first time that are cooked by the agent.
29:32 And that also means that almost every
29:34 applications have at least one broken
29:37 feature or painted door. They're hard to
29:40 find. The reason is users are not going
29:42 to spend time testing every single
29:44 button, every single field. And this is
29:47 also probably one of the reasons why a
29:49 lot of our users, especially the
29:51 nontechnical ones, still can't trust
29:53 coding agents very much. They are
29:54 shocked when they find that there is a
29:57 painted door out there. So, how do we
29:59 solve this problem?
30:01 Fundamentally, agents must
30:03 gather all the feedback that they need
30:05 from their environment, right? It's
30:08 easier said than done. Um again
30:10 nontechnical users not only cannot make
30:12 technical decisions but also they cannot
30:14 provide the technical feedback that you
30:16 know, an agent requires to make
30:18 progress. The most they can do is
30:20 basic, you know, quality assurance
30:22 testing. They can literally go around
30:24 the UI, click, and interact with the
30:26 application. I'm sure you have tried
30:28 it in your life. This is extremely
30:30 tedious to do and it leads to a very bad
30:32 user experience. And even though we
30:34 relied on that with our first release of
30:36 the agent last year, quickly we found
30:38 out that users don't want to spend time
30:40 doing testing. So we had to find a
30:42 complete, you know, orthogonal solution
30:45 to that which is autonomous testing and
30:47 it solves several different issues. The
30:50 first one is it breaks the feedback
30:52 bottleneck. Even if again we ask
30:54 feedback to the user, we were not given
30:56 enough of that. Now we don't have to
30:58 wait anymore for human feedback. we have
31:00 a way to elicit as much information as
31:03 possible from the app autonomously. We
31:05 also want to prevent the accumulation of
31:07 small errors. As I was saying before,
31:08 we don't want to have compounding errors
31:10 while the agent is building. And last
31:12 but not least, we have to overcome the
31:14 laziness of frontier models. So we need
31:16 to verify that whenever a model tells us
31:18 that a task has been completed, it is
31:20 actually true and that result has
31:23 not been hallucinated.
31:25 There is a wide spectrum of code
31:27 verification that you can
31:29 accomplish. I think we all started from
31:31 the very left: basic
31:33 static code analysis with LSPs. We have
31:35 been executing code since we had
31:37 LLMs that were capable of
31:39 debugging, and then we slowly started to
31:41 move towards the right. Generating
31:43 unit tests and running them has a
31:45 limitation: it's limited to
31:47 functional correctness. Unit testing,
31:49 by definition, is not very powerful for
31:52 proper integration testing. We
31:54 also now started to do API testing, but
31:56 it's only limited to API code. You
31:59 can test the endpoints of an application;
32:01 you can't really test how a web app
32:04 functions and looks. And for this
32:07 reason, in the last few months, Replit and
32:09 other companies are putting a lot of
32:11 effort into creating autonomous
32:13 testing based on the browser, in
32:14 case the app being built is a
32:16 web application. There are two main
32:18 categories here. One is computer use.
32:20 It's a one-to-one mapping with the user
32:22 interface, so the model is directly
32:24 interacting with the application. It
32:26 requires screenshots, and it tends to be
32:28 fairly expensive and fairly slow. I'm
32:31 sure you have tested it yourself. A good
32:33 middle way is browser use, where
32:36 we simulate the user interface. You can
32:38 then interact with the browser and with
32:40 the web application, and it relies on
32:41 accessing the DOM through abstractions.
32:46 So how do we make this work in
32:49 Replit? What we do is
32:51 generate applications that are amenable
32:54 to testing, and we sort of merge
32:56 everything together from the previous
32:59 slides that I showed you. We allow
33:01 our testing agent to interact with
33:03 an application and gather screenshots in
33:05 case nothing else has worked, so we have a
33:07 fallback to computer use. But the vast
33:09 majority of the time, we
33:11 have programmatic interactions with the
33:12 applications. So we interact with the
33:15 database, we read the logs, we do API
33:18 calls, we literally click on the app and
33:20 get back all the information that we
33:21 need. And by putting all of this
33:24 together, we collect enough feedback
33:27 that allows our agent both to make
33:29 progress and also to fix all the painted
33:32 doors that it encounters.
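The multi-channel feedback gathering described above can be sketched in a few lines of Python. This is an illustrative toy, not Replit's actual implementation: the `FakeApp` class and every method and field name on it are assumptions made for the example.

```python
# Illustrative sketch: a testing agent gathers feedback from several channels
# (logs, database, API, UI clicks) and only falls back to a screenshot
# (computer use) when every cheaper channel came up empty.

class FakeApp:
    """Stand-in for a running web app; every method here is hypothetical."""
    def read_logs(self):
        return ["ERROR: handler for #submit not registered"]
    def query(self, sql):
        return 0            # zero rows -> the data on screen must be mocked
    def call(self, method, path):
        return {"status": 200}
    def click(self, selector):
        return None         # the click produced no observable effect
    def screenshot(self):
        return b"<png bytes>"

def gather_feedback(app):
    feedback = {
        "logs": app.read_logs(),
        "db": app.query("SELECT COUNT(*) FROM orders"),
        "api": app.call("GET", "/health"),
        "ui": app.click("#submit"),
    }
    # Expensive fallback only if no programmatic channel returned a signal.
    if all(not v for v in feedback.values()):
        feedback["screenshot"] = app.screenshot()
    return feedback

def find_painted_doors(feedback):
    """Flag the symptoms described in the talk: dead handlers, mock data."""
    issues = []
    if feedback["ui"] is None:
        issues.append("button click had no effect (painted door)")
    if feedback["db"] == 0:
        issues.append("UI shows data but the database is empty (mock data)")
    issues += [line for line in feedback["logs"] if "ERROR" in line]
    return issues
```

Combining the channels this way is what lets the agent both make progress and detect painted doors without waiting on a human tester.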
33:36 Just a short technical deep dive on
33:38 how we accomplish this. I'm sure you
33:41 have seen a lot of the tool-based
33:43 browser use. There are amazing libraries
33:46 out there; the first that comes to mind is
33:48 Stagehand. The idea is that you have an agent
33:50 with a few very generic tools
33:53 exposed: the agent can create a
33:56 new tab, can click, can fill forms, etc.
33:58 The limitation here is that it's
34:00 difficult to enumerate all the different
34:02 types of interactions you could be having
34:04 with a browser. The problem of testing
34:07 is very similar to the Tesla analogy I
34:09 was making before. Maybe this cardinality
34:12 of tools available is enough for 99% of
34:14 the interaction types. But then there is
34:17 always a long tail of idiosyncratic
34:18 interactions that a user makes
34:20 with a web application that are hard to
34:23 map into these different tool
34:26 calls. So what we do in our case at
34:30 Replit is we directly write Playwright code,
34:32 and Playwright code is first of all very
34:35 amenable to LLMs. LLMs are kind of
34:36 amazing at writing Playwright. You know,
34:38 this has been our experience
34:40 since we started to work on this project.
34:43 It is also very powerful and expressive, so
34:45 in a sense it's a superset of what you
34:48 can express compared to the
34:51 tool-based testing on the left. And last
34:53 but not least, there is beauty in
34:55 creating Playwright code because you can
34:57 reuse those tests. The moment you write
34:59 a test as a script, you can rerun it
35:00 as many times as you want. So in a
35:02 sense, the moment you create a test,
35:04 you're also creating a regression test
35:06 suite that you can keep running in the
35:10 future. And all these tricks
35:12 that I explained to you right now
35:14 helped us create something that is
35:16 roughly an order of magnitude cheaper and
35:18 faster compared to computer use. And
35:20 we'll go back later on how important
35:22 latency is.
35:24 The second pillar
35:25 that I wanted to talk about today, of
35:26 course, is context management. And I'm
35:29 going to go very fast here because I
35:30 think you're going to be hearing a lot
35:33 of talks today about it. The high-
35:36 level message here is that long-context
35:38 models are not needed to work on coherent,
35:40 long trajectories. From
35:42 experience, we found that most
35:44 tasks, even the more ambitious ones, can be
35:47 accomplished within 200,000 tokens.
35:49 So we're still not in a world where
35:52 working with models that have 10 million
35:54 or 100 million uh context windows is
35:56 necessary to actually run autonomous
35:59 agents. And we accomplish this by
36:01 learning how to do context management
36:04 correctly. So first of all, there are
36:06 several different ways to maintain state
36:09 which don't imply stuffing all the state
36:11 into your context window. You can do
36:13 that, for example, by using the codebase
36:15 itself to maintain state: you can
36:18 write documentation while the agent is
36:20 creating new code. You can also take
36:22 the plan description and all the
36:23 different task lists that the agent is
36:25 working on and persist them on the
36:27 file system. So you have a
36:29 lot of ways to offload your memories.
36:30 And last but not least, and this is
36:32 something I think Anthropic has
36:35 been really evangelizing, you
36:37 can even dump your memories directly to
36:39 the file system and then make sure
36:41 that your agent decides when to read
36:42 them back the moment they become
36:45 relevant to your work. So for this
36:46 reason we have been seeing a lot of
36:48 announcements in the last couple of
36:50 months. I just picked this one from
36:52 Anthropic: with Claude Sonnet 4.5,
36:56 they have been able to
36:59 run a focused task for more than 30 hours
37:01 in a row. We have seen similar results
37:04 from OpenAI on math problems. So I
37:06 think we kind of broke the barrier of
37:08 running for long while being able
37:10 to keep tasks coherent.
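The state-offloading techniques described above (plan files, task lists, memories persisted to disk and read back on demand) can be sketched as a tiny helper. This is an illustrative toy under assumed file names (`PLAN.md`, one JSON file per memory), not Replit's or Anthropic's actual implementation.

```python
# Sketch: keep agent state OUT of the context window by persisting it to the
# file system, and only read a memory back when it becomes relevant.

import json
from pathlib import Path

class FileMemory:
    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def save_plan(self, tasks: list[str]):
        # The plan and task list live in the repo, not in the prompt.
        (self.root / "PLAN.md").write_text(
            "\n".join(f"- [ ] {t}" for t in tasks))

    def remember(self, key: str, note: str):
        # Dump a memory to disk instead of carrying it in every prompt.
        (self.root / f"{key}.json").write_text(json.dumps({"note": note}))

    def recall(self, key: str) -> str:
        # The agent reads a memory back only when it needs it.
        return json.loads((self.root / f"{key}.json").read_text())["note"]
```

The point is that each prompt only needs the handful of memories relevant right now, which is how long trajectories fit inside a 200K-token window.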
37:12 I would say the key ingredient to make
37:15 this happen has been how good models
37:17 and agent builders have become at
37:20 doing sub-agent orchestration. Sub-agents
37:22 work like this: they are
37:24 invoked from the core loop, starting
37:26 from a blank
37:28 slate, a completely fresh
37:30 context. You as an agent builder decide
37:32 what subset of the context to inject
37:35 when the sub-agent starts. And it's a
37:36 concept that is very familiar, I think, to
37:38 everyone who's been writing software
37:39 in the last decades: separation
37:42 of concerns. You decide what your sub-
37:43 agent is going to be working on. You
37:44 give it the least possible amount of
37:46 context. You allow it to run to
37:48 completion. You only get the output, the
37:50 results. You inject them back into the
37:52 main loop and you keep running in this
37:54 way. Of course, it significantly improves
37:57 the number of memories per compression.
37:59 I brought this plot directly
38:02 from Replit running in production. The
38:04 moment we kicked in our new sub-agent
38:07 orchestrator, on the y-axis you
38:09 can see the number of memories per
38:11 compression: we went from roughly 35
38:16 to 45-50 recently. So, a big improvement in
38:19 terms of how often we are recompressing
38:22 our context, just because we can offload
38:24 a lot of the context pollution by means
38:27 of using sub-agents.
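The sub-agent pattern just described (fresh context, minimal injected slice, run to completion, only the result returns) can be modeled in a few lines. The `model` callable below is a stand-in for an LLM call, and the keyword-based relevance filter is a deliberately naive assumption for illustration.

```python
# Toy model of sub-agent orchestration: each sub-agent gets a blank slate
# plus the least context it needs; only its result re-enters the main loop.

def run_subagent(model, task: str, context_slice: list[str]) -> str:
    fresh_context = context_slice + [task]   # blank slate + injected subset
    result = model(fresh_context)
    # fresh_context is discarded here; only the result survives.
    return result

def main_loop(model, tasks: list[str], shared_context: list[str]) -> list[str]:
    history = list(shared_context)
    for task in tasks:
        # Separation of concerns: inject only context mentioning this task's
        # topic (naive first-word match, purely for illustration).
        relevant = [c for c in shared_context if task.split()[0] in c]
        history.append(run_subagent(model, task, relevant))
    return history
```

Because each sub-agent's scratch context is thrown away, the main loop accumulates only compact results, which is exactly why compressions happen less often.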
38:29 I'm going to give an example where this
38:30 made the difference for us. What
38:32 I'm showing you here is more of
38:34 a cost optimization, in a sense:
38:36 you're compressing less. You also have
38:38 separation of concerns, which definitely
38:40 makes your agent a bit smarter. In the
38:42 case of testing,
38:45 working with sub-agents was almost
38:46 mandatory for us. We
38:48 started to work on automated testing
38:50 even before we were very advanced in
38:52 terms of sub-agent orchestration. What
38:55 we found is, of course, as I was
38:57 saying before, it makes things easier:
39:00 better cost, less pollution. But when you
39:03 allow the main loop not only to create
39:05 code but also to perform browser
39:08 actions, and put the observations of
39:10 those browser actions into the main loop,
39:12 you tend to confuse the agent loop
39:14 very much, because at this point there is
39:15 a lot of heterogeneity in the
39:17 actions your main loop is looking
39:20 at. So in order to make this work, not
39:22 only did we have to build the Playwright
39:23 framework that I was showing you
39:25 before, we also had to move our
39:27 entire architecture to sub-agents. At
39:29 this point you can see very clearly
39:30 why there is a separation of concerns
39:33 here. You have the main agent loop running.
39:35 We decide at a certain point that it's
39:37 time to verify whether the output of the
39:39 agent has been correct. We make this
39:41 happen entirely within a sub-agent. Then we
39:42 scratch the context window of that sub-
39:44 agent, return only the last
39:46 observation to the agent loop, and then
39:49 we keep running that way. So if
39:51 you're having issues today making your
39:53 sub-agents work correctly, this is
39:55 one of the reasons you want to
39:57 take a look at.
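The verification-in-a-sub-agent architecture above reduces to a simple shape: browser actions run inside a dedicated scratch context, and only the final observation re-enters the main loop. All names below are illustrative, and the actions are stubbed as plain callables.

```python
# Sketch: noisy browser-testing observations stay inside a verification
# sub-agent; the main loop only ever sees the last observation.

def verify_in_subagent(browser_actions) -> str:
    sub_context = []                 # scratch context, thrown away afterwards
    last_observation = "nothing ran"
    for action in browser_actions:
        obs = action()
        sub_context.append(obs)      # heterogeneous browser noise stays here
        last_observation = obs
    return last_observation          # the only thing the main loop sees

def agent_loop(code_steps, browser_actions):
    context = []
    for step in code_steps:
        context.append(f"wrote: {step}")
    # One compact line enters the main context, however many clicks happened.
    context.append(f"verification: {verify_in_subagent(browser_actions)}")
    return context
```

Keeping the action types homogeneous in the main loop (code edits plus one verification summary) is what avoids confusing it.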
40:00 So I think we covered, at a high level,
40:02 how to create more and more powerful
40:05 autonomous agents over time, and I only
40:07 see us as a field becoming even more
40:09 proficient at that in the next months.
40:11 There is one additional ingredient
40:12 though that is going to make the
40:14 difference and it's parallelism. And I
40:16 will argue that parallelism is important
40:19 not because it's going to make agents
40:21 more powerful per se, but rather because
40:23 it's going to make the user experience
40:27 more exciting. So of course it is great
40:29 to have an agent that is capable of
40:31 running autonomously for long, but at
40:33 the same time it comes with the price of
40:34 making the user experience less
40:37 thrilling. You are not in the zone
40:39 anymore. What you do is write a
40:41 very long prompt, it's translated into a
40:44 task list, then you go to have
40:45 lunch with your colleagues, and then you
40:47 come back and hope that the agent is
40:48 done. That is not the kind of experience
40:50 that most productive people want
40:52 to have in life. You want to
40:53 see as much work done as possible in
40:56 the shortest span of time.
40:59 So what we have done as a field at this
41:00 point has been to create parallel
41:03 agents. It's a very common trade-off
41:04 which, by the way, doesn't only apply to
41:06 agents; it applies to computing in
41:09 general. With parallel agents, what you
41:12 do is trade
41:14 extra compute in exchange for time. Why
41:16 is there this trade-off? First of all,
41:18 when you're running agents in parallel
41:21 you're gathering the same context in
41:23 multiple context windows. So every
41:25 single parallel agent that you will be
41:27 running probably shares say 80% of the
41:29 context across the board. So of course
41:32 you are just putting in more compute
41:34 because you're running those agents in
41:36 parallel. There is also another cost
41:39 that is kind of intangible for a lot of
41:40 you here in the room because I'm sure
41:43 you're all expert software developers.
41:45 But what do you do with the output of
41:47 multiple parallel agents at the end?
41:49 Oftentimes you need to resolve merge
41:51 conflicts. As a reminder, my users
41:53 don't even know the concept of
41:54 merge conflicts. It's something that we
41:58 have to figure out on our own. So the
41:59 current way in which we think of
42:01 parallel agents in the space doesn't
42:04 really apply to Replit. Now, at the same
42:05 time, I still very much want to
42:08 accomplish this. There are so many
42:10 interesting features that you can enable
42:11 with parallelism. Aside from the fact
42:14 that you can get more work done, at
42:16 times you want testing to be
42:18 running in parallel with the agent that
42:20 creates code. Testing, no matter how much
42:22 we optimize it, is still very slow. If an
42:24 agent is only spending time on testing,
42:26 users are not going to be engaging with
42:28 your application anymore. At the
42:29 same time, it's also great to have an
42:31 asynchronous process running while your
42:32 agent is running, because you can inject
42:34 useful information back into the main
42:37 core loop. And last but not least, there is a
42:40 very common technique that we know boosts
42:43 performance if you have enough budget to
42:45 do so: you should be sampling multiple
42:48 trajectories at the same time. So a lot
42:49 of perks are coming with parallel
42:52 agents. But the way in which we
42:54 implement them today, which I
42:56 basically call "user as the orchestrator",
42:59 is that the parallel tasks
43:00 that you want to run are determined by
43:03 you, the user, and each task is
43:05 dispatched in its own thread. So there
43:08 is a bit of a manual process: even the task
43:09 decomposition, in a sense, happens
43:11 in your mind while you're thinking about
43:14 which agents you want to run and then
43:16 the moment you get back all the results
43:17 you need to go through the problem of
43:20 merge conflicts and often times this is
43:22 not trivial at all no matter how many
43:24 amazing tools are out there. So what
43:27 we're working on today for our next
43:30 version of the agent is having the core
43:32 loop as the orchestrator. The key
43:35 difference here is that the
43:36 subtasks that we're going to be working
43:39 on are not determined by the user;
43:41 they are determined by the core loop,
43:43 and the parallelism is basically decided
43:46 on the fly. The agent does the task
43:48 decomposition on behalf of the user, and
43:50 this comes with a couple of advantages.
43:52 First of all, again, there's no cognitive
43:54 burden for the user to understand how
43:57 they should decompose the task. At
43:59 the same time, there are also ways in
44:03 which you can create tasks that
44:05 mitigate the problem of merge conflicts.
44:07 I'm not claiming that we're going to be
44:09 able to mitigate it 100%. There are so
44:11 many corner cases in which merge
44:13 conflict will still represent a problem
44:14 but there are a lot of different
44:16 techniques known in software engineering
44:18 to try to keep
44:20 multiple sub-agents from stepping on each
44:23 other's toes. So the core loop as the
44:26 orchestrator is going to be our main
44:29 bet for the next few months.
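The "core loop as the orchestrator" idea above can be sketched as follows: the agent, not the user, decomposes the work into subtasks that touch disjoint files (one known way to reduce merge conflicts) and dispatches them concurrently. The `decompose` stub and all file names are assumptions for illustration, not Replit's actual design.

```python
# Sketch: agent-driven task decomposition with parallel dispatch.
# Disjoint file sets per subtask are checked up front so that parallel
# sub-agents don't step on each other's toes.

from concurrent.futures import ThreadPoolExecutor

def decompose(task: str) -> list[dict]:
    # Stand-in for the LLM doing task decomposition on the user's behalf.
    return [
        {"name": "frontend", "files": {"app/ui.tsx"}},
        {"name": "backend",  "files": {"server/api.py"}},
    ]

def run_subtask(subtask: dict) -> str:
    return f"{subtask['name']} done, touched {sorted(subtask['files'])}"

def orchestrate(task: str) -> list[str]:
    subtasks = decompose(task)
    # Refuse decompositions whose subtasks would edit the same files.
    seen: set = set()
    for st in subtasks:
        assert not (st["files"] & seen), "subtasks would conflict"
        seen |= st["files"]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_subtask, subtasks))
```

Since the decomposition happens inside the loop, the parallelism is decided on the fly and the user never has to think about threads or merges.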
44:30 And in case you're passionate about
44:32 these topics,
44:34 [music] we're always hiring at Replit.
44:37 Thank you. [applause]
44:39 From transforming support tickets into
44:41 merge requests to helping teams ship
44:44 fixes faster than ever, our next
44:46 presenter has been at the center of
44:49 Zapier's AI agent journey. Please
44:57 [music] [applause]
45:04 Hello.
45:06 I'm so excited to tell you about how at
45:08 Zapier we are empowering our support
45:11 team to ship code. Before I tell you
45:14 about that, has anybody here visited the
45:16 Grand Canyon?
45:18 It's a good amount. Anybody rafted
45:21 through the Grand Canyon?
45:24 I see one person. I just got off an
45:26 18-day trip rafting through the Grand
45:28 Canyon over 200 miles. It was
45:31 incredible. No internet, no cell
45:33 service. The moment I got off, I found
45:36 out I was giving this talk. I didn't
45:39 think about uh work at all on the river,
45:41 but once I got off, I started thinking
45:43 about the parallels between the Grand
45:46 Canyon and Zapier. And we have one thing
45:50 in common and that is erosion.
45:52 Now natural erosion happens over
45:55 millions of years with wind, water and
45:58 time. It creates the beautiful canyon
46:02 that we experience and it's never
46:04 stopping, always continuing. At Zapier,
46:07 we have over 8,000 integrations built on
46:10 third party APIs and they are constantly
46:13 changing, which I'm now thinking of as
46:15 app erosion.
46:17 We've been around for 14 years. Some of
46:19 our apps are that old. API changes and
46:23 deprecations impact us and create
46:26 reliability issues. Again, it never stops.
46:30 So, I like to think of our apps as like
46:33 layers in the Grand Canyon, and they
46:35 need constant attention.
46:38 So, if we were to create our own Zapier
46:40 Canyon, our apps would be the
46:43 walls. Here's our support team flowing
46:46 down the middle watching out for app
46:50 erosion. And we have a backlog crisis.
46:52 Tickets were coming in faster than we
46:54 could handle them.
46:57 This creates integration reliability issues,
47:00 poor customer experience, even churn. So
47:02 to solve for app erosion, we kicked off
47:07 two parallel experiments. The first was
47:09 moving support from just triaging to
47:12 also fixing these bugs. That's experiment
47:15 number one. In experiment number two, we
47:18 asked: can AI help solve app
47:20 erosion faster?
47:23 So let's jump into experiment one. This
47:25 got kicked off two years ago, but we had to
47:27 start with the why. We needed to get
47:29 buy-in to empower our support team
47:32 to ship code.
47:35 So app erosion is one of the major
47:38 sources of bugs coming
47:41 from support to engineering, so there's a big
47:44 need. Support is eager [laughter] for
47:47 this experience; a lot of them want to
47:48 go into engineering eventually, and
47:51 unofficially, many support members were
47:54 already helping to maintain our apps.
47:56 This moves us to how we started this
48:00 out: we put on some guard rails. We started
48:03 with just four target apps to focus
48:06 our fixes on, engineering was set to
48:08 review any merge requests coming from
48:10 support, and we kept the focus on app fixes.
48:14 So jumping into experiment two, this is
48:15 what I've been leading for the last
48:18 couple of years: how can we use codegen
48:20 to help solve for app erosion? And
48:23 fortuitously, the name of this project
48:27 is Scout, which ties in so well to the
48:28 Grand Canyon experience that I've just
48:30 been through.
48:33 As any good product manager, we started
48:36 with discovery. We did some dogfooding,
48:39 so I shipped some app fixes. We
48:41 shadowed engineers and support team
48:43 members as they were going through the
48:47 app fix process. We mapped out
48:48 the pain points experienced along
48:51 the way, the phases of the work,
48:54 and how much time is spent.
48:57 One big discovery we had is how much
49:00 time is spent gathering context:
49:03 going to the third-party API docs,
49:06 even crawling the internet looking for
49:08 information about an emerging bug that
49:09 maybe somebody else has already
49:10 discovered and solved outside of
49:15 Zapier. Internal context, logs: all of
49:17 this is a lot of context to go and
49:21 search for as a human, and a lot to
49:24 grok and work through. This is something
49:28 we knew we needed to solve for.
49:33 Where we started with all these great
49:36 opportunities and pain points is we
49:39 started building APIs that we believed
49:42 would solve for these individual pain
49:46 points. Some of these APIs use
49:51 LLMs: our diagnosis tool, for example,
49:53 gathers all that context on behalf of
49:56 the support person or engineer,
49:58 curates that context, and builds a
50:00 diagnosis [clears throat] using
50:03 an LLM. And some don't: the
50:06 unit test
50:08 generator does, but the test case
50:11 finder simply uses a search query to
50:13 look for the right test cases to pull in
50:17 for your unit test. We built a bunch of
50:20 APIs. We had a bunch of great ideas. So
50:22 there was a lot for us to test with, but
50:24 we ran into some challenges in this
50:26 first phase. We had APIs, but they were
50:30 not embedded into our engineers' process.
50:33 As I just said, they don't like
50:36 to go to so many web pages to find all
50:38 their context; they would love all this
50:40 information to come to them. And yet we
50:42 had built a web interface, a
50:45 playground we call Autocode internally,
50:47 where you can come and play around with
50:51 our APIs. And our ask to the teams was:
50:54 come try out our APIs and give us feedback.
50:58 Now this is just one more window to go
50:59 to. So we didn't get a lot of
51:02 engagement. Also, because we had shipped
51:06 so many APIs, our team was spread
51:09 pretty thin. Cursor launched at the same
51:12 time, which has gotten great adoption at
51:15 Zapier. We're all huge fans of Cursor.
51:16 But from our side, it made some of our
51:20 tools no longer necessary.
51:21 But there was one major win in this
51:24 phase, which is that one of our APIs became a
51:27 support darling: diagnosis. That
51:29 number-one pain point of needing to go
51:31 out and find all of your context and curate
51:33 it for yourself so you can start solving
51:37 the problem: we were doing that on
51:39 the support team's behalf with the
51:42 diagnosis API.
51:45 And support loved it enough that they
51:48 decided to embed it into their process.
51:49 They asked us to build a Zapier
51:52 integration on our Autocode APIs so they
51:55 could embed it into the Zap that
51:57 creates the Jira ticket from the support
52:02 issue, and now diagnosis is included.
52:05 So embedding tools is the key to usage,
52:07 as we found out. So how could we embed more
52:11 of our tools? Well, then MCP came along,
52:14 and that solved our problem.
52:19 We can now embed these API tools into
52:21 our engineers' workflow. Specifically,
52:24 our engineers are pulling in these MCP
52:27 tools as they're using Cursor.
52:31 Our builders using Scout MCP tools are
52:34 leaving the IDE less, spending more time
52:36 in one window.
52:40 Still, we ran into challenges. Our
52:42 key tool, diagnosis,
52:45 is so valuable because it pulls all that
52:48 context and provides a recommendation,
52:51 but it takes a long time to run. Now, we
52:54 might drive down that runtime; however, as
52:56 you're working synchronously on a ticket
52:58 in your IDE, this was frustrating. We
53:00 also weren't keeping up with the
53:03 customization needs. Not only did MCP
53:05 launch and we started leveraging it, Zapier
53:07 MCP launched too. And when we weren't
53:09 keeping up with the
53:12 customization needs of some of our tools,
53:16 our engineers internally looked to Zapier
53:17 MCP, which is great. We're all on the same team
53:19 solving the same problem, but some of
53:22 our tools hit a dead end. Also, adoption
53:25 was scattered. We had a whole suite of
53:26 tools, and we thought there was value in
53:28 each of them, as each solves for different
53:32 problems across the different stages.
53:34 Not every engineer was using our tools,
53:36 and if they were, they were
53:39 only using a few of them. So we have
53:45 we were under the hypothesis that true
53:47 value is going to come from tying these
53:49 tools together.
53:51 So what if we owned orchestration of
53:54 these tools? Rather than saying "here's a
53:56 suite of tools, use them as you wish,"
53:59 what if we combined them and created an
54:02 agent to orchestrate them? This we
54:05 are calling Scout agent. We take the
54:09 diagnosis, run it against a ticket,
54:11 and use that information to spin up
54:14 a codegen tool, which then produces a
54:16 merge request using all the right context.
54:20 So who would benefit the most from
54:22 orchestration? There are several
54:25 integration teams at Zapier who are
54:27 solving for these app fixes of various
54:29 levels of complexity and there's the
54:32 support team. So when we're saying who
54:33 should be our first customer scout
54:36 agent, we're thinking it should probably
54:39 be the the team fielding small bugs that
54:41 are emergent and coming hot off the
54:44 queue which is the support team. And now
54:47 our two experiments merge
54:49 and we have scout agent. We are building
54:52 for the support team.
54:54 And this is the flow of how it works.
54:57 Support is submitting an issue to scout
55:01 agent. We first categorize the issue. We
55:04 next assess its fixability.
55:07 Not every issue that comes from support
55:10 can be fixed. If Scout thinks it's fixable,
55:12 we'll move on to generating a merge
55:15 request. At that point, the support
55:17 team, this is the first time they're
55:18 picking up the ticket. It already has a
55:21 merge request attached to it. They'll
55:25 review and test. If it's not satisfying
55:28 what they believe the actual solution
55:31 should be to
55:34 best address the customer's need,
55:36 they will make a request for an
55:37 adjustment. That can happen right in
55:39 GitLab, which is where we do our work,
55:41 and Scout will do another pass.
55:43 Hopefully at that point we've gotten it
55:45 right, and support can submit that MR for
55:48 review from engineering.
55:50 How are we running Scout? It's all
55:52 kicked off by a Zap. This is a picture
55:54 of one of our Zaps. There are many Zaps
55:56 that run this whole process, and it
55:58 embeds right into our support team's
56:00 Zaps. We do a ton of dogfooding at Zapier.
56:05 We first run diagnosis and post the
56:07 result to the Jira ticket, saying what
56:09 the categorization is and whether we believe
56:11 it's fixable. If we do believe
56:14 it's fixable, we then kick off a
56:17 GitLab CI/CD pipeline.
56:18 And we run three phases in that
56:22 pipeline: plan, execute, and validate, to
56:24 generate this merge request. The tools
56:28 used in this pipeline are Scout MCP. So all
56:31 those APIs we invested in a year ago
56:33 are now really coming together, and we're
56:36 orchestrating them within the GitLab
56:38 pipeline. We're also leveraging the
56:41 Cursor SDK.
56:42 Once the merge request has been
56:45 completed, we attach it to Jira and
56:47 support picks it up.
56:50 The latest addition to this is
56:56 rapid iteration. Once a
56:58 ticket has been posted with the merge
56:59 request, and the support team looks at
57:01 it and says it needs some
57:04 tweaks, to save them more time, so they
57:05 don't have to pull it down into their
57:07 IDE, do the fixes, and push it back up,
57:10 they can simply chat with the Scout
57:14 agent in GitLab. That kicks off another
57:16 pipeline, which redoes that phase with
57:19 the new feedback and posts a new
57:22 merge request.
57:24 On our side, we want to make sure Scout
57:26 agent is working, so we ask three
57:29 questions: was the categorization right,
57:31 was it actually fixable, and was the
57:34 code fix accurate? So far we have two
57:39 evals, at 75% accuracy for categorization and
57:42 fixability. As we get more feedback and
57:44 process more tickets, those become our
57:46 test cases and we can move forward
57:50 improving Scout agent over time. So what
57:52 has been Scout agent's impact on app erosion?
57:58 40% of the support team's app fixes
58:01 are being generated by Scout. So we're
58:04 doing more of the work on behalf of the
58:06 support team.
58:08 For some of our
58:10 support team, this is doubling their
58:12 velocity, which already is amazing. That's
58:17 going from a support team that wasn't
58:19 officially shipping any fixes (well, unofficially
58:21 they sometimes were), to shipping one
58:23 to two per week per person, to now
58:25 shipping three to four with the help of Scout.
58:30 Another process improvement: Scout
58:32 puts potentially fixable tickets right
58:36 there in the triage flow. It takes away a
58:37 lot of the friction of looking for
58:40 something to grab from the backlog.
58:42 It's not just support who's
58:44 benefiting; it's also engineering.
58:46 An engineering manager said it's a
58:49 great example: when it works, this
58:51 tool allows us to stay focused on the
58:53 more complex stuff.
58:55 And if you take away anything from this
58:57 talk, I hope it is that there is
59:01 really powerful magic in empowering
59:03 support with codegen and
59:05 allowing them to ship fixes, because they
59:07 have three superpowers. First, they
59:10 are the closest to customer pain, which
59:12 means they're closest to the context that
59:14 really matters for figuring out what
59:16 the problem is and how to solve it. They're
59:20 also troubleshooting in real time. These
59:22 tickets aren't stale: the context is
59:25 fresh, the logs aren't missing. Put
59:27 this ticket into an engineering backlog
59:29 months later and you might not have access
59:32 to those logs anymore. And then three,
59:35 they're best at validation.
59:37 Again, if you put the same ticket
59:40 into an engineering backlog, the
59:42 solution an engineer comes up with
59:44 may change the behavior, and that might
59:47 be good for some customers but might not
59:49 necessarily be best for the one
59:53 customer who wrote in about the problem.
59:58 And one other major benefit of this is, uh, support team members who have been part of this experiment are now engineers.
60:05 I want to say thank you to the amazing team who's helped build this process and built all the tools and the Scout agent. Andy is actually here in the audience, so shout out to Andy. If you want to talk about any of the technical bits, he's here.
60:17 And I want to impress upon you two things: we're hiring, but mostly, if you haven't rafted through the Grand Canyon, please consider it. It's life-changing, and you should go with OARS. Thank you very much.
60:31 [applause]
60:43 Our next presenters believe that 2026 is the year the IDE died. Please join me in welcoming to the stage engineering leader at Sourcegraph and Amp, Steve Yegge, and author and researcher at IT Revolution, Gene Kim.
61:00 [music]
61:08 Hey everybody. Um, really happy to be here. I'm going to be talking the first half. Co-author here, Gene Kim, is going to talk the second half. All right. Looking forward to it. Cheers.
61:14 All right. Today I'm going to Well, we're going to talk real fast. This time is going to go down fast. Uh, I'm going to talk to you about what tools look like next year. Last year I was talking to you all about chat and everybody ignored me, and now everybody's using chat this year, and it's like, we're going to fix that right now.
61:31 All right. So here's what it's looked like. I'm going to tell you right now, everyone's in love with Claude Code. There's probably 40 competitors out there. Claude Code ain't it.
61:45 Completions wasn't it. I love Claude Code. I use it 14 hours a day. I mean, come on. But it ain't it. Developers aren't adopting it. I'm going to talk about why in this talk. I'm going to talk about what you can do about it and what to look forward to. But the reason is they're too hard. Okay. Uh, cognitive overhead. Uh, they lie, cheat, and steal. Gene and I talk a lot about this in our book, all the different ways that they can lie, cheat, and steal. And uh most devs just don't like this.
62:12 I have come to understand that Claude Code is very much like a drill or a saw, an electric one, right? How much damage can you do as an untrained person with a drill, right? Or a saw. Yeah. How much damage can you do as an untrained engineer with Claude Code? It's real similar. Yeah. You can cut your foot off, but you can also be really, really skilled with it and do really precision work, right? Like a craftsman.
62:41 The problem is software is infinitely large. Our ambition is infinitely large. And so the analogy that I want to share with you is next year will be the year of moving from saws and drills to CNC machines. A CNC machine, you strap a drill on and you give it coordinates and it moves it around, very precise, right? We've been doing this for centuries and we're not going to stop this year.
63:09 One thing I hear people say is, "Well, the models have plateaued." This is real common. Your engineers are probably saying this. Okay, even if they've plateaued, we have still discovered steam and electricity, and it's going to take us a little time to harness it. But it's strictly an engineering problem at this point. All code within a year, year and a half, will be written by giant grinding machines overseen by engineers who no longer actually look at the code directly anymore.
63:37 Weird new world. That is where we are going. Oh my gosh. Yeah. This this slide. So Gene and I talked to Andrew Glover, who, I don't know, is he here? From OpenAI. And he said that they have this incredible dichotomy unfolding at OpenAI where, you know, some percentage of their engineers are using Codex, and then some other percentage, a larger percentage, are not using Codex, and the difference in productivity is so staggering that they're now having alarms going off at performance review time, because how do you compare these two engineers who are the same level, same title, same everything, and one of them is 10 times as productive as the other one by any measure?
64:13 And the answer is they're freaking out. They may have to fire 50% of their engineers. And this is unfolding at other companies, too.
64:21 Who is refusing it? It's the senior and staff engineers. How many minutes are we at?
64:28 >> Eight [clears throat] minutes.
64:29 >> We're perfect. This is just like what happened to the Swiss mechanical watch industry over a couple of Well, it was built up for a couple of centuries, and then quartz killed it, you know, within a couple of years. And what happened was the craftsmen were doing the same thing our staff engineers are doing today: "No. Cheap."
64:49 That's word for word, right? That's what they say.
64:54 All right. I didn't know where to put this slide. This is this is Claude's view of what next year looks like. And I was just like, what do you think it's going to look like? And it actually does kind of look like this. Most of the words will be spelled correctly next year. But this is a lot prettier than Claude Code.
65:11 Yeah, this is what it has to look like. Some form of a UI, not an IDE. This is the new IDE. Okay. And people are building it. In fact, I think the company that's the furthest along in this is Replit, who just talked to you. I think it's amazing what they're doing. It's absolutely bravo, right? We should not be all chasing tail lights and building command line interfaces anymore. All right. And, and more
65:37 importantly, Claude Code and all of its, you know, competitors, they're all doing it wrong, because they're building the world's biggest ant. Okay, this is my my buddy Brendan Hopper at Commonwealth Bank of Australia, right? He's like, "Nature builds ant swarms, and Claude Code built this huge muscular ant that's just going to bite you in half and take all your resources," right?
65:57 I mean, it's a serious problem, right? If I say, "Please analyze this codebase," I, you know, go to the expensive model. If I say, "Is my .gitignore file still there?" I've also gone to the expensive model, right? Everything that you say goes to the expensive model. So, what's
66:09 going to happen? Whoa. What happened? Oh gosh, my slides are all messed up now. Can you guys see them?
66:18 >> No.
66:18 >> Oh, this always happens to me, man. There's something going on. All right.
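The "everything goes to the expensive model" complaint above can be sketched as a small per-prompt router that sends trivial requests to a cheap model tier. This is a minimal illustrative sketch: the model names, word-count cutoff, and keyword heuristics are assumptions for demonstration, not any real product's routing logic (a production router would more likely use a small classifier model).

```python
# Sketch of the routing the speaker says is missing: trivial requests
# go to a cheap model instead of everything hitting the expensive one.
# Model names and heuristics below are illustrative assumptions.

CHEAP, EXPENSIVE = "small-model", "frontier-model"

# Crude surface-feature hints; a real router might classify with a model.
TRIVIAL_HINTS = ("is my", "does the file", "list", "show me", "still there")

def route(prompt: str) -> str:
    """Pick a model tier for a prompt based on simple surface features."""
    p = prompt.lower()
    # Short prompts that look like lookups stay on the cheap tier.
    if len(p.split()) < 12 and any(hint in p for hint in TRIVIAL_HINTS):
        return CHEAP       # e.g. "Is my .gitignore file still there?"
    return EXPENSIVE       # e.g. "Please analyze this codebase"

print(route("Is my .gitignore file still there?"))                      # small-model
print(route("Please analyze this codebase and find race conditions"))   # frontier-model
```

The point of the sketch is only the dispatch structure: one cheap default path for lookups, one expensive path for open-ended analysis.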
66:22 So, I thought of a really cool analogy called the diver metaphor, which is: your context window is like an oxygen tank. Okay. This is why these things are fundamentally wrong, because you're sending a diver down into your codebase, underwater, to swim around and take care of stuff for you. One diver. And we're like, we're going to give him a bigger tank: one million tokens. He's still going to run out of oxygen.
66:46 Like, you don't, right? You should send a product manager diver down first, and then a coding diver, right? And then a review diver and a test diver and a git merge diver, etc. Right? Nobody's doing this. Everyone's building a bigger diver.
67:00 I don't know, my slides are all messed up. My my my talk is almost done. But um, what we do as engineers is task decomposition, successive refinement, components, black boxes. This is how it's going to be built in the future. And it's going to be built with lots and lots of agents, not just one agent.
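The diver metaphor, many specialized agents each starting with a fresh context window instead of one agent dragging a single ever-growing context through every stage, can be sketched roughly as a pipeline. Everything here is a hypothetical stand-in: the stage names, the `Agent` class, and `run`, which just records the handoff where a real system would make an LLM call.

```python
# Sketch of "many divers": a pipeline of specialized agents, each with a
# FRESH context, handing off a summary rather than the whole conversation.
# All names here are illustrative assumptions; `run` stands in for a real
# LLM call.

from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str
    system_prompt: str
    context: list = field(default_factory=list)  # fresh "oxygen tank" per agent

    def run(self, task: str) -> str:
        # Placeholder for a model call; records the handoff it received.
        self.context.append(task)
        return f"[{self.role}] handled: {task[:40]}"

PIPELINE = [
    Agent("product-manager", "Turn the request into a concrete spec."),
    Agent("coder", "Implement the spec."),
    Agent("reviewer", "Review the diff for bugs."),
    Agent("tester", "Write and run tests."),
    Agent("merger", "Merge if green."),
]

def run_pipeline(request: str) -> list:
    """Each stage sees only the previous stage's output, so no single
    context window has to hold the entire job."""
    results, handoff = [], request
    for agent in PIPELINE:
        handoff = agent.run(handoff)
        results.append(handoff)
    return results

for line in run_pipeline("Add dark mode to settings page"):
    print(line)
```

The design choice the metaphor argues for is visible in `run_pipeline`: the handoff is the only thing that crosses stages, so each agent starts with near-empty context instead of inheriting everything upstream.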
67:19 All right. Until then, I think we're out of time, but so until then, learn Claude Code. Give up your IDE. Swyx told me he wants some hot take, so I'll give you one. If you're still using an IDE, I'll give you till January 1st.
67:34 You're a bad engineer.
67:38 There's your hot take. All right, folks.
67:41 [applause]
67:45 All right, cheers. Well, that was actually my talk. Um [clears throat] uh, uh, learn coding agents. And oh yeah, then there's this guy. Speaking of bad engineers, so this is this is Jordan Hubbard, uh, who uh who's at NVIDIA, and he tweeted on LinkedIn a really nice post on how to get the most out of agents, and this guy responded with this, right? This is everyone in your org, or this is 60% of your org right here. This guy's not an outlier. Okay, the backlash is very real against this. Yeah.
68:12 And this is going to be a problem. I'm not going to I'm not going to share with you, I don't have time to share, how to fix it, but it's something you should be aware of. And anyway, I'm going to turn it over to my co-author, Gene. We had a lot to talk about. He's got a lot to go. So, let's turn it over to Gene.
68:23 >> Yeah. Thank you, Steve.
68:24 >> Hi, buddy. [applause]
68:27 >> Yeah. By the way, um, let me start off by introducing myself, and then I want to share a little bit about what it's been like working with Steve on the vibe coding book. Uh, and so just a little bit about myself: I've had the privilege of studying high performing technology organizations for 26 years. And that was a journey that started when I was a technical founder uh of a company called Tripwire. I was there for 13 years.
68:48 But our mission was really to understand these amazing high performing technology organizations. They had the best project due date performance in development, the best operational reliability and stability, and also the best posture in security and compliance. So we wanted to understand how those amazing organizations made their good to great transformation, so we could understand how other organizations can replicate those amazing outcomes.
69:07 And so you can imagine, in that 26 year journey, there were many surprises. Among the biggest surprises was how it took me into the middle of the DevOps movement, which is so uh amazing, because it reshaped technology organizations. You know, it changed how test and operations worked, information security. Um, and I thought that would be the most exciting adventure I'd be on in my career, until I met Steve Yegge in person. And so, I've admired his work for over 11 years. And so, some of you may have read this memo of Jeff Bezos's
69:31 most audacious memo, of how in the early 2000s they transformed from a gigantic monolith that coupled 3,500 engineers together, so none of them had independent action. And uh he talked about how all teams must henceforth communicate and coordinate only through APIs. No back doors allowed. Right? Uh, anyone who doesn't do this will be fired. Thank you and have a nice day.
69:49 And the amazing person who chronicled this says number seven is obviously a joke, because Bezos doesn't care whether you have a good day or not. And this was actually enforced by Amazon's CIO at the time, Rick Dalzell. And so it turns out this memo that I've been quoting for 11 years uh was written by Steve Yegge, uh, which was meant to be a private uh memo on Google+, which was made public, which landed him on the front page of the Wall Street Journal. Um, and so I finally met him in
70:13 uh June, and it turns out that we had many things in common. Uh, but one of them was this uh love of AI, and this sense that AI was going to shape coding from underneath us. And so one of our beliefs is that uh AI will reshape technology organizations, you know, maybe even 100 times larger than what agile, cloud, CI/CD, and mobile did, you know, 10 years ago.
70:33 Um, and these technology breakthroughs not just reshape organizations, they reshape the entire economy. The entire economy rearranges itself to take advantage of these, you know, wild new better ways of uh uh producing things. And and uh so over the last year and a half, we've had a chance to look at these case studies that I think give us a glimpse of what the shape of technology organizations will look like, and so I'm going to share with you what we've learned.
70:55 But here's maybe a hint. So some of you may know the work of Adrian Cockcroft. He was a cloud architect at Netflix, right? He was the one who drove uh the uh entire Netflix infrastructure from a data center, uh, back in 2009, to running entirely in the AWS cloud. And so he wrote, uh, some months ago: in 2011 some people got very upset in uh infrastructure and operations because they called it NoOps, right, and everyone laughed back then. But he said, oh, don't you know, uh, it's happening again; this time it might be called NoDev, right? Not so funny now, right?
71:24 So it's it's interesting, right, because we heard this amazing presentation from Zapier about, like, how support ships, and it turns out designers are shipping, UX is shipping, right? Anyone who's been frustrated by developers uh who, you know, say get in line, and you have to wait quarters or years or maybe never, right, is now suddenly in a position where you can actually vibe code your own features into production, right? And that reshapes technology organizations, and it reshapes, you know, potentially the entire economy. And so,
71:48 uh, uh, Steve and I, we've had the privilege of watching what happens, you know, when we change, uh, you know, the way we, uh, deploy, right? It wasn't so long ago, 10 years ago, uh, I wrote a book called The Phoenix Project, where it was all about the catastrophic deployment. Would you believe, uh, that, you know, 10 years ago, 15 years ago, most organizations shipped once a year, right? Right.
72:07 And so I got to work on a project called the State of DevOps research. It was a cross-population study that spanned 36,000 respondents, uh, from 2013 to 2019. This was with Dr. Nicole Forsgren and Jez Humble. Um, and what we found was that these high performers ship multiple times a day, right? They can ship in one hour or less.
72:23 And you know, back in 2009, people thought, "Oh my gosh, multiple deployments per day, right? That's reckless and irresponsible, maybe even immoral, right? What sort of maniac would deploy multiple times a day?" Right? And yet it's very commonplace these days. In fact, if you want to have great reliability profiles, if you want to have short mean time to repair, you have to do smaller deployments more frequently. And I think we're now seeing
72:42 these kinds of case studies that show that this better way of coding, right, where you don't type in code by hand, might be, you know, just a vastly better way uh to create value. And so the definition of vibe coding that we put into the uh vibe coding book was that it's basically anything where you don't type in code by hand. And for those of you who don't understand that, that's, like, sort of uh typing in an IDE, hunched over, right? And you're actually moving your fingers, right? That's sort of like how some people go into a dark room to develop photographs, right? Believe it or not, some people still do that.
73:11 Um, and that's a great definition that we uh loved, until uh Dario Amodei, uh, CEO and co-founder of um Anthropic, gave us an even better definition, right? That vibe coding is really the iterative conversation uh that results in AI writing your code. And he said it's on one hand a beautiful term, because it evokes this different way of coding, but he said it's also somewhat misleading, because it sounds jokey, right? Uh, but he said, you know, at Anthropic there's no other game in town, right? And I just thought that was just a beautiful way to evoke, you know, how important uh vibe coding is. Uh, this
73:42 how important uh vibe coding is uh this is Dr. Eric Meyer um you he's probably
73:44 This is Dr. Erik Meijer. He's probably considered one of the greatest programming language designers of all time. He was part of Visual Basic, C#, LINQ, and Haskell, and he created the Hack programming language that migrated millions of lines of code at Meta within a year, bringing static type checking to a bunch of PHP programmers. And he said: we are probably going to be the last generation of developers to write code by hand, so let's have fun doing it.
74:08 let's have fun doing it. Um so one of the things and uh when uh Steve and I
74:09 the things and uh when uh Steve and I started working on the book last
74:10 started working on the book last November was uh watching him spend
74:12 November was uh watching him spend hundreds of dollars a day on coding
74:14 hundreds of dollars a day on coding agents uh and just seemed so strange
74:17 agents uh and just seemed so strange right um you know and so he's maxing out
74:19 right um you know and so he's maxing out not just you know the uh the monthly
74:21 not just you know the uh the monthly subscriptions right but he's actually
74:23 subscriptions right but he's actually you know going way above and beyond that
74:25 you know going way above and beyond that and yet uh you know things that we're
74:27 and yet uh you know things that we're hearing now is that as an engineer part
74:29 hearing now is that as an engineer part of my job is that I need to be spending
74:30 of my job is that I need to be spending as much on tokens per day as my salary
74:33 as much on tokens per day as my salary right so you know that think about like
74:35 right so you know that think about like $500 to $1,000 a day, right? Because
74:37 $500 to $1,000 a day, right? Because this is the mechanical advantage, the
74:38 this is the mechanical advantage, the cognitive advantage that these tools are
74:40 cognitive advantage that these tools are giving us, right? And as an engineer,
74:41 giving us, right? And as an engineer, right, I'm going to challenge myself,
74:42 right, I'm going to challenge myself, you know, to get that kind of value to
74:44 you know, to get that kind of value to deliver value to people who matter. Um,
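The "token spend comparable to salary" framing is easy to sanity-check: at the quoted $500 to $1,000 a day, a year of workdays lands in typical senior-engineer salary territory. A quick back-of-envelope, where the 250-workdays-per-year figure is an assumption of mine, not from the talk:

```python
# Back-of-envelope: does $500-$1,000/day on tokens really rival a salary?
# Assumes ~250 working days per year (not a figure from the talk).
workdays_per_year = 250

for daily_spend in (500, 1000):
    annual = daily_spend * workdays_per_year
    print(f"${daily_spend}/day -> ${annual:,}/year")
```

That is $125,000 to $250,000 a year, which is why the claim reads as "as much on tokens as my salary."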
74:46 And so in the book we talk about why people would do this, and the acronym we came up with is FAAFO. The most obvious letter is F for faster. That's obviously true, but I think it's the most superficial part of why we do this. The first A is that it lets us do more ambitious things: the impossible becomes possible. That's one end of the spectrum. On the other end of the spectrum, the tedious and small tasks become free. One of the things I loved from the interview with the Claude Code team, I think it was Catherine, she said: one of the things we've noticed is that when customer issues come up, instead of putting them on a Jira backlog and arguing about them in grooming sessions and so forth, we just fix them on the spot and ship to production within 30 minutes. So yes, it gets recorded, but that whole coordination cost just disappears. So again: the impossible becomes possible, and the annoying things become free. The second A is the ability to do things alone, or more autonomously. There are really two coordination taxes being alleviated here. One is: if you ever have to wait for a developer, or a team of developers, to do what you need done, you have to communicate and coordinate and synchronize and prioritize and cajole and escalate, do all sorts of things to get them to care about the problem as much as you do. And now, with these amazing, miraculous technologies, you can do it by yourself. The other coordination tax is that even if you get someone to care about a problem as much as you do, they can't read your mind. And what we're finding is that these LLMs are just amazing intermediation vehicles. Through an LLM you can coordinate with other functional specialties through a markdown file. That's not the end state, but it's this amazing way to have high-bandwidth coordination, so that you can essentially read each other's minds, because shared outcomes require shared goals and shared understanding.
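The "coordinate through a markdown file" idea usually means a shared plan document that both the humans and the agent read and update. A hypothetical example, where the file name and every field are illustrative rather than anything the talk prescribes:

```markdown
<!-- PLAN.md: hypothetical shared spec a human edits and an agent reads -->
# Feature: export report as CSV

## Goal
Users can download the monthly report as a CSV file.

## Constraints
- Reuse the existing `ReportService`; do not add new dependencies.
- Column order must match the on-screen table.

## Open questions
- Should an empty report download an empty file or show an error?

## Status
- [x] Agent: draft endpoint and serializer
- [ ] Human: review column ordering against the UI
```

The file carries the shared goals and shared understanding the speaker mentions: the domain expert states intent and constraints in prose, and the agent (or another specialty) works from the same artifact.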
76:48 The second F is fun. As Steve says, vibe coding is addictive, and this is so true. I think what I love about the book is that it's a story about two guys who both thought their best days of coding were behind them, and found that it's entirely the opposite. I've had so much fun, and I'm having to force myself to go to sleep at night, because otherwise I'd be up until 2 or 3 in the morning every night. So it's not all great, but it certainly beats being boring or tedious or horrible. And then O is optionality. One of the things I love about Swyx is that we have a shared love of creating option value, and he told us last night that option value is also important for poker players, because you never want to paint yourself into a corner. Option value is one of the biggest creators of economic value; the reason modularity is so powerful is that it creates option value. And just the fact that you can take so many more swings of the bat, run so many more parallel experiments: this is what vibe coding allows. So this gives us confidence that this is a very powerful tool.
77:50 powerful tool. Um uh here's the quote from Andy Glover that uh Steve Yaggi
77:52 from Andy Glover that uh Steve Yaggi said is that you know as um for people
77:55 said is that you know as um for people who have this aha moment and and
77:56 who have this aha moment and and positioned uh you know I think the
77:58 positioned uh you know I think the instinct is how do we elevate everyone's
78:00 instinct is how do we elevate everyone's productivity to be as productive as you
78:02 productivity to be as productive as you are now being um you know that since
78:04 are now being um you know that since you've had your aha moment. So uh let me
78:08 you've had your aha moment. So uh let me share with you maybe some of our top
78:10 share with you maybe some of our top kind of uh exciting case studies that
78:12 kind of uh exciting case studies that kind of give us a hint of the future. So
78:14 kind of give us a hint of the future. So uh I've run into this conference called
78:15 uh I've run into this conference called the enterprise technology leadership
78:16 the enterprise technology leadership summit for uh 11 years now and Swix we
78:19 summit for uh 11 years now and Swix we had uh the honor of having Swix there
78:21 had uh the honor of having Swix there talking about the rise of the AI
78:23 talking about the rise of the AI engineer just this amazing
78:24 engineer just this amazing prognostication. Uh this year we had a
78:26 prognostication. Uh this year we had a series of amazing uh case studies. One
78:28 series of amazing uh case studies. One was uh Bruno Pasos. He spoke this year
78:30 was uh Bruno Pasos. He spoke this year uh last year at this conference and he
78:32 uh last year at this conference and he presented on uh their in their evolving
78:35 presented on uh their in their evolving experiment to elevate developer
78:36 experiment to elevate developer productivity across 3,000 developers. Um
78:39 productivity across 3,000 developers. Um and this is at Booking.com, the world's
78:41 and this is at Booking.com, the world's largest travel agency and they're
78:43 largest travel agency and they're finding that they're getting double-
78:44 finding that they're getting double- digit increase in productivity, right?
78:45 digit increase in productivity, right? Uh mergers are going in quicker, peer
78:48 Uh mergers are going in quicker, peer review times are uh smaller and and so
78:50 review times are uh smaller and and so forth, right? And so that's just we feel
78:52 forth, right? And so that's just we feel like that's a incomplete view of uh what
78:55 like that's a incomplete view of uh what people are achieving. Uh this is Shri
78:57 people are achieving. Uh this is Shri Balakrishnan. uh he was head of product
78:59 Balakrishnan. uh he was head of product and technology at uh Travelopia. Uh so
79:01 and technology at uh Travelopia. Uh so they're a $ 1.5 billion a year uh travel
79:04 they're a $ 1.5 billion a year uh travel company and one of the things that uh he
79:06 company and one of the things that uh he said is that uh you know they were able
79:08 said is that uh you know they were able to uh replace a legacy application uh in
79:11 to uh replace a legacy application uh in six weeks with a pair of uh with a very
79:13 six weeks with a pair of uh with a very small team. In fact, one of his uh
79:15 small team. In fact, one of his uh conclusions is that before we would need
79:17 conclusions is that before we would need a team of eight people to do something
79:19 a team of eight people to do something meaningful, right? Six developers, a UX
79:21 meaningful, right? Six developers, a UX person and a product owner. and he said
79:23 person and a product owner. and he said maybe these days it might be two a
79:25 maybe these days it might be two a developer and you know a a domain expert
79:27 developer and you know a a domain expert in other words as Kent Beck said a
79:29 in other words as Kent Beck said a person with a problem and a person who
79:30 person with a problem and a person who can solve it right maybe maybe a pair of
79:34 can solve it right maybe maybe a pair of those teams right and so that's going to
79:35 those teams right and so that's going to reshape I think you know how they can go
79:38 reshape I think you know how they can go further and faster uh so again maybe
79:40 further and faster uh so again maybe just a hint of what teams will look like
79:42 just a hint of what teams will look like in the future this is the one that
79:43 in the future this is the one that excites me most this is Dr. top pal uh
79:46 excites me most this is Dr. top pal uh he helped drive the DevOps move in at
79:47 he helped drive the DevOps move in at Capital One um and he's now at uh
79:50 Capital One um and he's now at uh Fidelity and so um among other things he
79:54 Fidelity and so um among other things he owns an application uh that is the
79:56 owns an application uh that is the application you go to ask which
79:58 application you go to ask which applications you know the 25,000
79:59 applications you know the 25,000 applications there have log 4J right and
80:02 applications there have log 4J right and uh it's his team and he's had this
80:04 uh it's his team and he's had this vision of what this application should
80:06 vision of what this application should look like uh but every time he asked
80:08 look like uh but every time he asked like can can we build it his team would
80:09 like can can we build it his team would say it would take about five months
80:11 say it would take about five months right and we'd hire need to hire a a
80:13 right and we'd hire need to hire a a front-end person and he got so
80:14 front-end person and he got so frustrated that he spent five days just
80:17 frustrated that he spent five days just vibe coding it by himself right uh you
80:19 vibe coding it by himself right uh you know directly accessing read only the
80:21 know directly accessing read only the Neo4j uh database um and put it into
80:24 Neo4j uh database um and put it into production right and so I think we're
80:25 production right and so I think we're seeing a world where um you know leaders
80:29 seeing a world where um you know leaders even leaders with their own teams are
80:31 even leaders with their own teams are frustrated saying hey I can do this uh
80:33 frustrated saying hey I can do this uh can I do this better myself not better
80:35 can I do this better myself not better just can I prove that it can be done and
80:37 just can I prove that it can be done and uh by the way what happened afterwards
80:39 uh by the way what happened afterwards um he was looking around who can help me
80:40 um he was looking around who can help me maintain my application production and
80:42 maintain my application production and all the senior engineers like not
80:44 all the senior engineers like not So enter uh Swathy the most junior
80:47 So enter uh Swathy the most junior engineer on the team uh who is helping
80:49 engineer on the team uh who is helping maintain this application and probably
80:50 maintain this application and probably outarning you know everybody in the
80:52 outarning you know everybody in the organization
80:54 organization uh and interestingly uh he he's also
80:56 uh and interestingly uh he he's also getting more headcount because the
80:58 getting more headcount because the number of consumers of this application
80:59 number of consumers of this application just increased by 10fold right so who
81:01 just increased by 10fold right so who saw that coming right um so uh here's
81:05 saw that coming right um so uh here's John Rouser he's senior director of
81:06 John Rouser he's senior director of engineering at Cisco security and he
81:08 engineering at Cisco security and he convinces SVP to um require 100 of the
81:12 convinces SVP to um require 100 of the top leaders inside of Cisco security to
81:14 top leaders inside of Cisco security to vibe code one feature into production in
81:16 vibe code one feature into production in a quarter that ended last month, right?
81:19 a quarter that ended last month, right? And so um you know we're actually
81:21 And so um you know we're actually getting a chance to be able to survey
81:23 getting a chance to be able to survey those people, right? Who finished? Uh
81:25 those people, right? Who finished? Uh you know uh how many completed, didn't
81:28 you know uh how many completed, didn't complete, partially completed, etc. And
81:30 complete, partially completed, etc. And of those who completed, right, what was
81:32 of those who completed, right, what was what aha moment did they have as a
81:34 what aha moment did they have as a leader? What's the magnitude and
81:36 leader? What's the magnitude and direction of what they want to do? And
81:37 direction of what they want to do? And so we're going to go in and study that.
81:38 so we're going to go in and study that. And I just I my prediction is that we're
81:40 And I just I my prediction is that we're going to see parts of that organization
81:42 going to see parts of that organization get reshaped as leaders realize kind of
81:45 get reshaped as leaders realize kind of what's possible. Everything from
81:46 what's possible. Everything from strategy to processes and so forth. And
81:49 strategy to processes and so forth. And so let me just share with you one um you
81:51 so let me just share with you one um you know thing that really excites me which
81:53 know thing that really excites me which is uh I got a chance to uh get back into
81:55 is uh I got a chance to uh get back into the state of DevOps research the Dora
81:57 the state of DevOps research the Dora study with uh u the Google cloud team
82:00 study with uh u the Google cloud team and one of the things that didn't make
82:01 and one of the things that didn't make into the report that I just found really
82:03 into the report that I just found really exciting was around this. It was like
82:06 exciting was around this. It was like what how much do people trust AI? And
82:08 what how much do people trust AI? And we're using a very strange definition of
82:10 we're using a very strange definition of trust, which is to what degree can I
82:12 trust, which is to what degree can I predict how the other party will act and
82:13 predict how the other party will act and react, right? Because the more you trust
82:15 react, right? Because the more you trust the other party, right, you can give
82:16 the other party, right, you can give them bigger requests, you can use fewer
82:18 them bigger requests, you can use fewer words, you have less need for feedback,
82:20 words, you have less need for feedback, right? It's the whole notion of finger
82:21 right? It's the whole notion of finger spits and fuel, right? Like you know,
82:23 spits and fuel, right? Like you know, how many of the 10,000 hours that
82:25 how many of the 10,000 hours that requires to be good at anything have you
82:26 requires to be good at anything have you used to get good at AI? And one of the
82:29 used to get good at AI? And one of the stunning findings was that it's this
82:31 stunning findings was that it's this line. So on the x-axis is how long have
82:33 line. So on the x-axis is how long have you been using AI tools? Y is how much
82:36 you been using AI tools? Y is how much do you trust it? Right? And the longer
82:37 do you trust it? Right? And the longer you use AI, right, the more you trust
82:40 you use AI, right, the more you trust it, right? So every every person who
82:41 it, right? So every every person who says, "I tried it and it's terrible at
82:43 says, "I tried it and it's terrible at coding," right? On what basis did they
82:46 coding," right? On what basis did they make that conclusion after maybe using
82:48 make that conclusion after maybe using for an hour or two? And what this shows
82:51 for an hour or two? And what this shows us is that uh you know it requires
82:52 us is that uh you know it requires practice, right? And this is probably a
82:54 practice, right? And this is probably a teachable skill. Um so length of time on
82:58 teachable skill. Um so length of time on the x-axis is a very incomplete
83:00 the x-axis is a very incomplete expression, right? It's like frequency
83:01 expression, right? It's like frequency and intensity and how many hours, but
83:03 and intensity and how many hours, but it's there's signal there. So it just
83:05 it's there's signal there. So it just shows that uh you know part of your job
83:07 shows that uh you know part of your job is to help other people have the aha
83:09 is to help other people have the aha moment and then help them you practice
83:11 moment and then help them you practice right so they get very very good at it
83:13 right so they get very very good at it so they can use every one of these
83:14 so they can use every one of these amazing technologies to achieve their
83:17 amazing technologies to achieve their goals. So uh I'll leave you with one
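The chart described (x: time using AI tools, y: trust) is a simple positive association, which can be sketched as a least-squares slope. The numbers below are synthetic, made up purely for illustration; they are not DORA data.

```python
# Illustrative only: synthetic points shaped like the described finding
# (trust rising with months of AI-tool use). NOT actual DORA data.
from statistics import mean

months = [1, 3, 6, 12, 24]          # x: time using AI tools
trust  = [2.0, 2.5, 3.1, 3.8, 4.4]  # y: self-reported trust (1-5 scale)

# Ordinary least-squares slope: cov(x, y) / var(x)
mx, my = mean(months), mean(trust)
slope = (sum((x - mx) * (y - my) for x, y in zip(months, trust))
         / sum((x - mx) ** 2 for x in months))

print(f"slope = {slope:.3f} trust points per month")  # positive => trust grows
```

A positive slope is the whole finding: the relationship between exposure and trust points upward, which is why an hour-or-two trial is a poor basis for judging the tools.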
83:20 goals. So uh I'll leave you with one last kind of vision. Stephen and I we
83:22 last kind of vision. Stephen and I we did a vibe coding workshop for leaders
83:24 did a vibe coding workshop for leaders um back six weeks ago and what was
83:27 um back six weeks ago and what was amazing to me was in the 3 hours we had
83:31 amazing to me was in the 3 hours we had a 100% completion rate. Everyone built
83:33 a 100% completion rate. Everyone built something, you know, they built a data
83:34 something, you know, they built a data visualization tool. In fact, uh one
83:36 visualization tool. In fact, uh one person uh built a an iOS app and another
83:40 person uh built a an iOS app and another person actually got it into the review
83:41 person actually got it into the review queue in the Apple iOS app store, right?
83:44 queue in the Apple iOS app store, right? Which is which is absolutely
83:45 Which is which is absolutely astonishing. Uh and here's a guy named
83:47 astonishing. Uh and here's a guy named Roger Safner. He said, "I used to be a C
83:50 Roger Safner. He said, "I used to be a C MVP way back in the day. I haven't coded
83:52 MVP way back in the day. I haven't coded in 15 years." Uh and he's showing off an
83:55 in 15 years." Uh and he's showing off an app that helped him automate the process
83:57 app that helped him automate the process of getting checked in to Southwest
83:58 of getting checked in to Southwest Airlines until the bot detection tools
84:00 Airlines until the bot detection tools cut him off. But look at look at the
84:02 cut him off. But look at look at the expression on his face. And so I think
84:03 expression on his face. And so I think uh what we're seeing is like what
84:04 uh what we're seeing is like what happens when support ships right and
84:07 happens when support ships right and support codes and ships when leaders
84:08 support codes and ships when leaders code and ship. And there's no doubt in
84:09 code and ship. And there's no doubt in my mind that this will reshape uh
84:11 my mind that this will reshape uh technology organizations. If you're one
84:13 technology organizations. If you're one of those, Stephen and I want to talk to
84:14 of those, Stephen and I want to talk to you, right? Because you are on the
84:15 you, right? Because you are on the frontier of something really, really
84:17 frontier of something really, really important. I'll share with you a couple
84:18 important. I'll share with you a couple quotes. Here's a technology leader. When
84:20 quotes. Here's a technology leader. When I told my team that I wrote an app that,
84:22 I told my team that I wrote an app that, you know, an AI wrote 60,000 lines of
84:24 you know, an AI wrote 60,000 lines of code and I haven't looked at any of it,
84:26 code and I haven't looked at any of it, they all looked at me as if they wished
84:27 they all looked at me as if they wished I were dead.
84:30 Another: "We've had these stupid problems in legacy applications that have been there for over a decade. We got a group of senior engineers together, we used AI to generate a fix, we submitted the PR, and the team accepted it. Unlike the time when they were told it was AI-generated and rejected it as AI slop." So this is maybe happening in your organizations. Another: "Our code velocity is so high that we've concluded we can only have one engineer per repo, because of merge conflicts. We haven't figured out the coordination-cost mechanism yet." And so all of these were some of the lessons that went into the vibe coding book. Thank you to everyone who was at the signing yesterday. And if you're interested in any of the talks we referenced, the excerpts from our book, or basically any of the links in this presentation, just send an email to realgenekim.com with the subject line "vibe" and you'll get an automated response in a minute or two. So with that, Steve and I thank you for your time, and we're around all week. Thanks, all. [applause]
85:35 [music]
85:37 >> Ladies and gentlemen, please welcome back to the stage Alex Lieberman.
85:41 >> Let's give it up again for Steve and Gene, and also for the rest of the speakers from the morning session. Whether you're watching in person, on YouTube, or on the AIE site, you've been breaking a mental sweat. So we're going to take a 30-minute break: get some grub, get some coffee, recharge, and we'll see you back here at 11. Thanks, everyone. Appreciate it.
86:04 [applause]
86:25 Two flames lit the darkness, burning side by [music and singing] side. Both
86:27 side by [music and singing] side. Both sworn to creation. Both relentless in
86:30 sworn to creation. Both relentless in their stride. One walked through the
86:33 their stride. One walked through the mountains, [music] one soared across the
86:35 mountains, [music] one soared across the void. Both chasing the horizon of the
86:39 void. Both chasing the horizon of the worlds they would deploy. [music]
86:41 worlds they would deploy. [music] But the path is not a straight line. And
86:44 But the path is not a straight line. And the future is not flat. Some rules bend
86:47 the future is not flat. Some rules bend through [music] space time and some
86:49 through [music] space time and some break [singing] on impact. Effort is a
86:53 break [singing] on impact. Effort is a kingdom. [music]
86:54 kingdom. [music] Leverage is the key. One builds the
86:56 Leverage is the key. One builds the throne by hand. One shapes [music]
86:59 throne by hand. One shapes [music] reality.
87:07 There is a curvature of time. Not [music] a race, not a throne, but a
87:10 [music] a race, not a throne, but a shift in the dimension of how progress
87:13 shift in the dimension of how progress becomes known. [music] When the universe
87:16 becomes known. [music] When the universe is standing to the will inside the mind,
87:19 is standing to the will inside the mind, you don't win by moving faster. You win
87:23 you don't win by moving faster. You win by [music]
87:24 by [music] breaing
87:26 breaing time.
87:39 Holes of the past try to drag the present [music] down.
87:42 Systems built on dust [singing] wearing yesterday as
87:44 [music] crown. Some are pulled beneath
87:47 them, [singing] fighting gravity alone.
87:50 Others learn to map the edges and escape
87:54 event horizons. Not all power [music]
87:57 is struggle. [singing] Not all mastery
87:59 is pain. The ones who change direction
88:02 rewrite the laws of the game. You can
88:06 live your life in [music and singing]
88:07 labor or an impact that compounds. Every
88:11 second can be linear or worth a thousand
88:14 rounds.
88:21 There is a curvature of time, [music] not a race, not a throne, but a shift in
88:25 the dimension of how [music] progress
88:27 becomes known. When the universe is
88:30 bending to the will inside the mind, you
88:34 don't win by moving [music] faster. You
88:37 win by bending
88:40 [music] time. [singing]
88:53 The future isn't [music] distant. It accelerates [singing] for those who
88:55 wield the tools of power. Instead of
88:58 fighting with their goals, mastery is
89:01 leverage, [music] not a sentence carved
89:03 in stone. The horizon does not move
89:06 [singing] unless you. [music]
89:16 There is a curvature of time [music] where the present multiplies, where a lifetime
89:19 holds a legacy that no clock [music] can
89:22 quantify. Not by force, not by fury, but
89:26 by evolution inside. We become eternal
89:30 beings when [music] we synchronize
89:34 with
90:18 Footsteps, they fade, but they never die. Shadows stretch across the sky.
90:29 A whisper grows into a [singing] roar. Do you [music] feel it? Do you want
90:32 more?
90:42 Every heartbeat a stone [music] in the stream.
90:49 Ripples [music and singing] chasing an endless dream.
91:00 What we do in life [music] echoes in eternity.
91:03 Every spark ignites [music]
91:41 >> [music]
91:44 >> Reach out to the [singing] empty air. Trace the stars like they're waiting
91:46 there.
91:48 [music]
91:54 The clock ticks, but the moment stays. Forever starts in a single [singing]
91:56 prayer.
92:32 >> Every heartbeat a [music] stone [singing] in the stream.
92:43 Ripples [music and singing] chasing an endless dream.
92:55 What we do in life echoes in eternity. Every spark ignites a fire that will
92:58 never cease. [music]
93:50 >> Shadows [music] crawl where the light won't stay.
93:54 The echo whispers don't [music] look
93:57 away.
93:59 Heartbeat racing louder than my doubt.
94:03 Scream inside. I can't let
94:06 [music and singing] out. But I won't
94:08 fall. I won't drown in the storm all
94:12 around.
94:19 Fear of the [music] mind. I won't let you in. It creeps like a
94:23 ghost, [music] but I keep it within
94:27 the mind.
94:29 I'm breaking [music] the chain.
94:55 >> [music]
94:59 >> Cold winds howl, but they won't define me. The cracks in my soul let the light find
95:03 [music] me. Every step I take the ground
95:06 fights back. But I'm the fire. I'm the
95:08 spark. I'm the attack. [music]
95:16 I won't freeze. I won't fade. Through the chaos I've remained.
95:23 [music]
95:28 Fear is a killer. I won't let it win. It creeps like a ghost, but I keep it
95:31 within.
95:33 Fear is a killer. I'm breaking [music]
95:36 the chain. Don't for
96:10 >> [music] [singing]
96:19 >> I hear the static in the [singing]
96:21 night. It calls.
96:24 A whisper [music] rising,
96:27 breaking through the walls.
96:30 [music]
96:32 Electric echoes in my veins. Stay home.
96:36 Chasing the shadows where [music] the
96:39 wild ones run.
96:47 The air is still, the weight is [music and singing] gone. Close your
96:49 eyes. The past is done.
96:59 Free your mind. Let it go. Let [music] it break the chains. Heat. Heat.
97:43 >> Waves come crash against the sky. Friends of a dream. [music]
97:46 I see them inside.
97:53 Gravity is a [music] story. We don't need a weather thunder where the speed.
97:58 [music]
98:03 The air is still, the weight is gone. Close
98:07 your eyes. The [singing] past is done.
98:15 Free your mind. [music] Let it go. Let
98:18 it break the chain. Heat. Heat.
98:33 Heat. [music] Heat.
98:44 Heat [music]
99:26 [music] they said the stars don't change
99:28 [singing] their course, but I've been
99:30 running from their force. A mirror
99:34 cracks, but still it [music and singing]
99:36 shows. The fire is mine. It's mine to
99:40 hold. I hear the echoes call [music]
99:43 my name,
99:50 but I'm not the shadow. [music] I'm not the same.
99:59 You are who you choose to be. The stars [music]
100:02 are the history. Every breath, every heartbeat.
100:05 [music]
100:38 [music] >> of thorns, a sky of glass. I've walked
100:42 through both. I've let them [singing]
100:44 pass. The weight [music] is heavy, but
100:47 I've grown. The voice I hear is now my
100:51 own. I see [music] the light
101:00 [music] change.
101:02 Heat. Heat. Heat.
101:57 >> I see the lines [music and singing] drawn in the sand. The map of chaos in
102:00 mind.
102:09 Every step a [music] choice. Every beat a voice. The clock ticks
102:11 louder. But I stand. [music]
102:19 Close my eyes [singing] and feel it burn. Every [music] failure, every turn.
102:23 It's fuel for the fire inside. [music]
103:05 The air is heavy. It doesn't break. A thousand whispers in its wake.
103:16 Each breath [music] a climb. Each fall a sign, but I am more than I
103:20 can take. [music]
103:27 Close my eyes and feel it burn. Every failure, every turn is fuel for the
103:31 fire inside.
103:34 [music]
104:03 >> No, no
104:07 [music] heat.
104:41 [music] >> The clock keeps ticking loud and clear.
104:57 I've been waiting for the light. [music]
104:58 Holding breath through endless night.
105:06 [music] The air is shifting. Feel it break.
105:08 A single spark is all it [singing]
105:12 takes.
105:14 It starts today.
105:17 It starts today.
105:20 No more [music]
105:21 running. No delay.
105:25 The world is spinning in my hands.
105:30 It [music] starts,
105:33 it starts today.
105:58 Every choice I made my own. [music] I see the dawn breaking through.
106:12 The air is shifting. [music] Feel it rain.
106:15 A single spark is all it takes.
106:21 It starts today. It starts.
106:52 [music] >> Heat up here.
107:29 Oh. [music]
107:39 >> [music]
107:44 >> Fire in my chest is burning loud. Ashes fall, but I won't bow. [music]
107:49 I've walked [singing] through the smoke.
107:52 I've tasted the scars. Each step I've
107:56 taken, lit up the stars. [music]
108:00 Let it blaze, let it break. Feel the
108:04 cracks. The ground will shake.
108:12 I'm forged in [singing] flame. [music]
108:15 I'm falling
108:19 the pain they call me [music]
108:26 deep again from Heat. Heat. Heat.
108:28 [music]
108:54 >> [music]
108:59 >> The winds they howl, but I stand still.
109:00 [music] The mountains crumble up my will.
109:05 I'm not the same
109:08 I was before. A shadow of fear. I keep
109:23 Let it blaze. [music and singing] Let it break. Feel the cracks. The ground will
109:26 shake.
109:33 I'm forged in flame. [music] Heat. Heat. Heat.
109:55 Heat. [music]
110:10 [music] Heat.
110:44 Shadows melt in the growing light. Time bends and twists. We feel it start.
110:53 A pulse [music] a spark [singing] an open heart.
110:57 Do you feel it? Feel it rise.
111:06 The weightless fire in the sky.
111:12 >> [music] >> has come.
111:15 We're running to the sun.
111:38 electric in the trees that
111:42 [music]
111:43 stars collide, but we stay warm.
111:48 The past dissolves [music]
111:51 like waves on storm.
111:55 We stand together
112:40 The rush, [music] the fun, the everything.
112:45 A new age
112:47 has come.
112:49 We're running to the [music] sun. No
112:52 chains, no walls, just
112:56 with me. [music]
113:14 >> [music] >> Heat up
114:12 [music] here.
114:37 >> up [music]
114:55 Heat [music]
115:59 here.
116:54 Heat.
117:22 >> [music]
117:57 Heat. Heat.
118:06 Heat. Heat.
118:47 >> Heat. Heat. [music]
120:02 >> Heat. Heat.
120:35 >> Ladies and gentlemen, please welcome back to the stage Alex Lieberman.
120:43 Let's uh keep it going for the morning speakers. [music] Amazing job from
120:45 everyone who spoke earlier. I asked
120:47 before who thought they came from the
120:49 furthest place on Earth to watch this
120:52 in person. And where's New Zealand
120:54 again? I don't know. New Zealand. There
120:55 we go.
120:56 >> From Bulgaria.
120:57 >> Bulgaria. Still, I think closer than New
121:00 Zealand, but still very far.
121:02 >> Australia via New Zealand.
121:03 >> Australia via New Zealand. We just got
121:05 someone to one-up New Zealand. I have
121:07 another quick question since we just
121:09 came back from a coffee break. Also, if
121:10 you're watching live on YouTube, you can
121:12 comment. Who thinks they're the most
121:13 caffeinated right now? Who thinks
121:16 they're the most caffeinated in the
121:17 room? How many cups of coffee? I'm at four
121:19 right now. Anyone beat four? Oh, we got
121:22 four. We got a five, maybe. Wow,
121:24 impressive. Well, we are back for an
121:26 incredible next block of sessions. We're
121:28 going to be covering everything from
121:29 future-proofing uh coding agents to
121:32 moving away from agile, how to quantify
121:35 AI ROI in software engineering, the
121:38 state of AI code quality, hype versus
121:41 reality, and MiniMax M2. But I am so
121:44 excited to kick off this next block of
121:47 talks with OpenAI. Please welcome to the
121:50 stage Bill Chen and Brian Fioa from the
121:53 Applied AI team at OpenAI. Let's hear it
121:55 for them. [applause]
122:15 >> Hello everyone. Um, today we'll be talking about how to build coding
122:16 agents.
122:18 And uh, I'm Bill. I work on the applied
122:21 AI startups team at OpenAI.
122:23 >> And I'm Brian. I work with Bill on the
122:26 OpenAI startups team,
122:27 >> and we specifically uh focus on
122:30 building coding agents here at OpenAI.
122:33 Um, yeah, so why are we giving this
122:36 talk? Why are we talking
122:39 about coding agents? Well, it's really
122:41 quite interesting because it's been
122:43 booming for the past year,
122:45 actually. If you think about
122:47 it, it's not that much time ago; it's only
122:49 been a year or so. The ground keeps
122:52 shifting under the harness,
122:54 under the coding agents. But if you think
122:56 about it, why it's
122:58 interesting is because it's really a
123:00 signal of how close we are to AGI.
123:02 Software engineering can be seen as a
123:04 universal medium for problem solving.
123:07 But because the ground is shifting so
123:08 fast, we kept having to rebuild the
123:11 agent on top of the model whenever a new
123:13 model is released. And today we're going
123:15 to talk a little bit about how we might
123:17 be able to get around that.
123:21 be able to get around that. So, here's what we're going to go over
123:22 So, here's what we're going to go over today. We'll start with the anatomy of a
123:25 today. We'll start with the anatomy of a coding agent, especially going into the
123:27 coding agent, especially going into the details of models and harnesses and how
123:29 details of models and harnesses and how they work together. We'll share some
123:31 they work together. We'll share some lessons that we learned from putting
123:33 lessons that we learned from putting them together ourselves. And we're
123:36 them together ourselves. And we're specifically going to talk about codeex
123:37 specifically going to talk about codeex here, which is our own coding agent.
123:40 here, which is our own coding agent. We'll talk a little bit about emerging
123:41 We'll talk a little bit about emerging patterns that we're seeing from all of
123:44 patterns that we're seeing from all of you for using agents like Codeex in your
123:46 you for using agents like Codeex in your own products. And lastly, we'll talk a
123:48 own products. And lastly, we'll talk a little bit about what to expect from
123:51 little bit about what to expect from Codeex in the future so that you can
123:53 Codeex in the future so that you can build along with us if you want to.
124:01 To start, let's talk a little bit about
124:04 what makes a coding agent an agent as a
124:07 whole. Um, it really is quite simple. I
124:09 think, you know, people kind of
124:10 overcomplicate things a little bit these
124:12 days. It's made out of three parts:
124:15 a user interface, a model, and a
124:17 harness, right? Uh, the interface is quite
124:21 self-explanatory. It could be a
124:24 CLI tool, it could be an
124:26 integrated development environment, or it could
124:30 also be a cloud or background agent. Um,
124:32 models are also quite self-explanatory:
124:35 the latest and greatest, like the GPT-5.1-Codex-
124:38 Max that we just released yesterday,
124:41 uh, or the GPT-5.1 series of models, or
124:45 models from other providers as
124:47 well. And the harness uh is a little bit
124:50 more of an interesting part. This is the
124:52 part that directly interacts with the
124:54 model. Uh, in the most reductive way, you
124:56 can sort of think of it as a collection
124:58 of prompts and tools combined in a core
125:01 agent loop, which provides inputs to and
125:03 outputs from the model. Uh, this last
125:07 part will be our focus for today.
125:16 As touched on a bit earlier, coding is
125:17 one of the most active frontiers in
125:20 applied AI, and with how models are
125:22 constantly getting released, what's
125:24 not making the problem uh easier for
125:26 everybody
125:30 is that people have to constantly adapt uh the agents to the new models.
125:39 So, um, Bill's done a great job of giving us an overview of coding agents,
125:41 what they're made up of. So, let's zoom
125:44 in a little bit on the harness. Um,
125:47 turns out that's a little bit tricky.
125:49 So, what is a harness? A harness is
125:52 really the interface layer to the model.
125:54 It's the surface area the model uses to
125:57 talk to users and the code and perform
126:00 actions with tools. It's made up of all
126:03 of the pieces that the model needs to
126:06 work over many turns, call tools, and
126:09 really write code for you and
126:11 interpret what the user is actually
126:13 asking. Um, for some, the
126:16 harness might actually be the special
126:18 sauce of the product. But as we're going
126:21 to go into a little bit more, it's
126:22 really challenging work to build a good
126:25 harness. And we'll talk about how we did
126:28 that.
126:30 So let's see what are some of these
126:32 challenges. Um, just to name a few: A/V is
126:37 one. [laughter]
126:39 Um, your brand-new, innovative custom tool
126:42 that you're giving to your agent might
126:44 not actually be something the model
126:45 is used to using. It may not have
126:48 ever seen that tool before in training.
126:50 And even if it has, you need to spend
126:52 time tuning your prompt to that
126:54 particular model and the habits that it
126:57 comes with.
126:59 And new models are coming out all the
127:01 time. What about latency? Like, does the
127:04 model take a while to think about
127:06 certain things? Which things do you
127:08 prompt it not to? How do you expose the
127:10 UX of what a thinking model is doing
127:13 while it's thinking? Is it communicating
127:15 with you while it's thinking, or do you
127:17 have to summarize it? Managing the
127:19 context window and compaction can be
127:22 really challenging. We just launched
127:24 Codex Max, which does that out of the box
127:26 for you, so you don't have to worry about
127:28 compaction and context window
127:30 management. It's really hard to do. Um,
127:33 and so if you were to do it yourself,
127:36 have fun. Um, and then also the
127:38 APIs keep changing, right? So we have
127:40 Completions, we have Responses, we have
127:41 whatever else is coming in the future.
127:44 What does the model know how to use
127:46 to get the most intelligence out of
127:48 the box?
127:50 And so
127:52 this is the interesting part. Fitting a
127:54 model into a harness takes a lot of
127:57 prompting.
127:59 It turns out that how the model is
128:00 trained has side effects.
128:04 I like to think about it this way:
128:07 intelligence plus habits. Intelligence:
128:10 what is the model good at? What
128:13 languages does it know really well? What
128:15 are its capabilities in terms of
128:18 how well it can write code in
128:20 certain frameworks? And then what habits
128:24 did it learn to use to solve those
128:27 problems? We've trained our models to
128:30 have habits like planning a solution,
128:33 looking around, gathering context, and
128:36 thinking about a problem before
128:38 diving in and writing code, and then
128:40 testing its work at the end.
128:43 Developing a feel for these habits is
128:45 how you become a good prompt engineer.
128:48 If you don't instruct the model in ways
128:50 that it's familiar with, you can have
128:53 problems. We saw this when we launched
128:56 GPT-5. A lot of people who weren't used
128:59 to using our models in coding tried to
129:01 take prompts that existed for other
129:04 models and put them into their harness
129:06 and have GPT-5 follow those instructions.
129:09 And it turned out that we taught our
129:12 model to do some of the things that the
129:14 other models didn't really do out of the
129:16 box. And so when they were prompting
129:18 them to look really hard at the context
129:21 and, like, examine every single file
129:23 before making a code edit, our model
129:27 was being very thorough about
129:29 that, and it was taking a really long
129:31 time, and they weren't seeing the best
129:32 performance. And so we figured out that
129:35 if you let the model just do the
129:37 behaviors that it's used to and don't
129:39 overprompt it, it'll actually perform
129:41 much better. We found out by asking. I
129:44 was literally like, "Hey, I like
129:46 the solution, but it took you a long
129:47 time to get there. What can I do
129:50 differently in your instructions to help
129:52 you get there faster next time?" And
129:53 it literally said, "Uh, you're telling
129:55 me to go look at everything and I don't
129:57 really need to. So that's what's taking
129:59 forever."
130:06 And so you can actually see the advantages of building both the model
130:07 and the harness together, because you
130:09 just, like, know all of that while you're
130:11 building it. And that's why Codex is
130:14 both a model and a harness combined.
130:17 So let's dig deeper into Codex and what
130:20 it can actually do.
130:23 So we built Codex to be an agent for
130:25 everywhere that you code. It's a VS Code
130:27 plugin. It's a CLI. You can call it in
130:30 the cloud from the VS Code plugin or
130:32 from ChatGPT from your phone. Um, and
130:36 it's very basic: you can use it to turn
130:38 your specs into runnable code, starting
130:40 from a prompt, having a plan. It
130:44 navigates your repo to edit files. It
130:46 runs commands, executes tasks, and you
130:49 can call it from Slack, or you can have
130:52 it review PRs in GitHub. So, all of the
130:54 things that you would expect.
130:58 And that means that the harness of Codex needs to be able to do a lot of really complex things. When I talked to a member of the Codex team about this slide and what should be on it, he said it's way harder than you think. [laughter]
131:13 You have to manage parallel tool calls, thread merging, and all of the things involved in that. Think about all the security considerations you have with sandboxing, prompt forwarding, permissions, port management. Compaction is a whole thing, and doing it well is really complex: when do you trigger compaction? When do you reinject? How do you handle cache optimization during that? MCP, right? All of the plumbing you have to build into the harness for MCP support. And that's not even mentioning images, and what resolution you need to compress them to before sending them to the model. All of this is work that you have to do if you're going to build this from scratch and keep it updated as new features come online.
132:02 So we've bundled all of these features together for you in an agent that can safely write its own tools to solve new problems that it encounters. Oops. What we actually have here is a computer use agent for the terminal.
132:28 Wow, that sounds quite a bit more powerful than just a plain old coding agent, doesn't it? But think about it again: before browsers and graphical user interfaces were a thing, wasn't that how we always operated a computer? By writing code and chaining commands together in a command line interface. So that means if you can express your tasks in terms of the command line as well as files, Codex will be able to know what to do.
132:52 As an example, I like to use Codex to organize a lot of the photos from my desktop into a folder, and that's a very simple use case. But it can also analyze huge numbers of CSV files inside of a folder, doing data analysis. It does not have to be a coding task: if it can be accomplished by running tools from the command line, you can use Codex.
133:16 use codeex so now that we see codeex is such a cool
133:18 so now that we see codeex is such a cool harness um I want to also share a little
133:21 harness um I want to also share a little a bit about how you can use it to build
133:23 a bit about how you can use it to build your own agents. And what you can do is
133:26 your own agents. And what you can do is you can use codeex the agent inside of
133:29 you can use codeex the agent inside of your own agent.
133:32 your own agent. Um, how does that work? Well, if you
133:36 Um, how does that work? Well, if you want to build uh a coding uh the next
133:40 want to build uh a coding uh the next coding startup, we don't really have all
133:42 coding startup, we don't really have all the answers, but we do have a few
133:43 the answers, but we do have a few patterns uh that we thought uh might
133:46 patterns uh that we thought uh might help you having worked with some of the
133:48 help you having worked with some of the top coding customers uh like cursor and
133:51 top coding customers uh like cursor and VS code. Uh one of those patterns is uh
133:55 VS code. Uh one of those patterns is uh harness becoming the new abstraction
133:57 harness becoming the new abstraction layer. The benefits of this is quite
134:00 layer. The benefits of this is quite obvious. Um, you no longer have to care
134:03 obvious. Um, you no longer have to care about prioritize uh optimizing the
134:05 about prioritize uh optimizing the prompt and tools with every model
134:07 prompt and tools with every model upgrade.
134:10 upgrade. >> But, um, does that mean you're just
134:11 >> But, um, does that mean you're just building a wrapper?
134:13 building a wrapper? >> Well, I disagree with that take.
134:15 >> Well, I disagree with that take. [snorts] I disagree. I was disagreeing
134:18 [snorts] I disagree. I was disagreeing with my colleague here. Um, just like
134:21 with my colleague here. Um, just like how building rappers on top of models I
134:23 how building rappers on top of models I think is really reductive on uh on the
134:27 think is really reductive on uh on the whole value prop of the infrastructure
134:29 whole value prop of the infrastructure layer. Sorry, I used to be a VC.
134:31 layer. Sorry, I used to be a VC. [laughter]
134:32 [laughter] >> Focusing most of your efforts on
134:34 >> Focusing most of your efforts on differentiating your product is what
134:36 differentiating your product is what this pattern allows you to do. And
134:38 this pattern allows you to do. And that's where most of the value lies.
134:46 Exactly. Okay. So let's look at some of these patterns that we've seen and have actually helped our customers build. Codex is an SDK: it can be called through a TypeScript library, and you can call it programmatically in a Python exec. There's a GitHub Action that you can plug in to have it resolve the merge conflicts on PRs that everybody hates doing. Then you can also add it to the Agents SDK and give it MCP connectors back to your product, so now you have an agent.
135:16 I like to say we started with chatbots that you can talk to. Then we gave the chatbots tools to use, and now you can give your chatbot a tool that can make other tools that it doesn't have. So now you can actually build enterprise software that writes its own plug-in connectors at the API level for each customer, on the spot. That's something a professional services team used to have to do. You have fully customizable software that can now talk back to itself. I made a kanban board for DevDay that can actually fix its own bugs. It's pretty fun.
135:57 And then lastly, you can do something like what Zed has done. They have decided to wrap Codex inside of a layer and give it an interface to the IDE for talking back and forth with the user and making code edits. Now they don't have to do all the work of staying on top of all the things that we're good at doing, and they can focus on building the best code editor.
136:26 Our top coding partners like GitHub have used this to great effect, and we've created an SDK for it that they used to integrate directly with Codex. You can also use the SDK to control Codex as part of your CI/CD pipeline, or use it as an agent that directly interacts with your own agent. If you really want to customize the agent layer, you can do that too. As an example, we worked closely with the Cursor team to get the best performance out of the Codex model. The model, not the agent (we're bad at naming things; the model is different from the agent). They did so by aligning their tools to be in distribution with how the model is trained, and they did that by aligning their harness with our open-source implementation of Codex CLI. All of this is publicly available: you can fork the repo, you can use our source code. Go nuts.
137:29 So what does the future hold for Codex? It hasn't even been out for a year, and especially with the launch of Codex Max yesterday, things are really changing fast. It's the fastest growing model in usage now, serving dozens of trillions of tokens per week, which has actually doubled since DevDay.
137:48 It's always good to build where the models are going. It's safe to assume that the models will get better. They'll be able to get to work on much longer horizon tasks unsupervised. New models will raise the trust ceiling: I trust these models now to do way harder work than I would have six months ago, and that's going to keep increasing. The future is about sprawling code bases and non-standard libraries, knowing how to work in closed-source environments, and matching existing templates and practices. So you can imagine that the SDK will evolve to better support these model capabilities, letting the model learn as it goes and not repeat mistakes, and generally providing more surface area for an agent that writes code and uses a terminal to solve whatever problems it encounters. And you can use that in your products via the SDK.
138:48 So, what have we learned? Harnesses are really complicated and take a lot of work to maintain, especially with all the new models coming out. So we've built one for you inside of Codex that you can use off the shelf, or look at the source if you want to. You can use it to build new things outside of coding, and let us do all of the work of making sure that you have the most capable computer agent.
139:13 And we're really excited to see what you craft.
139:23 [applause] [music] Our next presenters believe that most enterprises are failing to unlock real value from AI because the systems in which they operate are stuck in the past. Here to share how agents are reshaping software delivery are McKinsey partners Martin Harrison and Natasha Mania.
140:00 >> All right, good morning. Hello everyone. It's really great to be here. I'm Martin, and I'm here with my colleague Natasha. We're from a part of McKinsey you may not be as familiar with: we have a practice called Software X, and we work with mostly enterprise clients on how to build better software products, which has meant mostly using AI in the past couple of years.
140:23 Our talk today is really more focused on the people and operating model aspects of leveraging AI for software development. We believe that has changed quite significantly, and that's what we're excited to talk to you about.
140:41 about. If I take a quick step back uh in in
140:43 If I take a quick step back uh in in time and we just uh you know think
140:46 time and we just uh you know think through some of these the major
140:47 through some of these the major technology breakthroughs that we've seen
140:49 technology breakthroughs that we've seen in the last few decades uh they tend to
140:52 in the last few decades uh they tend to always come with a paradigm shift in
140:54 always come with a paradigm shift in also how we develop software and so I
140:57 also how we develop software and so I still recall uh almost 20 years ago now
140:59 still recall uh almost 20 years ago now I started working as a software engineer
141:01 I started working as a software engineer an entry- level developer um in a tech
141:04 an entry- level developer um in a tech company and the company I was working
141:06 company and the company I was working for was just switching to to agile we
141:09 for was just switching to to agile we were using camb boards we were doing uh
141:11 were using camb boards we were doing uh standups and and other ceremonies. This
141:14 standups and and other ceremonies. This was a big change. It was a massive
141:16 was a big change. It was a massive change for the for the company. And now
141:20 change for the for the company. And now with everything that is happening
141:21 with everything that is happening happening in AI, we're at the precipice
141:24 happening in AI, we're at the precipice of another such paradigm shift.
141:27 And if we think about some of the things happening with AI and software development that we've seen at this conference, there's no doubt that this is a new paradigm that is upon us. So we'll talk about two things. We'll first touch a little bit on how you go from what we're seeing in individual productivity to scaling that to a whole team, and what type of changes we think that implies. Then we'll talk a little bit about how you scale that across a whole organization to really get value.
142:08 I'm talking to an audience here that is using AI agents all the time, and if I asked you for examples, I'm sure you could rattle off ten different ones where you would say, "Look, there was this thing I used to do that used to take hours, maybe even days, and it now takes only minutes." There's no shortage of those stories, and you can go over to the expo and talk to any of the companies there about all these great use cases. It really shows that these tools work and can be really impactful.
142:51 And yet, despite seeing some of these improvements, we've done some research to gauge where our clients are at the moment. We recently surveyed about 300 companies, mostly enterprises, on what they are seeing in terms of productivity improvements, and on average they would say they're often seeing only 5, 10, 15% improvement overall as a company. So we're in a place where there's a bit of a disconnect between this big potential around AI and the reality.
143:23 And we think this gap exists because, as we've started implementing AI, whether it's coding assistants or, as you just heard, the way OpenAI is using agents and more complex workflows, a set of bottlenecks has started to emerge that wasn't necessarily there before. For example, as we start moving much faster in certain aspects of the work, we haven't really changed how we collaborate among people and team members, and that's not quite keeping up. We've started generating way more code, but it's still being reviewed in a pretty manual way in many companies. Then there's this theme, recently highlighted even in a research report from Carnegie Mellon, about how all the new code being generated is also amplifying the generation of tech debt in some cases, and actually generating complexity. So there are these bottlenecks. They're not impossible to overcome, but this is what we believe is limiting many companies from seeing the real value that they should be seeing.
144:40 Let me talk about just a couple of examples to make that come to life a little bit more. One of the things we see as a big rate limiter at the moment is how work is allocated. What we've learned over the last couple of years is that the impact from AI and agents is highly uneven. There are some tasks where it works amazingly well today and you see huge improvements, and others where it's not as effective, so you have that variability. You also have variability among people: some have lots of experience using these tools and know how to pick them up, and others are less experienced right now. What that means for team leaders, engineering managers, and so on is that it's highly non-trivial to know how to allocate work and resources in a good way, and this is creating a lot of inefficiencies.
145:34 Another example is how work is being reviewed. Agents are often given pretty fuzzy stories, written in prose with pretty fuzzy acceptance criteria, which means the code that comes back is not always what it was intended to be. And for many companies, the only mechanism to control that is manual review. So you've automated some things, but you've generated more manual review. These are some examples of the bottlenecks that we see coming up.
146:13 As mentioned, what that has resulted in so far is that most large companies today are stuck a little bit in a world of relatively marginal gains. They're working in ways that were developed under the constraints of the past paradigm of human development. If you go out to most companies, you see 8 to 10 person teams working in two-week sprints, all these elements that were largely parts of an agile operating model, and that is putting some limits on what they can see. Over the past year, we've been working with lots of clients to break that model a bit and develop new ways of working: smaller teams, new roles, shorter cycles. And when you do that, we see really great performance improvements, and that's what gives us this path to where we see things are going to improve.
147:18 So we realized that rewiring the PDLC is not a one-size-fits-all solution. For example, different types of engineering functions across the enterprise, along the product life cycle, may require different operating models based on how humans and agents best collaborate. If we take the example of modernizing legacy code bases, this task requires high context, potentially the entire codebase, but also has clearly well-defined outputs. So an example operating model could look like a factory of agents, where humans provide an initial spec and a final review, with minimal intervention.
147:53 For new features, for greenfield and brownfield projects, the operating model may look like an iterative loop, because they may benefit from non-deterministic outputs and increased variation, where agents act as co-creators, providing more options to facilitate faster feedback loops.
148:15 facilitate faster feedback loops. So, as we mentioned, we did a survey
148:17 So, as we mentioned, we did a survey among 300 enterprises globally to
148:19 among 300 enterprises globally to understand what sets these top
148:21 understand what sets these top performers apart. We found that they are
148:23 performers apart. We found that they are seven times more likely to have AI
148:26 seven times more likely to have AI native workflows which meant scaling
148:28 native workflows which meant scaling over four use cases across the software
148:30 over four use cases across the software development life cycle rather than just
148:32 development life cycle rather than just having point solutions for just code
148:34 having point solutions for just code review or for just code dov. They were
148:37 review or for just code dov. They were also six times more likely to have AI
148:39 also six times more likely to have AI native roles which meant having smaller
148:42 native roles which meant having smaller pods with different skill sets and new
148:44 pods with different skill sets and new roles.
148:46 roles. To enable these shifts, these
148:48 To enable these shifts, these organizations were investing in
148:50 organizations were investing in continuous and hands-on upskilling,
148:52 continuous and hands-on upskilling, impact measurement, and also incentive
148:55 impact measurement, and also incentive structures to incentivize developers and
148:58 structures to incentivize developers and PMs to adopt AI.
149:01 PMs to adopt AI. This led to five to six times increase
149:04 This led to five to six times increase in time to market and delivery speed as
149:07 in time to market and delivery speed as well as higher quality and more
149:09 well as higher quality and more consistent artifacts.
149:12 So when we talk about AI-native workflows, we mean that these enterprises are moving away from quarterly planning to continuous planning, and the unit of work is moving from story-driven to spec-driven development, so that PMs are iterating on the specs with agents rather than iterating on long PRDs.
149:33 On the talent side, AI-native roles essentially means that we're moving away from the two-pizza structure to one-pizza pods of three to five individuals. Instead of having separate QA, front-end, and back-end engineers, there are more consolidated roles, where product builders are managing and orchestrating agents with full-stack fluency and a better understanding of the full architecture of their codebase. PMs are starting to create prototypes directly in code rather than iterating on these long PRDs.
150:06 And one example that we've described in our article: we've studied some AI-native startups and realized that they've actually implemented all of these shifts to accelerate their outcomes. In our article, we've described how Cursor actually operates internally.
150:21 But if you're a large enterprise predicated on the agile model, what are some steps you can take? In a recent client study with a leading international bank, we tested some team-level interventions to address the bottlenecks mentioned before, mainly around the sequencing of steps within the agile ceremonies and how to define the roles of agents and humans within the sprint cycle. So let's walk through some examples.
150:48 First, team leads would assign sprint stories using agents, based on data about the team's velocity and delivery history. They would then co-create multiple prototypes and iterate with agents on the acceptance criteria around security and observability needs, to have more consistent artifacts across teams. This prevents the downstream rework mentioned before, so that developers don't have to constantly be iterating with the agents during the coding process.
151:17 The squads were also reorganized by workflow, so there would be one focused on small bug fixes and another focused on greenfield development. In the background, agents would be used to look at potential cross-repository impacts, to reduce debugging time for developers.
151:42 Another example: to reduce the collaboration overhead and meetings that happen within the sprint cycle, instead of waiting for data scientist input, PMs would directly observe real-time customer feedback to reprioritize features, and this would lead to an acceleration of the backlog within the same amount of time.
152:07 So we studied the impact of these interventions and found highly promising results. For example, not just an increase in agent consumption of over 60 times, but also an increase in delivery speed that was tied directly to the business priorities for this bank. There was a 51% increase in code merges, and also an increase in efficiency.
152:34 The other aspect of this is around the different roles and the talent model. One of the biggest differentiators that we saw, as mentioned, was around whether you have actually changed the roles that are involved in software development. What we're all seeing is that engineers are moving away from execution, from just simply writing code, to being more orchestrators, thinking through how to divide up work among agents, for example. And we also heard some examples of how the role of the product manager is changing.
153:09 And while this may sound pretty straightforward to many of you here who are working with these tools day-to-day, that you have to change what you do, the reality is that about 70% of the companies that we surveyed have not changed the roles at all. So you have this background expectation that people are going to do things differently, but the role is still defined in the same way, with the same understanding as it was a couple of years ago.
153:37 But we are starting to see some companies changing this. Here is another example from a recent client. They were set up in a way that is pretty common for many companies: a typical two-pizza team model, with the types of roles that you would be familiar with. We ran a bunch of experiments with front-runners and tested new models that had much smaller pods, with new roles that consolidated some of the tasks that were previously done by different roles.
154:15 By doing that, we could create basically more pods, or more teams, with the same number of people, while retaining the expectation that each pod performs at about the same level as before.
154:33 And we also see really positive results from that, with the quality of the generated code maintained and in some cases even improved. In particular, there was a high speed-up in terms of the output from the different teams, and you can see some of the metrics here.
154:57 Let's shift gears a little bit and go from talking about just the team level: how does this now scale across a big organization? The reality is that many companies don't just have one or two of these teams, but often hundreds of teams, and thousands or even tens of thousands of people working in this way.
155:16 And this is where one of the biggest differences that we saw, between those that are stuck getting only 10% or so improvements and those seeing outsized improvements, is around how you manage that change. Change management is a bit of a catch-all or elusive term for a lot of different things, but I think in some ways it's not a bad way to think about it. I usually say that change management is about getting a lot of small things right. So the crux of actually scaling this is often about getting 20, 30, or even more things right at the same time: the way you communicate what this means, the way you incentivize people, the way you upskill them. It all has to come together.
156:11 And when it doesn't, we see what happens. This is an example from another tech company that we worked with, where initially we were rolling out new AI tools for them that hit different parts of the product development life cycle. We rolled out the tools; there was some usage, but often it dropped off. The tools were either not used, or used in very suboptimal ways. That's the jagged part that you're seeing on the left-hand side here: despite adding more users, the overall impact did not change at all.
156:47 So we had to do quite a reset and effectively start over, resetting the expectations: what does this mean if you're a developer day-to-day? What does it mean for a PM? We had much more hands-on upskilling: there was bring-your-own-code, and there were coaches available, especially in those first few sprints before you make this a habit and work it into the way that you develop software day-to-day. It's a very critical time, and that's when this matters a lot. And we put a bit of a measurement system in place as well, so you know what's changing and you're able to see what's improving.
157:30 Another example, just to bring this alive a little bit: as mentioned, this is about getting a lot of things right, and each one of these individually may not seem like the biggest deal, but put together they really make a huge difference. These are some of the top interventions that another client had to go through. For them, it really helped to set up code labs, for example, and to institute a new set of certifications that help motivate and drive people to change what they do day-to-day. These things really added up to the change they needed.
158:10 needed. >> But building a robust measurement system
158:12 >> But building a robust measurement system that prioritizes outcomes and not just
158:14 that prioritizes outcomes and not just adoption is important not just to
158:16 adoption is important not just to monitor progress but also pinpoint
158:19 monitor progress but also pinpoint issues and course correct quickly. So,
158:22 issues and course correct quickly. So, one surprising result from the survey
158:24 one surprising result from the survey was that these enterprises that were
158:26 was that these enterprises that were bottom performers were not even
158:27 bottom performers were not even measuring speed and only 10% were
158:29 measuring speed and only 10% were measuring productivity.
158:31 measuring productivity. But our goal is to make our clients top
158:33 But our goal is to make our clients top performing organizations. So, we've
158:35 performing organizations. So, we've worked with them to create a holistic
158:36 worked with them to create a holistic measurement system that captures impact
158:40 measurement system that captures impact all the way down to inputs. So for
158:42 all the way down to inputs. So for inputs this would include the investment
158:45 inputs this would include the investment into coding tools and other AI tools but
158:47 into coding tools and other AI tools but also the time and resources in
158:49 also the time and resources in upskilling and change management. These
158:52 upskilling and change management. These inputs would lead to direct outputs but
158:54 inputs would lead to direct outputs but a lot of organizations are just focusing
158:56 a lot of organizations are just focusing on how the increased breath and depth of
158:59 on how the increased breath and depth of adoption with of AI tools is leading to
159:01 adoption with of AI tools is leading to increased velocity and capac capacity
159:04 increased velocity and capac capacity increase. However, it's also important
159:06 increase. However, it's also important to understand how developers have uh
159:09 to understand how developers have uh different uh NPS scores and if they're
159:11 different uh NPS scores and if they're enjoying their craft more um rather than
159:13 enjoying their craft more um rather than feeling more frustrated. And it's also
159:16 feeling more frustrated. And it's also important to understand whether the code
159:18 important to understand whether the code is becoming more secure and have has
159:20 is becoming more secure and have has better quality but also more resilient.
159:22 better quality but also more resilient. And one proxy for resiliency that we
159:24 And one proxy for resiliency that we used for our client was the meantime to
159:27 used for our client was the meantime to resolve priority bugs.
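The mean-time-to-resolve proxy mentioned here is straightforward to compute from issue-tracker data. A minimal sketch, assuming hypothetical bug records with `opened`/`closed` timestamps and a `priority` label (the field names and data are illustrative, not from the talk):

```python
from datetime import datetime

def mean_time_to_resolve(bugs, priority="P1"):
    """Mean time (hours) to resolve closed bugs of a given priority."""
    durations = [
        (datetime.fromisoformat(b["closed"])
         - datetime.fromisoformat(b["opened"])).total_seconds() / 3600
        for b in bugs
        if b["priority"] == priority and b.get("closed")
    ]
    return sum(durations) / len(durations) if durations else None

bugs = [
    {"priority": "P1", "opened": "2025-01-06T09:00", "closed": "2025-01-06T15:00"},  # 6 h
    {"priority": "P1", "opened": "2025-01-07T10:00", "closed": "2025-01-07T20:00"},  # 10 h
    {"priority": "P3", "opened": "2025-01-08T09:00", "closed": "2025-01-10T09:00"},  # ignored
]
print(mean_time_to_resolve(bugs))  # 8.0
```

Tracked quarter over quarter, a falling value for priority bugs would support the resiliency claim above.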
159:29 Now, if we look at economic outcomes, which are the priority for C-suite executives: they look into the time to revenue target, the increased price differential for higher-quality features, expanding the number of customers to meet the feature demand, and also the cost reduction per pod from reduced human labor. In aggregate, these larger economic outcomes can also help organizations understand how there is increased reinvestment in greenfield and brownfield development. But as these tools evolve, the proxies for these metrics will also evolve. Hopefully, though, this provides a MECE framework as an initial starting point.
160:12 So what's next? The future, of course, is difficult to predict, let alone the next five years. But we hope that, with our vision of a new software development model, even as agents increase in their intelligence and humans become more fluent in AI, this model still stands. So hopefully this model, which includes shorter sprints and smaller teams, but a larger number of teams, will set enterprises up for success in the long term.
160:42 >> So just to leave you with some key takeaways: start now. I would say to our clients, this is a human change, it takes time, it's a big change, and it's going to be a journey, and this is something that everyone needs to go on. I think it's also important to figure out which model works for you and to set a really bold ambition. And with that, thank you so much for listening to us. We have an article here if you're interested in the research that we've conducted. Thank you so much for having us. [applause]
161:28 >> Our next presenter is a researcher at Stanford who studies how AI impacts over 100,000 developers in the real world. Please welcome Yegor Denisov-Blanch.
161:55 So companies spend millions on AI tools for software engineering. But do we actually know how well these tools work in the enterprise, or are these tools just all hype? To answer this, for the past two years we've been researching the impact of AI on software engineering productivity. Our research is time-series, because we look at git historical data, meaning we can go back in time, and it's also cross-sectional, because we cut across companies.
162:19 The way we measure most of the impact is with a machine learning model that replicates a panel of human experts. The way this works is: imagine you have a software engineer who writes a code commit. This commit would be evaluated by multiple panels of 10 to 15 independent experts, who would evaluate it across implementation time, maintainability, and complexity, and then produce an output evaluation. We took the labels of these panels across millions of evaluations and trained a model to replicate this panel of experts, meaning that we can deploy it at scale. And if there's ever any doubt about the model's output, you can always assemble your own panel and see that it correlates pretty well with reality.
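As a toy illustration of the idea — not the speaker's actual system, which trains a model on millions of expert labels — one can imagine mapping numeric commit features to panel scores and predicting a panel-style score for an unseen commit. Here a 1-nearest-neighbour lookup stands in for the trained model, and every feature name and value is invented:

```python
import math

# Training data: (hypothetical commit features) -> panel score, where the
# score stands in for the aggregate of 10-15 independent expert ratings.
train = [
    # (lines_changed, files_touched, cyclomatic_delta) -> panel score
    ((120, 4, 6), 7.5),
    ((15, 1, 1), 9.0),
    ((400, 12, 20), 4.0),
]

def predict_panel_score(features):
    """Predict a panel-style score for a new commit via nearest neighbour."""
    nearest = min(train, key=lambda row: math.dist(row[0], features))
    return nearest[1]

print(predict_panel_score((20, 1, 2)))  # nearest to the small clean commit -> 9.0
```

The real pipeline would replace the lookup with a trained model, but the contract is the same: commit features in, panel-equivalent evaluation out, deployable at scale.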
163:10 Today we'll talk about four things. We'll start off by looking at some of the things that are driving AI productivity gains in software. Then we'll look at an AI practices benchmark that we developed. We'll then look at how we propose to measure AI return on investment in software engineering. And lastly, we'll finish things off with a case study.
163:33 Here we took 46 teams that were using AI, matched them with 46 similar teams that were not using AI, and measured their net productivity gains from AI quarterly. The shaded area is the middle 50% of the data, and the dark blue line is the median, which as of July of this year stands at about 10% for this cohort.
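The median line and middle-50% band described here fall directly out of the per-pair gains. A small sketch with made-up numbers (the study had 46 matched pairs; these eight values are invented):

```python
from statistics import median, quantiles

# Hypothetical quarterly gains: each value is an AI team's productivity
# minus its matched non-AI team's, as a fraction. Invented data.
gains = [0.02, 0.05, 0.08, 0.10, 0.10, 0.12, 0.15, 0.22]

mid = median(gains)                 # the dark median line in the chart
q1, _, q3 = quantiles(gains, n=4)   # the shaded middle 50% is [q1, q3]
print(f"median={mid:.2f}, middle 50%=[{q1:.2f}, {q3:.2f}]")
```

Matching on similar teams first, then differencing, is what makes the gain "net": it controls for team-level factors other than AI use.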
163:58 I'd like to direct your attention to the fact that the discrepancy between the top performers and the bottom ones is increasing; there's a widening gap. And if we, very unscientifically and very illustratively, project this forward, we might get something like this, where the top performers are part of a rich-get-richer effect: these successful early AI adopters might compound their gains, while the strugglers could fall further behind. At some point this is going to converge, and this is very directional. But my point here is that if you're a leader in a company, you definitely need to know which cohort you are in right now, so that you can course correct. And without measuring the impact of AI on your engineers, you're not going to be able to do this.
164:44 to do this. So we started investigating what are
164:46 So we started investigating what are some of the factors that drive these top
164:48 some of the factors that drive these top teams to perform better. And the first
164:50 teams to perform better. And the first thing we looked at is AI usage or
164:52 thing we looked at is AI usage or basically token spent. In this graph you
164:55 basically token spent. In this graph you have the same kind of on the vertical
164:58 have the same kind of on the vertical axis the productivity increase and then
165:00 axis the productivity increase and then on the horizontal one you have the token
165:02 on the horizontal one you have the token usage per engineer per month on a
165:03 usage per engineer per month on a logarithmic scale. And what you can see
165:06 logarithmic scale. And what you can see is that the correlation is quite loose
165:08 is that the correlation is quite loose 20 or so linearly. And there is a bit of
165:11 20 or so linearly. And there is a bit of a death valley effect around the 10
165:13 a death valley effect around the 10 million uh token mark whereby com teams
165:16 million uh token mark whereby com teams that were using that amount of tokens
165:17 that were using that amount of tokens seem to be doing worse than teams that
165:19 seem to be doing worse than teams that were using a bit less tokens. It's very
165:21 were using a bit less tokens. It's very directional but interesting
165:22 directional but interesting nevertheless.
165:24 nevertheless. The conclusion here might be that AI
165:26 The conclusion here might be that AI usage quality matters more than AI usage
165:30 usage quality matters more than AI usage value.
165:32 value. We dug deeper and we said well does the
165:35 We dug deeper and we said well does the environment in which the engineers work
165:37 environment in which the engineers work impact the productivity from AI and we
165:40 impact the productivity from AI and we came up with an environment cleaniness
165:42 came up with an environment cleaniness index index. It's quite experimental.
165:44 index index. It's quite experimental. It's a composite score that looks at
165:46 It's a composite score that looks at tests looks at uh types at documentation
165:49 tests looks at uh types at documentation and at modularity and at code quality.
165:52 and at modularity and at code quality. And that index is on the bottom axis
165:54 And that index is on the bottom axis here from 0 to one. And then on the
165:56 here from 0 to one. And then on the vertical axis once again you have the
165:57 vertical axis once again you have the kind of productivity lift relative to
165:59 kind of productivity lift relative to teams not using AI. And so what you can
166:02 teams not using AI. And so what you can see is that there's a40 R squar meaning
166:05 see is that there's a40 R squar meaning a pretty decent correlation around
166:07 a pretty decent correlation around environment cleanliness and gains from
166:10 environment cleanliness and gains from uh AI or productivity gains from using
166:13 uh AI or productivity gains from using AI. And so the takeaway here is to
166:15 AI. And so the takeaway here is to invest in codebased hygiene to unlock
166:18 invest in codebased hygiene to unlock these AI productivity gains.
166:21 these AI productivity gains. We dug deeper to illustrate this
166:23 We dug deeper to illustrate this concept. And here we have on this graph
166:26 concept. And here we have on this graph on the vertical axis the percentage of
166:28 on the vertical axis the percentage of tasks that might uh be able to be
166:31 tasks that might uh be able to be completed by AI based on three colors.
166:33 completed by AI based on three colors. And so green means that AI can do most
166:36 And so green means that AI can do most of the work for that task in that
166:37 of the work for that task in that sprint. Yellow means that AI can help
166:40 sprint. Yellow means that AI can help someone and red uh means that AI is not
166:43 someone and red uh means that AI is not very useful. And this is quite
166:44 very useful. And this is quite illustrative but it it conveys the
166:46 illustrative but it it conveys the point. And so then any code base at any
166:48 point. And so then any code base at any point in time sits on a vertical line
166:50 point in time sits on a vertical line across this graphic. And what you can
166:52 across this graphic. And what you can see is that clean code amplifies AI
166:55 see is that clean code amplifies AI gains.
166:56 gains. Secondly is that you need to manage your
166:59 Secondly is that you need to manage your codebase entropy, right? Your codebase
167:01 codebase entropy, right? Your codebase tech debt because if you just use AI
167:04 tech debt because if you just use AI unchecked, this is going to accelerate
167:06 unchecked, this is going to accelerate this entropy which is going to push and
167:08 this entropy which is going to push and degrade your cleaniness to the left kind
167:10 degrade your cleaniness to the left kind of right. And then you as as a human
167:12 of right. And then you as as a human need to push on the other side to kind
167:14 need to push on the other side to kind of improve or maintain that cleanliness
167:16 of improve or maintain that cleanliness to keep reaping the benefits from AI.
167:20 to keep reaping the benefits from AI. Thirdly is that it's important that
167:21 Thirdly is that it's important that engineers need to know when to use AI
167:24 engineers need to know when to use AI and when not to use AI. And what happens
167:26 and when not to use AI. And what happens when they don't is this kind of line on
167:29 when they don't is this kind of line on the left whereby you have AI AI outputs
167:32 the left whereby you have AI AI outputs that are rejected or need heavy
167:34 that are rejected or need heavy rewriting which then leads to engineers
167:37 rewriting which then leads to engineers losing trust in AI saying okay this just
167:39 losing trust in AI saying okay this just doesn't work. I'm not going to use it.
167:40 doesn't work. I'm not going to use it. Which then further collapses your AI
167:42 Which then further collapses your AI gains.
167:51 Now, we asked: can we look not only at usage but at how these companies and these engineers are using AI? And we came up with an AI engineering practices benchmark. The way this works is that we can scan your codebase and detect AI fingerprints or artifacts: basically, traces of how your team is using AI. It's quite directional at this point, but evolving. We can quantify this based on the percentage of your active engineering work that uses each AI pattern, and then we repeat this monthly using git history. The way this works is, more or less, that you have a few levels. Level zero might be where humans are just not using AI and write all of the code. Level one is personal use, where engineers are not sharing prompts across the team, or not versioning them. Level two is team use, whereby teams are sharing these prompts and rules. Level three is even more sophisticated: it's where AI autonomously does specific tasks, though maybe not the entire workflow. And level four is agentic orchestration, which is where AI just runs the entire process. This is going to be an open-source tool which you can leverage if you sign up on our research portal.
169:05 sweeper research portal. We applied this benchmark to one of the
169:08 We applied this benchmark to one of the companies in our research data set and
169:10 companies in our research data set and we saw this this company had two
169:12 we saw this this company had two business units with equal access to AI
169:15 business units with equal access to AI tools, right? Same licenses, same spend,
169:18 tools, right? Same licenses, same spend, same tools, same everything. But the
169:20 same tools, same everything. But the adoption rate and the usage rate was
169:22 adoption rate and the usage rate was very different by business unit. On the
169:24 very different by business unit. On the left, the first business unit, you can
169:27 left, the first business unit, you can as you can see in the area in the blue,
169:28 as you can see in the area in the blue, seemed to be using AI a lot more for
169:31 seemed to be using AI a lot more for almost 40% of their work. Whereas on the
169:34 almost 40% of their work. Whereas on the on the uh right, the second business
169:36 on the uh right, the second business unit seem to struggle behind a bit more.
169:39 unit seem to struggle behind a bit more. And so the takeaway here is that access
169:41 And so the takeaway here is that access to AI and even AI usage doesn't mean or
169:45 to AI and even AI usage doesn't mean or doesn't guarantee that that AI is going
169:48 doesn't guarantee that that AI is going to be used in the same way across a
169:50 to be used in the same way across a company.
169:52 company. As a leader, you really want to be
169:53 As a leader, you really want to be understanding not just whether they're
169:55 understanding not just whether they're using but also how your engineers are
169:57 using but also how your engineers are using AI.
170:03 Great. Now let's dive into how we actually measure AI return on investment in software engineering.
170:14 There we go. Okay. So here, ideally, we would be measuring this based on business outcomes, right? I give my engineers AI, and then I make more money: more revenue, net revenue retention, whatever business KPI you want to track. The problem is that there's too much noise between the treatment, giving AI, and the result, which is the business outcome. On top of this, there are confounding variables, such as your sales execution, the macro environment, your product strategy. And therefore, although that would be ideal, unfortunately I think we need to find alternative paths, and the most logical one is to simply look at the engineering outcomes, because there is a clear signal there. But here we need to go beyond measuring AI usage into measuring engineering outcomes. There are a few caveats, and this topic is quite heavily discussed, so I want to mention some of them.
171:03 The first is that this assumes that our product function can properly direct that increased capacity into something that generates value. If they aren't directing it, then it's a product problem, which, although it sits quite close to engineering, is slightly different, right?
171:20 The second caveat is that this assumes that engineering is a meaningful bottleneck for value, which frankly it typically is, and that you can guard against Goodhart's law by using a balanced set of metrics, and also by having a good company culture that doesn't weaponize these metrics.
171:36 And thirdly, AI is still very new, and measuring proxy metrics is still better than not measuring. There are going to be winners and losers in this AI race, and progress is better than perfection here. So metrics don't need to be flawless to be useful, is what I want to illustrate.
172:03 So then, here we have the two parts you need to do to get the ROI from AI, right? You need to measure usage, and then you need to measure engineering outcomes. Let's start with usage.
172:14 There are really two buckets for enterprises. There are more in a research environment, but to make it simple, there's access-based and there's usage-based. Access-based is basically looking at when people got access to the tool. Here you can do a pilot group: give that group AI and then compare it to a similar group without AI, or you can measure the same team across time. The problem is that access-based is noisy, and the gold standard is really usage-based, which uses telemetry from the APIs of these coding assistants, right, to give you the right data to know who's using AI and where. The caveat here is that the vendor APIs are different: unfortunately, tools like GitHub Copilot aggregate the data, while other tools like Cursor give you more granular data.
173:01 The big takeaway is that you can measure impact retroactively by using git history. So you don't need to set up an experiment now and wait six months. If you've already adopted AI, you can go back in time and do this. It's quite easy.
173:18 do this. It's quite easy. Now we've seen usage. Let's look into
173:20 Now we've seen usage. Let's look into how do we actually measure engineering
173:21 how do we actually measure engineering outcomes? What are some of the metrics
173:23 outcomes? What are some of the metrics we propose?
173:33 Here we have um our framework which we proposed which is using a primary metric
173:35 proposed which is using a primary metric and a guardrail metric. And so here um
173:37 and a guardrail metric. And so here um the primary metric is engineering
173:39 the primary metric is engineering output. It's not lines of code. It's not
173:41 output. It's not lines of code. It's not PR counts and it's not dura. And it's
173:43 PR counts and it's not dura. And it's basically based on this machine learning
173:45 basically based on this machine learning model that replicates the panel of
173:46 model that replicates the panel of experts, right? And the second set of
173:48 experts, right? And the second set of metrics are the guard ones which you
173:51 metrics are the guard ones which you want to maintain at a healthy level but
173:53 want to maintain at a healthy level but you don't want to maximize. It doesn't
173:54 you don't want to maximize. It doesn't make sense to maximize them truly. And
173:57 make sense to maximize them truly. And so then there's three categories within
173:59 so then there's three categories within the guardrail ones rework and
174:01 the guardrail ones rework and refactoring quality tech and risk and
174:03 refactoring quality tech and risk and then people and devops. The third bucket
174:05 then people and devops. The third bucket is important to highlight that these are
174:07 is important to highlight that these are not productivity metrics. They're useful
174:09 not productivity metrics. They're useful but you cannot just kind of use them
174:11 but you cannot just kind of use them like maximize them to maximize developer
174:13 like maximize them to maximize developer productivity. They kind of fall off at
174:15 productivity. They kind of fall off at some point. And so the goal here might
174:17 some point. And so the goal here might be to keep your guardrail metrics
174:18 be to keep your guardrail metrics healthy while increasing the primary
174:20 healthy while increasing the primary metric to whatever degree possible.
174:24 metric to whatever degree possible. Now let's dive into a case study. Here
174:28 Now let's dive into a case study. Here we worked with
174:31 we worked with a company that uh large enterprise. We
174:34 a company that uh large enterprise. We took a team of uh 350 people under a
174:36 took a team of uh 350 people under a vice president and we measured pull
174:38 vice president and we measured pull requests. The reason we did this is to
174:41 requests. The reason we did this is to illustrate that you cannot measure pull
174:43 illustrate that you cannot measure pull requests to understand whether AI is
174:45 requests to understand whether AI is helping you. And so here this team
174:47 helping you. And so here this team adopted um AI in May of this year and we
174:49 adopted um AI in May of this year and we measured the four months before four
174:51 measured the four months before four months after. We saw a 14% increase.
174:54 months after. We saw a 14% increase. Great. That's fantastic. But what about
174:56 Great. That's fantastic. But what about reviewer burden? What about code
174:58 reviewer burden? What about code quality? So we measured code quality.
175:01 quality? So we measured code quality. And here what we saw is um I mean
175:04 And here what we saw is um I mean firstly actually code quality think of
175:06 firstly actually code quality think of it as maintainability scale from 0 to
175:08 it as maintainability scale from 0 to 10. And uh there's kind of these bands.
175:11 10. And uh there's kind of these bands. Uh it uses our our methodology. You can
175:13 Uh it uses our our methodology. You can read it online. But basically what you
175:16 read it online. But basically what you see is that in the preAI period their
175:18 see is that in the preAI period their code quality was quite stable and
175:20 code quality was quite stable and consistent. And once they adopted AI,
175:22 consistent. And once they adopted AI, two things happened. Code quality
175:23 two things happened. Code quality decreased and then code quality became
175:25 decreased and then code quality became more erratic.
175:31 Next, we took a look at our metric, which is engineering output. It's not lines of code. Here, for every month, you see the sigma, the sum of the output delivered that month, broken down into four buckets: rework, refactoring, added, and removed. Rework is when you're changing or editing code that's still fresh, so it's recent. Refactoring is when you're changing code that's a bit older. Added and removed are pretty self-explanatory. You can also see these benchmarks, so we can benchmark this company against similar companies in their industry. And here AI usage had two effects: firstly, rework went up by 2.5 times, which is really bad, and effective output, which is kind of a proxy for productivity, didn't really change.
176:15 so didn't really change and so then what's the conclusion here
176:17 and so then what's the conclusion here let's do a recap app. So we saw that PRs
176:20 let's do a recap app. So we saw that PRs went up by 14%. But this is inconclusive
176:23 went up by 14%. But this is inconclusive because more PRs doesn't mean better. We
176:26 because more PRs doesn't mean better. We saw that code quality decreased by 9%
176:28 saw that code quality decreased by 9% which is problematic. We saw that
176:30 which is problematic. We saw that effective output didn't increase
176:32 effective output didn't increase meaningfully. And then we saw that
176:34 meaningfully. And then we saw that rework increased by a lot. And so then
176:36 rework increased by a lot. And so then the question here is what is the ROI of
176:39 the question here is what is the ROI of this AI adoption, right? It might be
176:41 this AI adoption, right? It might be negative. And what I want to point out
176:42 negative. And what I want to point out here is that had this company not
176:44 here is that had this company not measured this more thoroughly and simply
176:46 measured this more thoroughly and simply measured PR counts, they would have
176:48 measured PR counts, they would have thought, hey, we're doing great. We
176:50 thought, hey, we're doing great. We increased our productivity by 14%. Let's
176:52 increased our productivity by 14%. Let's run from the numbers. That's how many
176:54 run from the numbers. That's how many million lots of millions of dollars. And
176:56 million lots of millions of dollars. And does this offset the AI license? Sure
176:58 does this offset the AI license? Sure thing it does, right? The other thing is
177:00 thing it does, right? The other thing is that I don't think this company should
177:02 that I don't think this company should abandon AI. They should simply use this
177:03 abandon AI. They should simply use this data to understand what they're doing
177:05 data to understand what they're doing wrong. How can they improve? Because AI
177:07 wrong. How can they improve? Because AI is here to stay. It's a tool that's
177:08 is here to stay. It's a tool that's going to transform how engineers are are
177:10 going to transform how engineers are are working, right? and you can just um kind
177:13 working, right? and you can just um kind of like abandon it or yourself.
177:16 of like abandon it or yourself. Great. So, this concludes our insights
177:19 Great. So, this concludes our insights for today. If you've enjoyed this uh
177:21 for today. If you've enjoyed this uh talk and you would like similar insights
177:23 talk and you would like similar insights for your company, I invite you to
177:24 for your company, I invite you to participate in our research. Everything
177:26 participate in our research. Everything you've seen today can uh be accessed
177:29 you've seen today can uh be accessed through kind of participating in our
177:30 through kind of participating in our research, some of them through live
177:31 research, some of them through live dashboards in our research portal. And
177:34 dashboards in our research portal. And especially I'd like to invite companies
177:36 especially I'd like to invite companies that have access to Cursor Enterprise to
177:39 that have access to Cursor Enterprise to participate because we have a high need
177:41 participate because we have a high need for this so we can publish papers around
177:43 for this so we can publish papers around the granularity of using AI um in
177:45 the granularity of using AI um in software engineering. You can sign up at
177:47 software engineering. You can sign up at software engineering
177:48 software engineering productivity.stanford.edu.
177:50 productivity.stanford.edu. Thank you so much. [applause]
178:04 [music] next speaker will separate hype from reality on AI code quality using
178:07 from reality on AI code quality using realworld data to show when AI generated
178:11 realworld data to show when AI generated code can be trusted in production.
178:14 code can be trusted in production. Please welcome CEO of Kodto, Edidomar
178:17 Please welcome CEO of Kodto, Edidomar Freriedman.
178:28 It will grow. It will grow one or two more months. I'm really excited being
178:30 more months. I'm really excited being here. So many so much pragmatic and
178:32 here. So many so much pragmatic and insight and suggestions. I was sitting
178:34 insight and suggestions. I was sitting there uh just just before. So I'm Edmar
178:37 there uh just just before. So I'm Edmar Freiedman, the CEO and co-founder of
178:38 Freiedman, the CEO and co-founder of Kodto. Codto stands for quality of
178:40 Kodto. Codto stands for quality of development and I'm going to share uh
178:43 development and I'm going to share uh our reports and other companies reports
178:45 our reports and other companies reports about state of AI code quality. uh you
178:48 about state of AI code quality. uh you know trying to uh talk about the hype
178:51 know trying to uh talk about the hype versus reality which was uh like one of
178:53 versus reality which was uh like one of the uh points that were discussed here
178:56 the uh points that were discussed here quite a lot which is awesome. So in the
178:58 quite a lot which is awesome. So in the last three weeks, four weeks, we saw
179:00 last three weeks, four weeks, we saw like three outages in the clouds
179:02 like three outages in the clouds unfortunately, right? And these are
179:05 unfortunately, right? And these are coming from companies that really care
179:07 coming from companies that really care about moving fast, right? They're
179:09 about moving fast, right? They're they're they're saying themselves that
179:11 they're they're saying themselves that they're using AI to generate code 10%,
179:14 they're using AI to generate code 10%, 30%, 50%, at the same time, they care
179:16 30%, 50%, at the same time, they care about quality. So how did that happen?
179:19 about quality. So how did that happen? And is it related? I don't know, but let me share some guesses. By the way, 60% of developers say that about a quarter of their code is either generated by AI or shaped by AI, and 15% say that even more than 80% of their code is generated or shaped by AI. Now, people are using AI to do vibe coding, but actually they're even doing vibe checking and vibe reviewing. This is the security-review slash command of Claude Code, the prompt behind the command. It was hyped like two months ago; you know what I'm talking about now. It says there, I don't know if you can see it: "You are a senior security engineer." Good. And then somewhere down the line it says: please exclude denial of service. Don't catch denial-of-service issues. Maybe that's part of the reason we're having cloud outages. Probably not just that, but you get the point: we need to be rigorous about how we deal with quality. It's not just vibe quality, even if we're doing vibe coding sometimes.
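A command like the one described is just a prompt file checked into the repo. This is a rough illustrative sketch of its shape, with a hypothetical path and hypothetical wording, not the actual published command:

```markdown
<!-- .claude/commands/security-review.md (hypothetical path and wording) -->
You are a senior security engineer reviewing the changes in this branch.

Report concrete, exploitable findings: injection, broken authentication
or authorization, secrets committed to the repository, unsafe
deserialization.

Exclusions:
- Do NOT report denial-of-service issues.  <!-- the clause called out above -->
```

Whatever the exact wording, the point stands: a single exclusion clause quietly scopes an entire class of issues out of every review that uses the command.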
180:33 Let's go to another example. Okay, Cursor, or Copilot: most of you use rules, right? We're going to talk about it. You invest in code generation; after a while you understand that if you invest, you get more out of it. And we asked a bunch of developers, and I'm asking you as well. Think for a second, all the developers in the audience: when you write Cursor rules or Copilot rules and so on, do you feel they're completely followed, or just mostly followed? Do you know how much they're followed, and to what extent, how rigorously, how technically deeply? The answer we got back, what you see here on the screen, is mostly B, C, and D. They are followed, but not completely. Okay, so that means we are generating code and trying to push it to our standards, but it's still not necessarily getting to the quality we wanted.
181:29 I'm going to share a bit more statistics, information, and insights from three reports: one done by Qodo, another done by Sonar, and a third one, all of them focused on code quality, review, and so on. The sample size is thousands of developers, and in some cases even more: millions of pull requests and a billion lines of code that were checked. For example, think about Sonar. It's a company coming a bit from the pre-AI era, but they see code at scale, and they run a lot of checks on code that are not necessarily AI-focused but are necessary in order to check your software from all possible directions. That's why the scale of the code they're seeing is immense. Okay. So, for example, we took information from their report, and eventually my purpose here is to break down the different dimensions of what code quality means and share some stats and insights.
182:34 I want to start with the end. Okay, this is the takeaway I want you all to take from the next 13 minutes that I have. We started with code generation: we use it out of the box, autocomplete and so on, and if you invest in it you can get more out of it. But there's a glass ceiling on how much productivity you can get from code generation. Then we moved to agentic code generation, right? Let's call it gen 2.0. That's a higher glass ceiling: it can deliver much more productivity, especially if you invest in it, for example with rules. Then, with AI breaking outside of the IDE, we can start using AI for agentic quality workflows. They could run inside the IDE, but the truth is that if you think about all the workflows you have in your organization, especially if you're more than 100 developers or so, you probably have a lot of quality-related workflows that you need to automate. That's where you start breaking through the glass ceiling of productivity, if you invest in it. And finally, I claim that those agentic workflows need to keep learning, and we might touch a little bit on that later, because quality is something dynamic. You'll only finally break the glass ceiling if those quality workflows, rules, and standards are dynamic. And then you will see the promised 2x, let alone the hyped 10x you were promised. You heard from McKinsey and from Stanford that you're not getting that; I don't need to tell you about the 2x to 10x for the entire software development life cycle.
184:19 So, a bit more about market adoption. One of the reports says 82% adoption already, with AI dev tools being used daily or weekly. 59% report that they're using more than three, and 20% say they're using more than five code generation tools. If you think about it for a second, don't count only Cursor, Copilot, and so on. Sorry if I'm insulting anyone whose tool I forgot, but there's also Lovable and the like; they also generate code. And by the way, you're going to get to ten. Count on me: you're going to get to ten tools that generate code for you within two or three years. Come talk to me about it later and I'll try to convince you. And the thing is, it's coming bottom-up: like 50% of the usage is coming from teams of fewer than 10 developers. But it is propagating to the enterprise too, and I mean propagating to the enterprise at scale, not just five developers. In the last year we're seeing more and more enterprises using code generation. So on average across the reports, we saw 82 to 92% using code generation tools weekly to monthly, and in some cases, maybe extreme, maybe not, we're going to talk about it, we saw a 3x productivity boost in writing code. Okay, but 3x productivity in writing code doesn't mean you actually guarantee any quality, as I presented before.
185:49 So actually, 67% of the developers we asked have serious quality concerns about AI-generated code, the code generated by AI or influenced by AI, and they're claiming that they're missing a framework for how to deal with quality and how to measure it. It's a big question: what is quality? I'm going to talk about it in the next few slides. Okay, think about it for a second before I break it down. What is quality? So what we're actually seeing, the crisis with vibe coding, and we're seeing it shifting and evolving, is that you're getting more tasks done: some report 20% more tasks, you know, velocity, and roughly 97% more PRs being opened, and eventually it takes more time to review a PR, like 90% more time. And by the way, there's a lot of statistics about AI-generated code having at least no fewer bugs per line of code. I'm not claiming that there are more, but even if there are not fewer bugs per line of code, you have many more bugs, because there are many more PRs, much more code being generated, and so on. Right? So that's a problem for the reviewer. Is anybody surprised it takes more time to review these, especially in the age of agents? After five minutes of calling Claude Code, I have 1,000 lines of code. Once upon a time, it took me hours to write 10 proper lines of code. Right?
187:19 Now let's zoom out for a second. Code generation is magnificent, okay? It's a game-changer when you're talking about greenfield; you saw people talk about it a few slides, a few minutes before me. It revolutionized how we do proofs of concept, projects, and so on. But when you're dealing with heavy-duty software, then, like it or not, we are dealing with a lot of things. When you serve millions of clients, you have financial transactions; when you're doing transportation, you're dealing with code integrity, or if you like, code governance: review, standards, testing, reliability, and so on. That's what we need to deal with. Now let's break that under-the-surface part of the glacier into two dimensions. One dimension: you can look at the quality issues throughout the software development life cycle, like planning, then development, writing code, then review (code review is a bit of a process, but checking quality is part of the code review process), then testing, which is another part of quality, and deployment. I know I didn't cover the entire software development life cycle, but just to give you an example: each one of them introduces new problems that come from using more and more AI-generated code.
188:41 Now, another dimension to look at is code-level problems versus process-level problems. Okay, I'm not opening the list of functional issues, just the list of non-functional ones. You're talking about security, inefficiencies that are not necessarily functional; I'll show you some statistics about that. And process level is, for example, learning. Hey, if you have a bad outage because of AI-generated code, who is responsible? Is it the AI, or the team that owns that code? You need to learn and own the code eventually; that's a process that needs to be done. Verification, guardrails, standards, and so on. So when all of those issues were introduced to the thousands of developers we asked, whether AI actually helped reduce those problems or actually made them more challenging, developers reported that they spend 42% more of their development time on solving issues, fixing bugs, and so on, and they saw 35% project delays. Okay, there's some bias: we told them we were talking about problems with quality and their impact. But that's what they presented in their answers about mass use of AI-generated code. And we see some of the reports talking about 3x more security incidents. By the way, it makes sense. Remember we had a slide saying 3x more code being written, so 3x more security incidents: the same amount of problems per line of code. Correlation.
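That correlation is just arithmetic: if defect density per line stays flat while output triples, absolute incident counts triple with it. A toy illustration (the numbers here are made up for the sketch, not the reports' data):

```python
# Illustrative numbers only: constant defect density, 3x more code.
DEFECTS_PER_KLOC = 4.0            # assumed flat, with or without AI

kloc_before = 100                 # code shipped per quarter, pre-AI
kloc_after = 3 * kloc_before      # ~3x more code written with AI help

incidents_before = DEFECTS_PER_KLOC * kloc_before   # 400.0
incidents_after = DEFECTS_PER_KLOC * kloc_after     # 1200.0

assert incidents_after == 3 * incidents_before
```

Same quality per line, three times the volume, three times the incidents; that is the whole correlation.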
190:33 So what to do with that? I talked about problems and problems and problems. Okay, help me deal with it. Let's spend a few minutes on that. So one suspect, of course, is testing, and it's actually really interesting. We asked a couple of questions about testing, and one really relevant finding is that when people heavily use AI to do testing, they actually double their trust in the AI-generated code. Okay, that's one thing. The next suspect to help us with quality is code review. What's really interesting about code review is that it's a process that helps with almost all the process-level and the code-level issues. For example, you can set your AI code review tool to block a PR if it doesn't reach a certain level of test coverage. So through the PR, you take care of the testing-process problem. Okay.
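A gate like the one just described can be a few lines in CI. This is a minimal sketch, assuming the review tool exposes covered and total changed-line counts for the PR; the 80% bar and the function name are illustrative, not any specific vendor's API:

```python
# Sketch of a diff-coverage quality gate for a PR.
# Assumption: some upstream step supplies `covered` and `total`
# changed-line counts (e.g. parsed from a coverage report).

def coverage_gate(covered: int, total: int, threshold: float = 0.80) -> bool:
    """Pass the PR only if its changed lines meet the coverage bar."""
    if total == 0:
        return True  # nothing to cover, e.g. a docs-only PR
    return covered / total >= threshold

# Blocking then reduces to one check:
if not coverage_gate(covered=35, total=50):
    print("Blocking PR: diff coverage 70% is below the 80% bar")
```

The useful property is that the gate runs on every PR with no human in the loop, which is exactly where the review-time problem from the earlier statistics bites.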
191:33 So code review with AI is actually one of the major things you can do, and developers who are using an AI code review tool say they're seeing double the quality gain, and that it actually helps them improve their productivity in writing code by 47%. Okay. Now, a bit of statistics from our own AI code review tool. We scan a million PRs a month, and we took a million of those PRs and noticed that 17% include high-severity issues. By the way, we're now analyzing before versus after using AI. I don't have those statistics yet, but since we started, most of the companies we serve have been using AI-generated code; that's why I don't have a before, and we need to go scan backwards. And that's a really big number.
192:28 Another thing I want to talk to you about, when you're trying to improve on quality, is the foundation of having the right context brought to the code generation tool and to the AI code review tool. Better context, better quality, across the board, wherever you're using AI. So when we asked developers about when they don't trust AI-generated code (remember, 67% are really worried about that), they said that 80% of the time they don't trust the context that the LLM has. And when we asked developers what they would like to see improved in their AI-generated code and their AI code review tool, the number one answer, at 33%, was context, and they could choose among many things to improve. So context is extremely important. I can tell you that at Qodo, one of our technology moats is around context, and when you connect our context engine, we see it as the number one tool being used: when code generation or code review tools call an MCP, 60% of those calls go to a context MCP. Okay. And just to tell you, the context doesn't necessarily need to include only your code. It can also include context from your standards and your best practices. We're seeing in our AI code review that 8% of the context usage is actually from files related to standards, best practices, and so on.
194:00 Okay, as the CEO of Qodo, marketing will be mad at me if I don't brag a little bit, right? So this is, kind of, the architecture of our context engine being presented by Jensen in the GTC keynote. And notice: he didn't talk about our code review capabilities or our testing capabilities. He talked about our context engine, which NVIDIA checked, because there's a realization that AI quality, whatever is AI-generated, review, testing, will come from bringing the right context. So invest in that. You need to build your context: buy a solution and invest in it, or build your own solution, and so on. And the context needs to include code, versioning, PR history, organization logs, and so on. That's where all the context sits; it's not just in the last branch of your codebase.
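Those sources (code, versioning and PR history, standards files, organization logs) can be pictured as one retrieval layer feeding whichever AI tool asks. A toy sketch of the idea, with every name hypothetical; this is not Qodo's actual engine:

```python
from dataclasses import dataclass

@dataclass
class ContextChunk:
    source: str   # "code", "pr_history", "standards", "logs", ...
    text: str
    score: float  # relevance to the current task, higher is better

def assemble_context(chunks: list[ContextChunk], budget: int) -> str:
    """Greedily pack the most relevant chunks from *all* sources,
    not just the latest branch of the codebase, into a size budget."""
    picked, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + len(chunk.text) <= budget:
            picked.append(chunk)
            used += len(chunk.text)
    return "\n\n".join(f"[{c.source}] {c.text}" for c in picked)
```

Because standards files and PR history compete on the same relevance score as code, a strongly relevant standards file makes it into the prompt even on a tight budget, which matches the 8%-from-standards observation above.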
194:49 Okay, so I'm zooming out, starting to talk about recommendations and takeaways. So what's next? Automated quality gateways: invest in that. People talked throughout the morning about parallel agents, you know what I'm talking about, background agents. You can use a lot of those tools and capabilities to build your quality gates. Use intelligent code review and testing, and you need living and breathing documentation; what documentation means is a story by itself, and I'm not going to double-click on it. And this is how I've been presenting for three years now, and I think I'm going to go all the way until age 60 with this slide of how I think the future of software development looks like. Okay.
195:41 like. Okay. So basically you have your specification and you have your code
195:44 specification and you have your code right and you have multiple agents
195:46 right and you have multiple agents parallel agents that are helping you to
195:49 parallel agents that are helping you to improve your spec write your spec
195:50 improve your spec write your spec improve your code transfer transfer from
195:53 improve your code transfer transfer from your spec to your to your code uh make
195:57 your spec to your to your code uh make tests which are executable specs right
196:00 tests which are executable specs right uh and and then you're going to have
196:01 uh and and then you're going to have your context engine the software
196:02 your context engine the software development database and you will build
196:05 development database and you will build your tools especially MCPs around
196:07 your tools especially MCPs around quality and verification and you will
196:10 quality and verification and you will Make sure you have environments, stable,
196:13 Make sure you have environments, stable, secured sandboxes where those agents can
196:16 secured sandboxes where those agents can run and and run validation and quality
196:18 run and and run validation and quality uh workflows. So don't don't forget like
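The spec-and-code loop just described, with parallel agents enforcing quality gates, can be sketched roughly as below. The gate names and checks are invented stand-ins for the agent-backed review, testing, and spec checks the speaker mentions; only the shape of the workflow (independent gates run in parallel, all of which must pass) comes from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

def run_gates(gates, artifact):
    """Run independent quality gates in parallel and collect pass/fail."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(check, artifact) for name, check in gates.items()}
    return {name: fut.result() for name, fut in futures.items()}

# Hypothetical gates standing in for agent-backed review, tests, and spec checks.
gates = {
    "review": lambda code: "eval(" not in code,  # crude static check
    "tests":  lambda code: "def test_" in code,  # a test must exist
    "spec":   lambda code: '"""' in code,        # a docstring acts as the spec
}

patch = 'def test_add():\n    """adds two ints"""\n    assert 1 + 1 == 2'
report = run_gates(gates, patch)
print(all(report.values()))  # True: the patch clears every gate
```

In a real pipeline each lambda would be replaced by a background agent or CI job; the merge decision stays the same all-must-pass reduction.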
196:21 uh workflows. So don't don't forget like the path forward is quality is your
196:24 the path forward is quality is your competitive edge over your uh
196:27 competitive edge over your uh competition. AI is a tool. It's not it's
196:30 competition. AI is a tool. It's not it's not a solution. Okay? And don't like
196:33 not a solution. Okay? And don't like only think about code generation as the
196:35 only think about code generation as the only thing. Look on the entire SDLC or
196:38 only thing. Look on the entire SDLC or product development life cycle. I saw
196:40 product development life cycle. I saw one of the uh people talked um speakers
196:44 one of the uh people talked um speakers and it iterate with everything we talked
196:46 and it iterate with everything we talked about today. I have uh I want to tell
196:48 about today. I have uh I want to tell you that you will gain value from it.
196:51 you that you will gain value from it. We're seeing in the reports people
196:53 We're seeing in the reports people seeing like security availability being
196:55 seeing like security availability being reduced faster code review you we just
196:58 reduced faster code review you we just got a hit on that because of AI
197:00 got a hit on that because of AI generated code and test coverage in a
197:02 generated code and test coverage in a month can can triple depends on on the
197:05 month can can triple depends on on the project etc. with with the last minute I
197:08 project etc. with with the last minute I want to show you like a really small
197:09 want to show you like a really small piece of what you can do with codo. uh
197:12 piece of what you can do with codo. uh you can go into codto and define your
197:14 you can go into codto and define your own rule for example almost the same
197:17 own rule for example almost the same rule you'll put on cursor of I don't
197:19 rule you'll put on cursor of I don't like nested ifs if this is a problem
197:21 like nested ifs if this is a problem that you have but then codto will look
197:24 that you have but then codto will look on your context build the good example
197:26 on your context build the good example the bad example and then start giving
197:29 the bad example and then start giving like building a workflow that is
197:31 like building a workflow that is specifically to catch that issue and
197:35 specifically to catch that issue and give you statistics over time when it's
197:38 give you statistics over time when it's being accepted and when not so you can
197:40 being accepted and when not so you can adjust that rule and really know and
197:42 adjust that rule and really know and have visibility to to your standards.
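The nested-ifs rule used as the running example here is typically resolved with guard clauses. A minimal before/after sketch, using a hypothetical order-shipping function (not from the talk), shows the kind of good and bad example such a workflow would be built around:

```python
# Bad example a review agent would flag: three levels of nesting.
def ship_bad(order):
    if order is not None:
        if order["paid"]:
            if order["in_stock"]:
                return "shipped"
    return "rejected"

# Good example: guard clauses keep the function flat and each exit explicit.
def ship_good(order):
    if order is None:
        return "rejected"
    if not order["paid"]:
        return "rejected"
    if not order["in_stock"]:
        return "rejected"
    return "shipped"

order = {"paid": True, "in_stock": True}
print(ship_bad(order) == ship_good(order))  # True: same behavior, flatter code
```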
197:45 Okay. So when a PR is written with a few ifs and elses, although it was written with Cursor or Copilot that had a rule "do not do nested ifs," etc., then eventually when you open a PR you will get Qodo catching that and giving a suggestion according to the good and the bad example. Qodo will also make a graph, give you CLI checks that check each one of the rules and eventually flag the nested if, and then it will record and learn what you did or did not do with that suggestion in order to adapt the standard of quality. There will also be automated suggestions; you don't need to write your own. It learns your standards
and quality, and offers that to you. And that's it. I'm really, really excited about breaking the glass ceiling, okay, with what we did with code generation and then agentic code generation. Now we're turning to the era of putting AI to work through the entire SDLC. The most important part is related to quality; you would need to invest in that, it's not out of the box. Okay. And then you would eventually see the promised 2x that you probably promised to the CEO once they gave you the budget for the relevant tools. Thank you so much. [applause] [music]
199:16 Our next speaker is introducing MiniMax's latest model and how it powers next-gen experiences for code generation. Please welcome to the stage senior researcher at MiniMax, Olive Song. [music]
199:42 Hi. Hi everyone. I'm Olive. It's my great honor to be here today to present our new model, MiniMax M2. I actually lived in New York City for six years, so it feels great to come back, but with a different role. I currently study reinforcement learning and model evaluation at MiniMax. Let me just get a quick sense of the room: who here has heard of or tried MiniMax before? Oh, a couple there. Yeah, not everybody, but I guess that's the value, right, of me standing here today.
200:12 So we are a global company that works on both foundation models and applications. We develop multi-modality models, including text and vision-language models, our video generation model Hailuo, and speech generation and music generation, and we also have many applications in house, including agents. That's the specific thing that's different from the other labs and other companies: we develop both foundation models and applications, so we have researchers and developers sitting side by side working on things. Our difference would be that we have firsthand experience from our in-house developers going into developing the models that developers in the community would really need.
201:07 And here I want to introduce our MiniMax M2, which is an open-weight model, very small, with only 10 billion active parameters, designed specifically for coding and workplace agentic tasks. It's very cost-efficient.
201:27 costefficient. Um let me just go over the benchmark
201:30 Um let me just go over the benchmark performance because people care about
201:31 performance because people care about it. So uh we rank very top in both um
201:36 it. So uh we rank very top in both um intelligence benchmarks and also agent
201:39 intelligence benchmarks and also agent benchmarks. Uh we I think we're on the
201:42 benchmarks. Uh we I think we're on the top of the open source models. But then
201:45 top of the open source models. But then numbers don't tell everything because
201:47 numbers don't tell everything because sometimes you get those super high
201:50 sometimes you get those super high number models you plug into them um into
201:53 number models you plug into them um into your environment and they suck, right?
201:55 your environment and they suck, right? So we really care about the dynamics in
201:59 So we really care about the dynamics in the community and in our first week we
202:01 the community and in our first week we had the most downloads
202:04 had the most downloads and also we climbed up to top three
202:06 and also we climbed up to top three token usage on open router. So we're
202:09 token usage on open router. So we're very glad that people in the community
202:11 very glad that people in the community are really loving our model um into
202:13 are really loving our model um into their development cycle.
202:16 So today, what I want to share is how we actually shaped the main model characteristics that make M2 so good in your coding experience. I'm going to present the training behind each one of them: from coding experience, to long-horizon state-tracking tasks, to robust generalization across different scaffolds, to multi-agent scalability.
202:46 So first, let's talk about coding experience, which we supported with scaled environments and scaled experts.
202:56 Developers need a model that can actually work in the languages they use and across the workflows they deal with every day. That means we need to utilize real data from the internet and then scale the number of environments, so that during training, for example during reinforcement learning, the model can actually react to the environment, target verifiable coding goals, and learn from them. That's why we scaled both the number of environments and our infrastructure, so that we can perform that training very efficiently.
203:38 So with data construction and reinforcement learning, we were able to train the model to be very strong: full-stack and multilingual. And what I want to mention here is that besides scaling environments, which everybody talks about, we actually scaled something called expert developers as reward models. As I mentioned before, we have a ton of super-expert developers in house who could give us feedback on our model's performance. They participated closely in the model development and training cycle, including problem definition, for example bug fixing or repo refactoring. They also identified the model behaviors that developers enjoy, what's reliable, and what developers would trust, and they gave precise reward and evaluation on the model's behaviors and final deliverables, so that it is a model developers really want to work with, one that adds efficiency for developers.
204:54 So with that, we were able to lead in many languages in real use.
204:58 The second characteristic MiniMax M2 has is that it performs well on long-horizon tasks: tasks that require interacting with complex environments and using multiple tools with reasoning. We supported that with the interleaved thinking pattern and reinforcement learning.
205:24 So what is interleaved thinking? A normal reasoning model that can use tools works like this: it is given the tool information, the system prompt, and the user prompt, and then the model thinks and calls tools (it can be a couple of tools at the same time). Then it gets the tool responses from the environment, performs a final round of thinking, and delivers final content. But here's the truth, right? In the real world, environments are often noisy and dynamic. You can't really complete a task in one pass: you can get tool errors, for example, or unexpected results from the environment, and stuff like that. So what we did is imagine how humans interact with the world: we look at something, we get feedback, and then we think about it; we think about whether the feedback is good or not, and then we take other actions and make other decisions. We did the same thing with our M2 model. If you look at the diagram on the right: instead of just stopping after one round of tool calling, it actually thinks again and reacts to the environment, to see if the information is enough for it to get what it wants. So we call it interleaved thinking, because it interleaves thinking with tool calling, and it can be tens to a hundred turns of tool calling within just one user interaction turn.
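The interleaved pattern just described (think, call a tool, read the observation, think again, for up to many turns) can be sketched as a plain loop. The model and tool here are toy stand-ins, not the real M2 interface, which the talk does not show:

```python
def run_interleaved(model, tools, prompt, max_turns=100):
    """Interleaved thinking: the model reasons between every tool call
    instead of thinking once and emitting a final answer."""
    history = [("user", prompt)]
    for _ in range(max_turns):
        step = model(history)             # think: decide the next action
        history.append(("assistant", step))
        if step[0] == "final":
            return step[1]
        _, name, args = step              # ("call", tool_name, args)
        try:
            result = tools[name](*args)   # act in the environment
        except Exception as exc:          # noisy environment: feed errors back in
            result = f"tool error: {exc}"
        history.append(("tool", result))  # observe, then think again next turn
    return "gave up"

def toy_model(history):
    # Stand-in model: requests one addition, then reads the observation.
    kind, payload = history[-1]
    if kind == "tool":
        return ("final", f"the sum is {payload}")
    return ("call", "add", (2, 3))

print(run_interleaved(toy_model, {"add": lambda a, b: a + b}, "what is 2+3?"))
# → the sum is 5
```

The error branch is what makes the loop robust to the noisy environments the speaker describes: a failed tool call becomes just another observation to reason over, rather than a crash.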
207:15 So it helps with adaptation to environment noise: just like I mentioned, the environment is not stable all the time, and when something is suboptimal the model can choose to use other tools or make other decisions. It can focus on long-horizon tasks and can automate your workflow, using, for example, Gmail, Notion, and the terminal all at the same time; you maybe just need to make one model call, with minimal human intervention, and it can do it all by itself. And here's a cool illustration on the right, because it's New York City and I feel the vibe of, you know, trading and marketing. You can see there were some perturbations in the stock market, I think last week, and our model was able to keep things stable. So, just like I said, there's environment noise, there's new information, there's news, there are other trading policies and stuff like that, but our model was able to perform pretty stably in these kinds of environments.
208:29 And the third characteristic is our robust generalization to many agent scaffolds, which was supported by perturbations in the data pipeline.
208:42 So we want our agent to generalize. But what is agent generalization? At first we thought it was just tool scaling: we train the model with enough tools, various tools, kinds of new tools, we invent tools, and then it will just perform well on unseen tools. Well, that was kind of true; it worked at first. But then we soon realized that if we perturb the environment a little bit, for example we change to another agent scaffold, then it doesn't generalize. So what is agent generalization? Well, we concluded that it's adaptation to perturbations across the model's entire operational space. If we think back, what is the model's operational space we talked about? It can be the tool information, the system prompt, the user prompt (they can all be different), the chat template, the environment, the tool responses. So what we did is design and maintain perturbation pipelines for our data, so that our model can actually generalize to a lot of agent scaffolds.
209:57 And the fourth characteristic I want to mention is multi-agent scalability, which is very possible with M2 because it's very small and cost-effective.
210:14 I have a couple of videos here. This is M2 powered by our own MiniMax Agent app; we actually have the QR code down there, so if you want it you can just scan and try it. It's an agent app we developed, and here we can see different copies of M2, right? They can do research, write up the research results, analyze them and put them in a report, put it into some kind of front-end illustration, and they can work in parallel. Because it is so small and so cost-effective, it can really support those long-running agentic tasks, and tasks that maybe require some kind of parallelism.
211:04 So what's next for MiniMax M2? From what I've introduced, we gathered environments, algorithms, data, expert values, model architecture, inference, evaluation, all of these, to build a model that was, you know, fast, that was intelligent, that could use tools, that generalizes. What's next, for M2.1 and M3 in the future? We're thinking of better coding, maybe memory work, context management, proactive AI for the workplace, vertical experts, and, because we have those great audio generation and video generation models, maybe we can integrate them. But our mission is that we're committed to bringing all these resources, whatever is on the screen and maybe more, and our values, and putting them all together to develop models for the community to use. So we really need feedback from the community if possible, because we want to build this together, and, you know, this is kind of a race that everyone needs to participate in, and we are committed to sharing it with the community. Yeah.
212:22 And that's all the insights for today. Again, we really hope you'll try the model, because it's pretty good. You can contact us up there, and you can try the models by scanning the QR code. Basically, that's it. Thank you all for listening. [applause]
212:52 Ladies and gentlemen, please welcome back to the stage Alex Lieberman. [music]
212:56 Let's give it up again for Olive and all the other speakers from the morning. [applause]
213:02 It is time for lunch. Very exciting. One thing I want to say before we head out for lunch, which is going to be downstairs in the expo: check out all the booths, talk to people, have food. You know, my own experience with going to conferences is that even though I talk on stage a lot, I find it very difficult to engage in conversation with people in these little small-group settings. I don't know: can I go and chat with people? Can I not? It's kind of awkward. I give you all permission to butt into conversations and introduce yourself. Ben and Swix have done an incredible job of cultivating such a high-quality community here, and the most value you will get is not just from these incredible presentations; it's from meeting other folks in the crowd. So please, you have my permission: butt into conversations, introduce yourself, share what you've learned with folks. And if you need any sort of icebreakers to get the conversation going, I have two for you. One: just go into a group and share your hottest take on the state of AI today. It's a great way to get off to a good start with someone. The second, a little less intense: is a hot dog a sandwich? Is cereal in milk a soup? That is how you're going to start conversations with folks. Everyone enjoy lunch. We'll see you back in an hour, and thanks so much for your time.
215:16 [music]
285:05 >> How's everyone? How you doing? Good lunch? Excited for the afternoon sessions. Out of curiosity, did anyone have the hot dog conversation? Who thinks that a hot dog's a sandwich? We got one. We got two. Anyone think a hot dog isn't a sandwich? Most of the crowd. That is usually the consensus. One other question: who thinks that they have the hottest take on the state of AI or AI engineering right now in the room? Anyone think they have the hottest take? Well, I'll give you a tee-up for later. My co-founder Arman is speaking around four, and I would say he has one of the hotter takes I've seen, which is he thinks all engineers should be paid like salespeople, based on output. That is going to attract a lot of debate, and I give you full permission to debate him after his talk. Well, are you guys ready to jump into the next group of sessions?
285:56 >> Let's do it. We will be diving into proactive agents from Google Labs, building GenBI at a Fortune 100 business, deploying AI within Bloomberg's engineering org, lessons learned building an AI browser, and developer experience in the age of AI coding agents. With that, please join me in welcoming our next speaker, Kathy Korevec, director of product at Google Labs. Let's give it up for her.
286:37 >> Hi everybody. I'm so excited to be here. I love New York and I love meeting everybody here. I am Kathy Korevec. I'm from Google Labs, and I work on this little team called ADA. I'm going to be talking about some of the stuff that we've been doing on this project called Jules.
286:53 So, a few months ago in my household, our dishwasher broke. And while it was being repaired, my husband decided that he was going to do all the dishes. He told me he was going to do this. But every single night, I found myself reminding him to do the dishes. You can imagine that got old pretty fast. And I realized that even though I wasn't physically washing the dishes, I was still carrying this mental load. I know a lot of you can probably relate to this. I was keeping track of whether or not that task was done, following up, making sure that things kept moving. And I realized in that moment that that's exactly where we are with asynchronous agents today. They can handle some of the work, but we're still the ones as developers carrying that mental load and monitoring them.
287:35 So here's the truth. Humans, we are serial processors, not parallel ones. We can juggle multiple goals, but we execute them in sequence, not all at once. When you manually kick off a task in Jules, you're usually waiting to be able to move on. And it's that pause, that gap in attention, where we really lose momentum. This is actually backed up by science: we think we're multitaskers, but we're actually switching between tasks very rapidly, and switching between tasks comes with a huge cost. It can cost up to 40% of your productive time. That's like half a day lost to switching contexts and reloading.
288:22 So if humans are unitaskers, what's the solution here with agents? For async agents to succeed, developers can't be expected to babysit them. We've all seen that post on Twitter of 16 different Claude tasks running in parallel on 16 different terminals on three different huge monitors. And when I first saw this, I thought, god forbid that is the DevX of the future. I don't want to manage work. I don't want to manage my agents. I want to be a coder. I want to build. So we need collaborators in our system that we can trust: agents that really understand context, can anticipate our needs, and know when to step in. And I think finally we're reaching that point with models, where they're getting better and better at executing end to end, as long as they clearly understand what our goals are. That's where trust really becomes the unlock: you can trust the system to know what's missing, to fill in the gaps, and to keep progress moving forward while you focus on what matters most. Essentially, we want Jules to do the dishes without being asked.
289:43 So most AI developer tools today are fundamentally reactive. You open up your CLI or your IDE and you ask the agent to do something and it responds, or it waits for you to start typing and then it autocompletes a suggestion. And there's a benefit to this model: it's very efficient. It only uses compute when you explicitly ask for it. But the real question I'm asking myself is, is this how I want to manage AI? Imagine a future where compute is not a limiting factor anymore. Instead of a single reactive assistant waiting for instructions, you could have dozens of small proactive agents working with you in parallel, quietly looking for patterns, noticing friction, and taking on the boring tasks that you don't want to do before you even ask. They can do things like fixing authentication bugs that you've been avoiding, updating configs, flagging potential errors, preparing migrations, and all of this can happen in the background, triggered off of things in my natural workflow.
290:44 So I really think there are four essential ingredients that make up proactive systems today. There's observation: the agent has to continually understand what is happening, what your code changes are, what your patterns are, what your workflow is, to get context about your entire project. Then there's personalization, and this one's difficult: it has to learn how you work, what you care about, what you tend to ignore, what your preferences are, the code that you absolutely don't want it to ever touch. Then it has to be timely as well. If it comes in too soon, it's going to interrupt you, and if it's too late, then the moment is lost. And it also has to work seamlessly across your workflow. It has to insert itself into the spaces where you naturally work already, in your terminal, in your repository, in your IDE, not forcing you to go somewhere else to some separate application that you forgot about. So bringing all these pieces together, you can imagine, is not trivial.
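[Editor's note: the four ingredients listed above (observation, personalization, timeliness, seamless integration) can be condensed into one toy loop. This is a hedged sketch to make the idea concrete; every class, field, and threshold here is hypothetical and not Jules's actual implementation.]

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """Something observed in the developer's workflow (hypothetical)."""
    kind: str   # e.g. "missing_test", "unused_dependency"
    path: str   # file the observation concerns
    idle: bool  # is the developer currently between tasks?

@dataclass
class ProactiveAgent:
    # personalization: paths the user never wants the agent to touch
    ignored_paths: set = field(default_factory=set)
    suggestions: list = field(default_factory=list)

    def observe(self, event: Event) -> None:
        # observation + personalization: drop events on ignored paths
        if any(event.path.startswith(p) for p in self.ignored_paths):
            return
        # timeliness: only surface a suggestion while the user is idle,
        # so the agent interrupts neither too early nor too late
        if event.idle:
            self.suggestions.append(f"fix {event.kind} in {event.path}")

agent = ProactiveAgent(ignored_paths={"vendor/"})
agent.observe(Event("missing_test", "src/auth.py", idle=True))
agent.observe(Event("missing_test", "vendor/lib.py", idle=True))     # ignored path
agent.observe(Event("unused_dependency", "src/api.py", idle=False))  # bad timing
print(agent.suggestions)  # ['fix missing_test in src/auth.py']
```

The seamless-integration ingredient is what the sketch cannot show: in practice the events would be fed from the terminal, repository, and IDE the developer already uses.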
291:46 >> So is running this presentation. You want to be able to ask your agent to understand your workflow, anticipate your needs, and then intervene at exactly the right moment without breaking your workflow. And that's when it really starts to feel like magic.
291:59 The interesting thing is, these proactive systems are all around us today. One of my favorite examples is Google Nest: you put it in your house, you install it, you configure it, and then it starts to learn your habits as you leave the house, as you come back, as you go to sleep, as you wake up in the morning. And pretty soon you don't have to think about climate control in your house anymore, because it's learned what your habits are. Another one is your own body: your heart rate elevates as you go for a run or start to work out, or it anticipates that you're about to fall and reacts before you consciously think, "I'm going to put my hand out." So when you look at it like that, proactivity for AI is actually not that futuristic. It's very familiar and it is very human, and that's exactly the point. What we're building is tools that behave more like a good collaborator and less like command-line utilities.
292:56 So we're already doing this in a tool called Jules, which is a proactive, asynchronous, autonomous coding agent from Google Labs. And we're doing this in three levels of proactivity. Level one is where collaboration really starts to emerge, and this is how Jules works today: it can detect things like missing tests, unused dependencies, and unsafe patterns, and then it starts to automatically fix those things while it's doing other tasks that you've asked it to do. This is sort of like an attentive sous-chef in your workflow, keeping the kitchen clean, the knives sharp, the kitchen stocked, so that you can focus on what comes next. That's the beginning of proactive software.
293:38 At level two, the agent becomes more contextually aware of the entire project. It observes how you work and the code you write. If you're a back-end engineer, maybe you need help with React. If you're a designer, maybe it'll help write the database schema. It learns what your frameworks are, what your deployment style is, and so on. This is the kitchen manager, the person in your workflow keeping the rhythm and anticipating what you need next.
294:04 And then comes level three. This is what we're working on pretty hard right now going into December, and I'll show you a little bit of what we're going to be shipping in December in a minute. Level three is where things start to converge around that context. It's where the agent starts to understand not just context, but also consequence: how these choices are actually affecting the users of your products, the performance, and the outcomes. At that level we have Jules; we also have an agent called Stitch, which is a design agent; and another one we're building called Insights, which is a data agent. They're all coming together to build this collective intelligence across your application: Jules can see what's breaking in the software, Stitch understands how users are interacting with it, and Insights connects behaviors from real-world signals like analytics, telemetry, and conversion rates. Together they can propose improvements across the boundaries of the system, doing things like performance fixes to improve UX and design changes to prevent regressions, all of it organized based on live data. The trick here is that the human stays firmly in the loop. You're observing what the agents are doing, refining when you need to intervene, and redirecting when an agent has been misdirected. So level three isn't really about autonomy anymore. It's actually about alignment to your project: agents and humans collaborating together across the full life cycle of your project.
295:39 life cycle of your project. So right now Jules is focused on this
295:41 So right now Jules is focused on this code awareness piece that understands
295:43 code awareness piece that understands the environment, the frameworks and the
295:45 the environment, the frameworks and the project structures and we're moving
295:47 project structures and we're moving towards more of that system awareness.
295:49 towards more of that system awareness. So things that we're introducing in
295:51 So things that we're introducing in Jules now, we've added something called
295:52 Jules now, we've added something called memory which I'm sure a lot of you are
295:54 memory which I'm sure a lot of you are familiar with. It's the ability for
295:56 familiar with. It's the ability for Jules to write its own memories and you
295:59 Jules to write its own memories and you can edit them and interact with them. it
296:01 can edit them and interact with them. it can edit them and it understands that
296:02 can edit them and it understands that and builds this memory and context and
296:05 and builds this memory and context and knowledge of of your project as you work
296:07 knowledge of of your project as you work with it. We've added a critic agent
296:09 with it. We've added a critic agent which works adversarially with Jules to
296:11 which works adversarially with Jules to make sure that the code is is high
296:13 make sure that the code is is high quality but then also does a full code
296:15 quality but then also does a full code review. And then we've added
296:17 review. And then we've added verification where Jules will write a
296:18 verification where Jules will write a playwright script, take a screenshot,
296:20 playwright script, take a screenshot, and then put that back into the
296:22 and then put that back into the trajectory for you to validate. And then
296:24 trajectory for you to validate. And then we're also doing things like adding uh a
296:27 we're also doing things like adding uh a to-do bot that will look through your
296:29 to-do bot that will look through your code and look through your repository
296:32 code and look through your repository and pick up on anything that where
296:34 and pick up on anything that where you've said this is a to-do I want to
296:35 you've said this is a to-do I want to get to in the future and it will start
296:37 get to in the future and it will start to proactively work on those things with
296:39 to proactively work on those things with that context. We're also adding in
296:41 that context. We're also adding in things like best practices where Jules
296:43 things like best practices where Jules will understand best practices and start
296:45 will understand best practices and start to suggest those and also environment
296:48 to suggest those and also environment setup. We have an environment agent that
296:50 setup. We have an environment agent that we use internally for running evals and
296:53 we use internally for running evals and we're extending that externally to
296:55 we're extending that externally to better understand how environment how
296:57 better understand how environment how your environments work and and set those
296:59 your environments work and and set those up for you. And then we also are adding
297:01 up for you. And then we also are adding something called a just in time context.
297:03 something called a just in time context. It's like a jewels cheat sheet where if
297:05 It's like a jewels cheat sheet where if it's doing something very specific it
297:07 it's doing something very specific it can and gets stuck it can just
297:09 can and gets stuck it can just immediately look at that cheat sheet
297:10 immediately look at that cheat sheet instead of reaching out to you. So, this
297:13 instead of reaching out to you. So, this is all moving Jules very close to being
297:15 is all moving Jules very close to being that proactive teammate, not just this
297:17 that proactive teammate, not just this reactive assistant. Okay, so this
297:21 reactive assistant. Okay, so this morning I was talking to my team back in
297:22 morning I was talking to my team back in San Francisco and I was thinking, okay,
297:25 San Francisco and I was thinking, okay, I'm going to do a live demo, but the
297:27 I'm going to do a live demo, but the live demo gods did not align with me
297:29 live demo gods did not align with me this morning. We still have CLS that are
297:30 this morning. We still have CLS that are being pushed to staging right now. So,
297:32 being pushed to staging right now. So, I'm going to walk you through a little
297:34 I'm going to walk you through a little bit of this. And if you know Jed, he's
297:36 bit of this. And if you know Jed, he's going to, I think, be talking tomorrow.
297:38 going to, I think, be talking tomorrow. We're gonna um affectionately try to fix
297:40 We're gonna um affectionately try to fix Jed's code here. Um, so this is a view
297:44 Jed's code here. Um, so this is a view of of proactivity and this is this is
297:47 of of proactivity and this is this is Jules where you prompt it and the first
297:49 Jules where you prompt it and the first thing you that you do when you configure
297:51 thing you that you do when you configure and enable proactivity is Jules will
297:53 and enable proactivity is Jules will index your entire uh codebase. It'll
297:56 index your entire uh codebase. It'll index your directory and start looking
297:57 index your directory and start looking for things that it can do and then it'll
297:59 for things that it can do and then it'll that'll show up on the screen. So right
298:02 that'll show up on the screen. So right here we're looking at a little bit more
298:05 here we're looking at a little bit more in this um in this repository ADK Python
298:08 in this um in this repository ADK Python and uh and it's indexed the repository
298:12 and uh and it's indexed the repository and it's found a bunch of to-dos. It's
298:14 and it's found a bunch of to-dos. It's found a bunch of best practices that it
298:16 found a bunch of best practices that it can update and it's giving me some
298:17 can update and it's giving me some signal about what it's finding. And so
298:19 signal about what it's finding. And so you can see the signal is high
298:21 you can see the signal is high confidence, medium confidence, and low.
298:23 confidence, medium confidence, and low. And so it's actually telling me what it
298:25 And so it's actually telling me what it thinks it can achieve based on what's in
298:28 thinks it can achieve based on what's in my code and what it wants to do. And
298:31 my code and what it wants to do. And that's so it has high confidence in
298:32 that's so it has high confidence in green, medium and purple, low in yellow
298:35 green, medium and purple, low in yellow way down at the bottom. Um, and so I can
298:37 way down at the bottom. Um, and so I can go through this and I can manually click
298:39 go through this and I can manually click these and say I want to start these. And
298:42 these and say I want to start these. And so I don't have to think about the
298:43 so I don't have to think about the prompt. I don't have to look at the
298:45 prompt. I don't have to look at the code. I don't I I can do kind of less
298:47 code. I don't I I can do kind of less cognitive load here. We're working on
298:50 cognitive load here. We're working on something to just start these
298:51 something to just start these automatically. And so that's coming in
298:53 automatically. And so that's coming in the future. But I can also delete these.
298:55 the future. But I can also delete these. I can say, "Hey, this one isn't isn't
298:56 I can say, "Hey, this one isn't isn't for me. Isn't good." And so once it gets
298:59 for me. Isn't good." And so once it gets started on a task, I can kind of drill
299:01 started on a task, I can kind of drill into it and see a little bit more. I can
299:03 into it and see a little bit more. I can peek into the code that it is suggesting
299:06 peek into the code that it is suggesting uh that uh it's suggesting it work on. I
299:10 uh that uh it's suggesting it work on. I can find the location of that code. And
299:11 can find the location of that code. And it also gives me some rationale about
299:15 it also gives me some rationale about why it wants to work on that code, why
299:16 why it wants to work on that code, why what it's doing, etc. And so it's giving
299:18 what it's doing, etc. And so it's giving me a lot more context and helping me
299:21 me a lot more context and helping me trust that it knows what to do here.
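The suggestion flow just described can be sketched roughly as follows. This is an illustrative model only, not the actual Jules API: the `Suggestion` and `Confidence` names, and the example to-dos, are all invented for the sketch.

```python
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    HIGH = 3    # shown in green
    MEDIUM = 2  # shown in purple
    LOW = 1     # shown in yellow, way down at the bottom

@dataclass
class Suggestion:
    """One proactive task the agent found while indexing the repo."""
    title: str
    location: str       # where in the code the suggestion points
    rationale: str      # why the agent wants to work on that code
    confidence: Confidence
    started: bool = False

def rank(suggestions):
    """Order suggestions highest-confidence first, as in the demo UI."""
    return sorted(suggestions, key=lambda s: s.confidence.value, reverse=True)

def start(suggestion):
    """Manually kick off a suggested task."""
    suggestion.started = True
    return suggestion

# Illustrative data: the kind of thing indexing a repo might surface.
found = [
    Suggestion("Resolve TODO in agent loop", "src/agent.py:42",
               "Unfinished TODO left in the main loop", Confidence.HIGH),
    Suggestion("Apply logging best practice", "src/tools.py:10",
               "print() used where a logger is expected", Confidence.MEDIUM),
    Suggestion("Rename ambiguous variable", "src/util.py:7",
               "Single-letter name in a public function", Confidence.LOW),
]

queue = rank(found)
start(queue[0])  # click the top suggestion to start it
# Delete the ones that "aren't for me" (here: everything low-confidence).
queue = [s for s in queue if s.confidence is not Confidence.LOW]
```

The point of the shape is that the user curates an already-prepared queue (start or delete) instead of writing prompts, which is where the lower cognitive load comes from.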
299:26 Okay. So that's proactivity that's
299:28 coming in December, and hopefully we'll
299:31 be able to give that to everybody here.
299:33 We're very excited about it, and I want
299:35 to tell you a little story about uh
299:38 something my husband and I were working
299:39 on, just to kind of wrap things
299:42 up. We uh tinker a bunch with hardware
299:46 and we live on this slow street in the
299:48 middle of San Francisco, in the Haight-Ashbury
299:49 district. And so on Halloween we get a
299:51 lot of people walking by our house and
299:53 so we are trying to take advantage of
299:55 that with our Halloween decorations. And
299:57 so we built this six-foot animatronic
300:00 head that sits in the front of our
300:03 house. It's this old Victorian house.
300:06 And he sculpted it out of foam, epoxy,
300:08 and fiberglass. And our kids
300:11 also called this lovingly the bald head.
300:14 And it's based off of, if you ever
300:16 saw Pee-wee Herman from the 80s,
300:18 the Pee-wee's
300:19 Big Adventure head. Um so while my
300:22 husband was doing this, I was spending my
300:25 time working with Jules on updating the
300:27 firmware, controlling the stepper
300:28 motors, working on the um on the LEDs
300:31 and the sensors. And for me, the
300:34 fun part is really getting
300:35 creative with what the LEDs are doing.
300:38 So I wanted to focus on that, the LED
300:40 animations, but I ended up spending most
300:42 of my time actually fixing bugs and
300:45 swapping libraries and doing things like
300:47 that. So what I would do is I would
300:48 prompt Jules, I'd wait 10 minutes, and
300:51 then I would repeat. And I found that
300:54 process very, very tedious. And what I
300:57 wanted was actually Jules to do the
300:59 research. I wanted it to handle the
301:01 ugly parts, where it was researching how
301:04 to fix a bug, uh doing the debugging
301:07 itself. And I wanted it to do this so
301:09 that I could focus on the creative
301:10 parts. I wanted the eyes to move and
301:13 like follow people as they walk down the
301:15 street, and like have lasers coming
301:17 out of its eyes and stuff. Like I
301:18 mentioned, it was Halloween. It was very
301:19 scary. Uh but I
301:22 couldn't really do as much of that. And
301:24 I ended up actually not shipping as much
301:26 as I wanted to with this animatronic
301:29 bald head. And so it's that gap that we
301:33 actually want to close.
301:35 With Jules, it's the space
301:36 between that tool friction and creative
301:39 freedom that we're trying to unlock with
301:42 these kinds of proactive agents.
301:44 So what I really want you guys to take
301:48 away from this, and I give this advice to
301:50 the folks on the Jules team a lot, is
301:53 that the product we build today actually
301:56 won't be the products that
301:57 we have in the future. And I think a lot
301:59 of us know that, but in reality I want
302:02 everybody in this room and everyone
302:03 building and working with AI to be able to
302:06 take those big steps. I think the
302:08 patterns that we rely on today, Git, uh,
302:10 your IDEs, even how we
302:13 think about the code itself, might not
302:16 exist a year from now, might not exist
302:18 six months from now. And that's the
302:20 exciting part for me. It's sort of, we get
302:22 to invent the future right now. We get to
302:25 describe and decide how software is made
302:28 and built, uh, kind of all the people in
302:30 this room. So my challenge to you is
302:34 to not be afraid to question the old
302:37 ways of how you're building software,
302:38 because really the future is coming
302:41 faster than any of us know. It's
302:43 probably already here, and the cool thing
302:45 is we get to build it together. Thank
302:48 you.
302:49 [applause]
302:51 [music]
302:53 Our
303:01 next talk is a case study from the enterprise on incremental rollout of AI.
303:04 Here to provide us with a blueprint for
303:06 making AI transformation fundable,
303:08 governable, and real inside large,
303:11 risk-averse organizations is engineering
303:14 leader at Northwestern Mutual, Assaf
303:17 Board.
303:23 >> [music] [applause]
303:32 >> Doesn't this look like something's going to drop from the ceiling? Like a ground
303:35 zero type thing? [snorts] Be honest.
303:37 Like, who has a buzzer that if I
303:39 really suck, they press it and
303:41 everything falls down through the trap
303:43 door? No.
303:44 >> Be careful.
303:44 >> Yeah. Okay. Who was it? Okay. You tell
303:48 me if I'm doing okay or if I should take
303:49 a couple steps back. Right. So, hi
303:52 everyone. I'm Assaf. Um, and I'm here to
303:55 talk about GenBI. And kind of a first
303:58 disclaimer: this presentation was not
304:00 created with GenAI. Um, to be honest, I
304:03 actually started doing it uh with uh
304:06 GPT-o3 back in August. Uh, [snorts] and
304:09 then I did kind of a first draft, and
304:11 then a couple of weeks back I wanted to
304:13 come in and refresh it before the
304:15 conference, and then GPT-5 took over and
304:18 completely messed up my slides, so I ended
304:21 up doing it manually, kind of
304:22 old-fashioned. So if I'm missing like an
304:25 em dash somewhere in the middle, let me
304:27 know after. Okay. [snorts] Uh, so first
304:30 of all, a bit of housekeeping. What's
304:31 GenBI? So it's a fusion of GenAI and
304:34 BI. It's basically an agent that helps
304:37 people answer business questions with
304:39 data, like a business intelligence
304:42 person would do in real life. Uh the
304:44 reason that we're pursuing GenBI is
304:46 really because of the data
304:47 democratization that it can bring.
304:49 Right? So having access to data at your
304:52 fingertips without having to be reliant
304:54 on a BI team that helps you find a
304:56 report, figure out what it means, uh
304:58 understand your world, before they can
305:00 even give you any kind of input. Uh, so
305:03 that's GenBI. Uh, a bit about
305:05 Northwestern Mutual. That's where I
305:07 work. So, we're a financial services,
305:10 life insurance, and wealth management company.
305:11 Been around for 160 years. Uh, [snorts]
305:14 some very impressive numbers there. But
305:16 first of all, I want to say why is
305:18 Northwestern Mutual a great place to do
305:20 GenAI? We got a lot of data, we got a
305:23 lot of money, we got a lot of use cases,
305:26 and we got access to some of the best
305:27 talent uh anyone can dream of. Really,
305:30 truly humbled by the people that I get
305:32 to work with. Um, but on the flip side,
305:36 why is it hard to do GenAI at
305:38 Northwestern Mutual? Because it is a
305:40 very risk-averse company, right? If you
305:43 think about it, our main motto is
305:45 generational responsibility. I call it
305:48 don't f up. Uh, because what we end
305:51 up selling to people is a decades-long
305:55 commitment, right? You buy life
305:58 insurance now,
306:00 uh, if you stay with us until it comes
306:03 to term, so to speak, that can be 20,
306:06 40, 80 years down the line, depending on
306:09 when you buy it and how long you get to
306:10 live. And so stability is something
306:13 that's very important for us, because
306:15 it's important for our clients. So, how
306:17 do we balance stability with innovation?
306:20 That's what I want to talk about today.
306:22 Um, and really the four main challenges
306:26 that we had when we even came up with
306:28 the idea, kind of a pie-in-the-sky GenBI
306:31 concept. Uh, first [snorts] of all, no
306:34 one's done it before, right? Truly, no
306:36 one's done GenBI in this fashion in the
306:38 past. Uh, secondly, and this was really
306:41 a preference for us, we wanted to use
306:44 actual data that's messy, because we knew
306:47 that that's where the real
306:50 challenges are going to be, right?
306:51 Understanding actual messy data for a
306:54 160-year-old company, and how we can perform
306:57 well within that ecosystem. Um, the third
307:00 was kind of a blind trust bias. So um
307:05 the trust that we had to build
307:07 was both with the users but also with
307:09 the leadership of the company, right?
307:11 How can we bring accurate information,
307:14 accurate answers to people, when uh all
307:17 of these things that we know about and
307:19 everyone's talked about are just out
307:21 there, right? No one's blind to the
307:23 trust barriers. No one's blind to the
307:24 accuracy barriers. So, how do we
307:26 convince people that this is actually something
307:28 that we can trust in the company? And
307:31 lastly,
307:33 um, but really firstly, when we go to
307:36 approach this from an enterprise
307:37 perspective: budget impact, right? How
307:40 do we convince someone in a leadership
307:42 uh organization where risk aversion is
307:46 ingrained in the DNA to even invest in
307:49 something like this, that no one's done
307:51 before? We don't really know how we
307:53 would do it. Uh, we're not even sure what
307:54 it would look like when it comes to
307:56 term.
307:58 Uh so I'll start kind of one by one, uh,
308:01 and first of all really talk about why
308:03 we chose to use actual data, uh, and not
308:06 synthesized data or cleansed data. Uh,
308:08 [snorts] so really it's about making
308:10 sure that we understand the actual
308:11 complexities that we will have to face
308:14 when we eventually want to go to
308:16 production, right? We know that, you know,
308:18 building uh POCs and demos is so easy,
308:20 but the gap from POC to production is so
308:23 broad, uh, especially in this GenAI space,
308:26 especially because we don't know upfront
308:28 how to design the system or what we would
308:30 expect it to behave like. So making sure
308:33 that we operate with real data just gave
308:35 us that extra confidence that when
308:36 something works in the lab, it's very likely
308:39 to also work in reality. Uh but also, and
308:42 maybe not the least important,
308:46 is that we got to work with actual
308:48 people who work with the data day in and
308:50 day out, and that gave us two things.
308:52 Okay, first of all, subject matter
308:54 expertise, which is super critical for
308:56 us to be able to validate that the
308:57 system is actually working. It gave us a lot
309:00 of real-life examples of what people are
309:02 actually asking in a corporate setting and
309:04 what people have answered them. So
309:06 basically the eval, right, and all the
309:08 testing and stuff. Uh but at the end of
309:11 the day, it also brought the business to
309:15 be a part of the research project itself,
309:18 and they became kind of bought into the
309:20 idea as part of the process. So we
309:23 didn't just test something in the lab
309:24 and then have to convince someone to go
309:26 ahead and use it. The end users were
309:29 part of the research process itself. And
309:32 so when eventually it matured enough so
309:34 we could take some of that to production,
309:37 they were already there and they
309:38 actually were pulling for it. They told us,
309:40 "We want to take this. How can we wrap
309:42 it? How can we package it uh quickly
309:44 enough so we can put it into practice?"
309:50 Uh and the next part was really about
309:53 building trust. Uh so this is about
309:55 building trust first of all with our
309:57 management team, right? Now, I don't know
309:59 about you, but last time that I got a
310:01 million dollars to do a research project
310:04 that I wanted on a pie-in-the-sky idea, I woke
310:06 up from the dream and I realized that
310:08 this is not how things work in reality.
310:10 You don't just get a million dollars and
310:13 go ahead and try something out. Uh you
310:14 have to show that you know what you're
310:17 doing. And part of what we did, it's
310:19 kind of listed out here, but obviously,
310:21 you know, we did all the regular stuff,
310:22 right? We worked in a sandbox
310:24 environment. We made sure that we're not
310:26 using actual client data. We made sure
310:29 to put all the security risks aside,
310:31 but uh one of the first approaches that
310:33 we said we're going to take is we're not
310:35 just going to build a tool that's going
310:38 to be uh released to everyone, right? We
310:43 understood very quickly that um how
310:45 people interact with the tool, their
310:47 ability to verify that what they're
310:49 getting is right, and also give us
310:51 feedback, changes dramatically depending
310:53 on their expertise and understanding of
310:55 the data. So we took that crawl, walk,
310:58 run approach that basically said we're
311:01 first going to release it to actual BI
311:04 experts, right? People that would be
311:06 able to do it on their own and know what
311:07 good looks like when they get it. And
311:09 we're just going to expedite the process
311:10 for them. Kind of like GitHub
311:13 Copilot. The next phase would be to
311:15 bring it to business managers. And
311:17 again, people who are closer to the BI
311:20 team, but when they see a mistake, they
311:22 can pretty much figure out that what
311:24 they're seeing is wrong, because they're
311:26 used to seeing that on a day-to-day basis.
311:28 Um, and they might be less
311:30 sensitive to these types of mistakes and
311:31 be more inclined to give us that
311:33 feedback instead of just, you know,
311:35 dumping it aside and never using it
311:37 again. Giving this type of tool to
311:39 executives in the company, I don't even
311:41 know when we're going to get there,
311:43 right? Like an executive, they want
311:46 clear, concise answers that they know
311:48 they can trust. We're definitely not
311:50 there yet. I think that's the vision uh
311:52 at some point in time, but the system is
311:54 not accurate enough for us to get there. Maybe it never will be.
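The crawl, walk, run rollout described here amounts to gating access by how well a cohort can recognize a wrong answer. A minimal sketch of that idea, where the cohort names and the gating rule are assumptions for illustration rather than Northwestern Mutual's implementation:

```python
from enum import IntEnum

class Cohort(IntEnum):
    """User groups, ordered by how tolerant they are of a wrong answer."""
    BI_EXPERT = 1         # crawl: knows what good looks like on their own
    BUSINESS_MANAGER = 2  # walk: can usually spot a bad number
    EXECUTIVE = 3         # run: needs answers they can trust outright

# Current stage of the rollout: every cohort at or below it gets access.
ROLLOUT_STAGE = Cohort.BUSINESS_MANAGER

def has_access(user_cohort: Cohort, stage: Cohort = ROLLOUT_STAGE) -> bool:
    """Gate the tool: release to the most error-tolerant cohorts first."""
    return user_cohort <= stage

has_access(Cohort.BI_EXPERT)   # crawl cohort: already in
has_access(Cohort.EXECUTIVE)   # run cohort: still gated
```

One design consequence: advancing (or rolling back) the release is a one-line change to `ROLLOUT_STAGE`, which keeps the kind of control the talk emphasizes in the hands of leadership.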
311:56 Um, [snorts] another way that we,
311:58 another lever that we kind of used to build
312:01 inherent trust into the system, is that we
312:03 said, well, from the get-go we're not going
312:06 to even try to build SQL, right? This is
312:10 very complex, this is very hard even for
312:13 a person. So we said, step number one,
312:15 let's just bring information that is
312:18 already in the ecosystem, that's already
312:20 verified, right? We have a lot of uh
312:22 certified reports and dashboards, um, and
312:25 actually in the conversations we had with
312:27 some of the BI teams that we worked
312:29 with, they told us, "Guys, like 80% of the
312:32 work that we do is basically sending
312:33 people to the right report and helping
312:35 them figure out how to use it." So the
312:37 report is already there. Um and that
312:40 again built some inherent trust into how
312:43 we architected the system, because we
312:44 said we're not going to make up
312:46 information. We're just going to deliver
312:48 you the same asset that you would have
312:49 gotten anyway, just in a much faster, much
312:52 more interactive way. Uh and that was
312:54 the alignment of expectations that we
312:56 did very upfront with the uh users and
312:59 also with the management team.
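The "retrieve a certified asset, don't generate SQL" step can be sketched as a tiny router. The catalog, report names, and keyword-overlap scoring below are all invented for illustration; the point is only the design choice that the system returns an existing verified report instead of fabricating an answer:

```python
# Hypothetical catalog of certified, already-verified BI assets.
CERTIFIED_REPORTS = {
    "policy_lapse_dashboard": "lapse rate by product and region",
    "advisor_growth_report": "advisor headcount and growth trends",
    "claims_cycle_report": "claims processing cycle times",
}

def answer(question: str):
    """Route a business question to the best-matching certified asset.

    Returns (report_name, score). Because it never writes SQL or makes
    up numbers, the worst failure mode is pointing at the wrong (but
    still verified) report, not an invented answer.
    """
    q_words = set(question.lower().split())
    best, best_score = None, 0
    for name, description in CERTIFIED_REPORTS.items():
        # Naive relevance: count shared words between question and description.
        score = len(q_words & set(description.split()))
        if score > best_score:
            best, best_score = name, score
    return best, best_score

report, score = answer("What is our lapse rate by region this quarter?")
```

A production system would use embeddings or an LLM ranker rather than word overlap, but the trust property is the same: the deliverable is the certified report the BI team would have sent you anyway.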
313:01 Now, [clears throat]
313:03 the biggest, um,
313:07 process, or kind of the most important
313:09 approach that we took when uh
313:11 approaching our leadership team and
313:12 convincing them that we want to do this,
313:15 was to create a very gradual, incremental
313:18 process that gave them a lot of
313:20 visibility and control. [snorts] Uh and
313:23 it was very important for us to build
313:25 incremental deliveries throughout that
313:27 process, so that uh not only did they
313:31 have the visibility into what we are
313:33 funding now and what we get out of it,
313:35 they actually had business deliverables
313:37 they could realize potential from
313:39 throughout the process, and at any point
313:42 in time they could pull the plug, right