0:01 There are a lot of really technical
0:03 tricky concepts in AI engineering.
0:05 Today, we're explaining them in the
0:07 absolute simplest way possible, like
0:09 you're 5 years old. Okay, realistically,
0:11 maybe like 15 years old, but you get the
0:12 idea. This video won't get into the
0:15 details, obviously. Instead, my goal is
0:16 that by the end, you have a high-level
0:19 intuition for the main topics in AI
0:20 engineering today, so you can dig into
0:22 more technical tutorials with
0:24 confidence. Let's get started. So, what
0:26 is AI engineering? AI engineering is the
0:28 process of building applications with
0:30 readily available foundation models.
0:32 We'll talk about foundation models in a
0:34 minute. In practice, AI engineers start
0:36 from off-the-shelf models, often via an
0:38 API, adapt them with prompting, or one
0:40 of the other techniques we'll talk
0:41 about, and deploy them somewhere for
0:43 people to use. They make sure the AI
0:45 powered application has proper
0:47 evaluation, monitoring, security guard
0:49 rails, cost controls, and good enough
0:51 performance. This is different from
0:52 machine learning engineering, which
0:54 focuses more on creating and improving
0:56 the models themselves: working with data,
0:58 training models, and optimizing
1:00 architectures and metrics. So in the
1:02 definition of AI engineering we use the
1:04 term foundation model. A foundation
1:06 model is a large AI model trained on a
1:08 big data set like text, images and
1:09 videos from the internet that can be
1:12 adapted to many downstream tasks. It's a
1:13 general-purpose base model that you can
1:16 customize for specific uses like a
1:18 support bot or coding assistant instead
1:20 of training from scratch. The "foundation"
1:22 label highlights that these are powerful
1:23 building blocks, but still incomplete
1:25 without being adapted to a specific
1:27 task. This might sound like an LLM, and
1:29 that makes sense. Large language models
1:31 are a type of foundation model that are
1:34 trained to guess the next piece of text
1:36 after reading huge amounts of writing.
1:37 Because they get very good at these
1:39 guesses, they can summarize, answer
1:42 questions, translate, and write code.
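The "guess the next piece of text" idea can be sketched with a toy model that just counts which word follows which in some training text. This is my own illustration of the concept, not how real LLMs work (they use neural networks, not count tables):

```python
from collections import Counter, defaultdict

# A toy "language model": count which word follows which in training text.
training_text = "the cat sat on the mat the cat ate the fish".split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(training_text, training_text[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the training text."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" — it follows "the" most often above
```

A real LLM does the same job at a vastly larger scale, predicting token probabilities with billions of learned parameters instead of a lookup table.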
1:43 They don't retrieve facts like a
1:46 database. They encode lots of real world
1:48 knowledge in their parameters and
1:50 generate likely text so they can be
1:52 right but also confidently wrong. We've
1:54 had text models for a long time, but the
1:57 main reason modern LLMs perform so well
1:58 is because of a change to the way the
2:00 model works: specifically, the
2:01 introduction of the transformer
2:03 architecture. The transformer
2:05 architecture is a model design that
2:07 allows training to happen in parallel,
2:09 which makes very large models practical.
2:11 Most modern foundation models use this
2:13 design. This model design also lets each
2:16 word in a sentence pay attention to
2:18 other words, not just the ones next to
2:20 it. This makes it good at handling long
2:23 or tricky sentences and connections like
2:25 "the dog that chased the cat was brown."
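That "every word can look at every other word" computation is called scaled dot-product attention. Here is a minimal pure-Python sketch of the core operation, with made-up vectors standing in for learned weights:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product self-attention over lists of word vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Score every key against this query, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention weights: how much each word matters
        # Blend the value vectors according to those weights
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(d)])
    return out

# 3 "words", each a 4-dimensional vector (toy numbers, not learned weights)
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
out = attention(x, x, x)  # self-attention: every word attends to every word
```

Real transformers run many of these attention computations (heads) in parallel on matrices, which is exactly what makes the architecture so GPU-friendly.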
2:28 The attention mechanism is how a model
2:30 decides which part of the input matters.
2:33 Multiple attention heads can focus on
2:35 different things at once, like who a
2:37 pronoun refers to or the tone of a
2:39 sentence. This usually means
2:41 clearer, more accurate outputs. Speaking
2:43 of learning, if you want to go from
2:45 understanding these concepts at a high
2:46 level to actually building AI
2:48 applications, you'll need hands-on
2:50 practice. That's where today's sponsor,
2:52 DataCamp, comes in. DataCamp has two
2:54 excellent AI engineering tracks that
2:55 I've been really impressed with. First
2:57 is their Associate AI Engineer for Data
2:59 Scientists track. It's a series of
3:00 courses covering everything from machine
3:02 learning fundamentals to transformers,
3:04 prompt engineering, and fine-tuning. But
3:06 what really sets it apart is the MLOps
3:08 coverage. You'll learn MLflow, version
3:10 control with Git, automated testing, and
3:12 CI/CD concepts that most courses
3:14 completely skip. For developers, there's
3:16 also the Associate AI Engineer for
3:18 Developers track. 12 courses plus
3:20 projects where you'll actually build
3:22 real applications like chat bots,
3:24 semantic search engines using OpenAI
3:27 API, HuggingFace, LangChain, and
3:28 Pinecone. Again, they include the crucial
3:30 deployment and monitoring skills you'll
3:32 actually need in production. What I love
3:33 about DataCamp is that it's all
3:34 browser-based, with interactive coding
3:36 environments so you can start practicing
3:38 immediately without any setup. Plus,
3:39 they have built-in AI helpers to guide
3:41 you when you're stuck. Check out DataCamp
3:42 using the link in the description
3:44 to start building these AI engineering
3:45 skills hands-on. Now, back to the
3:47 concepts. One thing that comes up a lot
3:49 in AI and ML is the idea of a model
3:51 learning. When we say a model is
3:53 learning, we mean that it's updating its
3:56 parameters. Parameters are the internal
3:57 numbers that control the model's
3:59 behavior. During training, the computer
4:01 adjusts these numbers until the model
4:03 makes fewer mistakes. Model parameters
4:05 makes fewer mistakes. More parameters
4:07 can capture more patterns, but they also
4:09 cost more to store and run. Model
4:10 parameters are the numbers that the
4:11 model learns during training.
4:13 Hyperparameters are numbers that we set.
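The parameter/hyperparameter split can be sketched with the tiniest possible "model": one learned number fit by gradient descent. This is my own toy illustration of the distinction:

```python
# Toy model: predict y = w * x. The weight w is a PARAMETER (learned);
# the learning rate and step count are HYPERPARAMETERS (set by us).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true relationship: y = 2x

learning_rate = 0.01  # hyperparameter: we choose it
steps = 1000          # hyperparameter: we choose it
w = 0.0               # parameter: starts at a guess, updated by training

for _ in range(steps):
    for x, y in data:
        error = (w * x) - y             # how wrong the model currently is
        w -= learning_rate * error * x  # nudge w to shrink the mistake

print(round(w, 2))  # ends up close to 2.0: the model "learned" the pattern
```

A real foundation model does the same thing with billions of parameters instead of one, but the idea is identical: training adjusts the learned numbers, while humans pick the knobs around them.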
4:14 One important setting is called
4:16 temperature. Think of it like a
4:18 creativity dial. Low temperature makes
4:20 the model stick to safe, predictable
4:22 answers, which is great when you need
4:24 accurate facts. High temperature makes
4:26 it more creative and surprising, perfect
4:28 for brainstorming or writing stories,
4:30 but riskier if you need more precise
4:32 information. Temperature works with two
4:35 other controls called top K and top P.
4:37 These limit which words the model can
4:39 choose next. Top K says only pick from
4:42 the K most likely words. So if K is
4:44 five, the model can only choose from its
4:47 top five guesses. Top P is a little bit
4:49 smarter. It builds a pool of words until
4:52 their combined likelihood hits a certain
4:55 percentage, like 90%. This pool grows or
4:57 shrinks depending on how confident the
4:59 model is, giving you a nice balance of
5:01 consistency and creativity. Although
5:03 technically models aren't actually
5:05 returning words, they're returning
5:07 tokens. A token is like a small chunk of
5:09 text. Sometimes it's a whole word like
5:11 cat. Sometimes it's just part of a word
5:14 like un from unhappy. And sometimes it's
5:16 punctuation. The model reads and writes
5:18 one token at a time. Kind of like how
5:19 you might read letter by letter when
5:21 you're first learning to read. When
5:22 people talk about how long a model's
5:25 memory is, they're counting tokens, not
5:28 words. Speaking of memory, model context
5:30 is basically how much text the model can
5:32 remember and work with at once. Think of
5:34 it like your working memory when reading
5:36 a book. You can keep track of the
5:38 current chapter and maybe the last few
5:40 chapters, but not the entire book series
5:42 at once. Context includes your
5:44 conversation history, the prompt, any
5:46 documents you've shared, and the
5:48 response being generated. Models have
5:50 context limits. When you hit that limit,
5:52 the model starts forgetting the oldest
5:54 parts to make room for new information.
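That "forgetting the oldest parts" behavior can be sketched as a sliding window over tokens. This is a simplification; real systems use smarter truncation and summarization strategies:

```python
def fit_to_context(tokens, context_limit):
    """Keep only the most recent tokens that fit in the context window."""
    if len(tokens) <= context_limit:
        return tokens
    return tokens[-context_limit:]  # oldest tokens fall off the front

conversation = ["hi", "my", "name", "is", "sam", "what", "is", "my", "name"]
print(fit_to_context(conversation, 4))
# ['what', 'is', 'my', 'name'] — "sam" was forgotten, so the model can't answer
```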
5:55 This is why very long conversations
5:57 sometimes lose track of things you said
5:58 at the beginning. When we chat with a
6:00 model, we are prompting the model,
6:02 right? And the prompt makes a big
6:03 difference for the kind of output we
6:05 get. Prompt engineering is a fancy way
6:07 of saying writing really good
6:09 instructions. You can tell the model
6:11 what role to play, like you are a
6:13 helpful teacher, what format you want,
6:15 like give me bullet points, and what
6:18 rules to follow, like always include
6:20 your sources. Good prompts are like
6:22 giving clear directions. They help the
6:24 model understand what you want and give
6:26 more consistent results. There are
6:28 actually two types of prompts. The
6:30 system prompt is like the house rules.
6:32 It sets the model's default behavior and
6:34 stays the same for every conversation.
6:37 The user prompt is your specific request
6:39 right now. Think of the system prompt as
6:41 telling someone you're a professional
6:43 email writer and the user prompt as
6:45 write an email declining this meeting.
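In code, this split usually shows up as a list of messages sent to the model's API. Here's a sketch in the style of common chat APIs like OpenAI's (exact field names vary by provider, and the client call is only a commented hint):

```python
# System prompt: the standing "house rules". User prompt: this specific request.
messages = [
    {"role": "system", "content": "You are a professional email writer. "
                                  "Be polite and concise."},
    {"role": "user", "content": "Write an email declining this meeting."},
]

# A real API call would look roughly like (hypothetical client object):
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```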
6:46 Sometimes you don't need to give
6:48 examples in your prompt. You just ask
6:50 the model to do something and it figures
6:52 it out. That's called zero-shot learning.
6:55 Like asking "translate this to Spanish"
6:57 without showing any examples first. But
6:59 often showing a few examples works way
7:01 better. That's few-shot learning. You
7:03 include a handful of examples in your
7:05 prompt to show the exact style you want.
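A few-shot prompt is just worked examples pasted in before the real request. A sketch, with examples I invented for illustration:

```python
def build_few_shot_prompt(examples, new_input):
    """Show the model input->output pairs, then ask it to continue the pattern."""
    parts = ["Classify the sentiment of each review as positive or negative.\n"]
    for review, label in examples:
        parts.append(f"Review: {review}\nSentiment: {label}\n")
    parts.append(f"Review: {new_input}\nSentiment:")  # the model completes this
    return "\n".join(parts)

examples = [
    ("Loved it, would buy again!", "positive"),
    ("Broke after two days.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Exceeded my expectations.")
print(prompt)
```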
7:07 The model then copies your pattern for
7:09 your new request. This is all part of
7:11 something called in context learning.
7:12 You're not permanently changing the
7:14 model. You're just teaching it
7:16 temporarily within one conversation by
7:18 including examples right in your
7:19 message. If you want more permanent
7:22 changes, you can use fine-tuning. This
7:24 actually retrains the model on your own
7:26 examples, so it consistently behaves a
7:28 certain way. Unlike prompting, this time
7:30 you're actually changing the model's
7:32 internal parameters. It's useful for
7:34 specialized language like medical or
7:36 legal writing or getting a very specific
7:38 tone or output format. It costs more
7:40 time and money than prompting, but
7:41 you'll probably get more reliable
7:43 results. Full fine-tuning can be
7:45 expensive, so there's a shortcut we use
7:48 called PEFT, parameter-efficient
7:50 fine-tuning. Instead of changing the
7:52 entire model, you just add a small
7:55 adapter layer on top. It's like editing
7:57 just part of a document versus rewriting
7:59 the entire document from scratch. You
8:01 get most of the benefits of fine-tuning
8:03 with way less compute and storage. LoRA
8:07 is an example of PEFT. Two other ways to
8:08 make models more practical are
8:10 quantization and distillation.
8:12 Quantization is like compressing a
8:14 high-resolution photo. You store the
8:17 model's numbers with fewer bits, which
8:19 makes it smaller and faster while
8:20 keeping most of the quality.
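Quantization can be sketched by mapping floating-point weights to small integers plus one scale factor, then mapping back. This is a simplification of real schemes like symmetric int8 quantization:

```python
def quantize(weights, bits=8):
    """Map floats to small integers; store the integers plus one scale factor."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit signed integers
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the compact integers."""
    return [q * scale for q in q_weights]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize(weights)          # ints like 17, -71, 47, 127, -10
restored = dequantize(q, scale)
# Each restored weight is close to the original, but each was stored
# in 8 bits instead of 32 or 16 — that's the compression.
```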
8:22 Distillation is different. In this case,
8:25 you train a smaller student model to
8:27 copy a larger teacher model. The student
8:29 learns from both the teacher's answers
8:32 and how confident the teacher is about them. You
8:34 end up with a faster, lighter model that
8:35 keeps much of the original's knowledge
8:38 and capabilities. Most consumer AI
8:40 models go through one more step called
8:42 preference fine-tuning. Humans rate
8:44 different model responses, and the model
8:46 learns to prefer answers that people
8:47 like. This pushes it toward being more
8:49 helpful, safe, and polite rather than
8:51 just technically correct. It's the
8:52 difference between a model that can
8:54 write and one that writes in a way
8:56 humans actually want to read, or the
8:57 difference between a sycophantic model
8:59 and one that is more down to earth. Now,
9:00 even the best models sometimes make
9:02 stuff up or don't know recent
9:04 information. That's where RAG comes in:
9:06 retrieval-augmented generation. Instead
9:08 of just relying on what the model
9:09 memorized during training, it first
9:11 looks up relevant documents maybe from a
9:13 database, then writes its answer using
9:16 that fresh information. This reduces
9:18 made-up facts and keeps answers current
9:19 without having to retrain the whole
9:21 model. RAG depends on something called
9:23 embeddings. An embedding turns text into
9:26 a list of numbers where similar meanings
9:28 end up close together mathematically. So
9:30 car and automobile would have very
9:31 similar lists of numbers even though
9:33 they're very different words. This lets
9:35 you search by meaning, not just exact
9:37 word matches. These embeddings get
9:39 stored in a vector database, which is
9:41 basically a database that's really good
9:43 at quickly finding the most similar
9:44 embeddings to your question. When you
9:46 ask something, the system finds the
9:48 closest matching documents and feeds
9:50 them to the model along with your
9:52 question. Before documents go into the
9:54 vector database, they get split up
9:55 through chunking. You can't stuff an
9:57 entire book into the model at once
9:58 because of the context limit. So, you
10:00 break it into smaller pieces that are
10:02 easier to search and process. If you go
10:03 too big, you get a lot of irrelevant
10:05 stuff. If you go too small, you lose
10:07 important context. After finding
10:09 relevant chunks, ranking puts them in
10:12 order of usefulness. The best evidence
10:14 goes to the top, which leads to shorter,
10:16 clearer, and more accurate answers.
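The retrieve-then-rank loop can be sketched end to end. Here the "embeddings" are toy hand-made vectors; a real system would call an embedding model and a vector database. This is my illustration of the flow, not production code:

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: closer to 1.0 means closer meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Tiny "vector database": chunk of text -> made-up embedding vector
chunks = {
    "Cars need an oil change every 5,000 miles.": [0.9, 0.1, 0.0],
    "Our office is closed on public holidays.":   [0.0, 0.2, 0.9],
    "Automobiles require regular maintenance.":   [0.8, 0.3, 0.1],
}

def retrieve(query_embedding, top_k=2):
    """Rank every chunk by similarity to the query and keep the best ones."""
    ranked = sorted(chunks,
                    key=lambda c: cosine_similarity(chunks[c], query_embedding),
                    reverse=True)
    return ranked[:top_k]

query_embedding = [0.85, 0.2, 0.05]  # pretend embedding of "how do I service my car?"
context = retrieve(query_embedding)
# The two car-related chunks rank highest, even though one says "automobiles"
# instead of "cars" — search by meaning, not exact words. These chunks would
# then be pasted into the prompt alongside the user's question.
```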
10:17 Under the hood, different parts of these
10:20 systems use encoders and decoders. An
10:23 encoder turns text into those compact
10:25 numeric summaries that capture meaning,
10:27 perfect for search and understanding. A
10:29 decoder turns summaries back into human
10:32 text, one token at a time. Great for
10:34 generating responses. Some models only
10:36 encode, some only decode, and some do
10:38 both. So far, we've talked about models
10:40 that just chat, but agents can actually
10:42 do things. An agent is like an AI
10:44 assistant that can plan steps and take
10:45 actions to reach a goal. It might search
10:47 the web, read the results, do some
10:49 calculations, write up an answer, and
10:50 send it back to you. Agents can use
10:52 memory from past conversations, retry
10:54 when something fails, and adjust their
10:57 plans. Agents work by calling tools:
10:58 external abilities like web search,
11:01 calculators, code runners, email,
11:03 calendars, and databases. Tools let the
11:05 model move from just words to actual
11:07 actions. But how do agents actually use
11:08 tools? It's actually pretty simple. When
11:10 you build an agent, you give it access
11:12 to functions it can call. For example,
11:15 you might have a search_web(query)
11:17 function or a send_email(to, subject,
11:20 body) function. The
11:21 agent doesn't run this code directly.
11:24 Instead, it generates a special message
11:26 saying, "I want to call search_web with
11:28 the query 'weather in Paris.'" And your
11:29 application code actually makes that
11:31 function call, then feeds the results
11:33 back to the agent. All of this leads to
11:35 inference, the moment when you actually
11:37 run the trained model to get an answer.
11:39 The model generates one token at a time
11:41 guided by your temperature and top K or
11:43 top P settings. Cost depends on how many
11:45 tokens you process and how big the model
11:47 is. There are two main ways to do
11:49 inference. Online inference serves
11:52 answers in real time to live users like
11:54 ChatGPT responding as you type. It
11:55 needs to be fast and handle traffic
11:57 spikes smoothly. Batch inference
11:59 processes lots of items offline like
12:01 overnight jobs. You trade instant
12:02 responses for higher throughput and
12:04 lower costs. It's perfect for things
12:06 like classifying millions of reviews or
12:08 summarizing archives. For online
12:10 systems, latency matters a lot. Latency
12:12 is the delay between asking a question
12:14 and getting the first useful output.
12:16 Users definitely notice the difference
12:18 between 200 milliseconds and 2 seconds.
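The gap users feel can be sketched by simulating token-by-token generation and measuring time to first token versus time to the full answer. This is a toy simulation with a sleep standing in for real generation, not a real API client:

```python
import time

def generate_tokens(answer, seconds_per_token=0.01):
    """Simulate a model producing one token at a time."""
    for token in answer.split():
        time.sleep(seconds_per_token)  # stand-in for real generation time
        yield token

start = time.time()
first_token_at = None
tokens = []
for token in generate_tokens("Paris is the capital of France"):
    if first_token_at is None:
        # Time to first token: the delay users actually perceive
        first_token_at = time.time() - start
    tokens.append(token)
total_time = time.time() - start

# Showing tokens as they arrive means the user sees "Paris" after roughly
# one token's worth of time, instead of waiting for the whole sentence.
```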
12:20 Streaming helps by showing partial
12:22 results immediately instead of waiting
12:24 for the complete answer. So, we can
12:25 evaluate how good our system is by how
12:27 fast it returns results. But how do we
12:29 know if the actual models we're using
12:31 are any good? One way is with model
12:32 benchmarks. Benchmarks are like
12:34 standardized tests that compare models
12:36 on skills like math, coding, reading
12:38 comprehension, and safety. They're
12:39 helpful for tracking progress, but
12:41 they're not the whole story. Real world
12:43 performance, and human feedback still
12:44 matter most. When researchers want to
12:46 measure model performance more
12:48 precisely, they use specific metrics.
12:50 Here are three of the most common ones
12:52 in AI engineering: perplexity, BLEU, and
12:55 ROUGE. Perplexity measures how surprised
12:56 a model is by text it hasn't seen
12:59 before. Lower perplexity means the model
13:01 predicted the text better. It's less
13:02 confused. Think of it like a reading
13:04 comprehension test. If you can predict
13:06 what comes next in a story, you probably
13:08 understood it well. BLEU and ROUGE are
13:09 used for tasks like translation and
13:11 summarization. BLEU compares a model's
13:14 output to reference answers by
13:16 counting matching words and phrases. If
13:18 a translation shares lots of words with
13:20 a professional human translation, it
13:21 gets a high BLEU score. ROUGE is
13:24 similar, but focuses on recall: how much
13:25 of the important content did the model
13:27 capture? It's especially useful for
13:29 summarization. If a model's summary
13:30 includes most of the key points from the
13:32 original text, it gets a high ROUGE
13:34 score. These metrics aren't perfect. A
13:35 translation could have all the right
13:37 words but still sound weird, or a
13:38 summary could hit all the keywords but
13:40 miss the main point. But they give
13:42 researchers a quick automated way to
13:43 compare different models. Another way to
13:45 tell how good the model is is to use one
13:47 model to grade another model's answers.
13:50 An LLM-as-judge scores responses against
13:52 criteria like "followed instructions," "used
13:55 reliable sources," or "clear writing." It's
13:57 much faster and cheaper than using only
13:59 human reviewers. Though it's not
14:01 perfect, so spot checks are important.
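An LLM-as-judge setup is mostly prompt construction plus parsing the judge's verdict. A sketch, with a hypothetical call_model function standing in for a real API client:

```python
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 to 5 for accuracy and clarity.
Reply with only the number."""

def judge_answer(question, answer, call_model):
    """Ask a judge model to score an answer; call_model is any text->text function."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    verdict = call_model(prompt)
    return int(verdict.strip())

# With a real API, call_model would hit a strong model; here, a fake for illustration:
fake_judge = lambda prompt: " 4 "
score = judge_answer("What is 2+2?", "Four.", fake_judge)
# Low scores could then be flagged for the human spot checks mentioned above.
```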
14:03 Finally, as AI systems get more complex,
14:04 we need better ways for different parts
14:07 to work together. MCP, or Model Context
14:09 Protocol, is like a universal adapter
14:11 that lets models, apps, and tools
14:13 connect through the same interface.
14:15 Instead of building custom integrations
14:16 for everything, you just plug into the
14:18 standard. It's about making AI systems
14:20 that actually work together instead of
14:22 being isolated little islands. And that
14:24 wraps up our list of the main topics in
14:26 AI engineering at a high level. I have a
14:28 comprehensive 76-minute crash course that
14:30 goes into way more detail, so check that
14:32 out next. Thank you so much for watching