This content demonstrates how to effectively manage and orchestrate machine learning workflows using Azure Machine Learning pipelines, emphasizing the benefits of breaking down complex processes into modular, reusable steps.
Within your machine learning process, if you don't have a proper pipeline to orchestrate your ML steps, from gathering your data, to massaging and pre-processing it, to feeding the machine learning model, then unfortunately this mess will happen to your ML Ops project and your data.
Wow, what a performance!
Wait, if you knew what I'm going to talk about today, you would prefer to make the mess here rather than in your machine learning project, which you have invested your time, budget, and resources in.
What do you mean?
Okay, are you still writing down all your machine learning steps in just one single notebook and executing them?
Um, almost.
Well, yes, I mean, yeah.
Then that's not best practice at all, because you cannot manage and scale your steps properly.
Oh, can you show me an example?
Of course.
Hello everyone, this is MG, and we're back with another video session in which we are going to show you how you can leverage Azure Machine Learning pipelines to manage and orchestrate your machine learning steps properly. Let's go for it.
All right, here is my Azure Machine Learning workspace, and welcome everyone to my notebooks. As you can see, I have already created a folder called the NYC taxi data regression model. As you might know, this is a pretty famous dataset that comes from the Azure Open Datasets. I'm going to fetch the data and train a regression model, applying all the pre-processing steps you need, by leveraging an Azure Machine Learning pipeline.
I have already created a notebook. I assume you're familiar with the Azure Machine Learning workspace in general; if you're not, make sure you check the first video, part one of the Azure ML video series on the YouTube channel. That playlist has 10 parts, and the first part starts with a walkthrough of the Azure Machine Learning workspace and its components. We talked about what compute is, how to create one, how to create notebooks, what the working directory is, and so on. So I'm assuming you're already familiar with it; if you haven't watched that video, watch it first and then resume this one here.
I have a compute running called mg01; I think that's just a CPU machine, yes. I already have the code, so I'll paste it here in my notebook and explain step by step what we are doing.
So what is the first step?
Certainly grabbing the data. What I am doing here is importing some of the libraries and packages I need, for example to work with a pandas DataFrame, and from Azure Open Datasets, as I mentioned, I am getting the NYC taxi data for the green and yellow taxis. I'll tell you shortly what we are going to predict using this dataset. Here I'm simply creating a DataFrame to push the data into, and then I have some datetime definitions, from what time to what time I'm going to grab the data from this open dataset, for this number of months, and this is the maximum sample size I want, just to limit the amount of data that I'm going to load. That's it, so I'm going to grab the data for the green taxi, call it green_df_raw, and I'm good to go. Let's execute that.
I'm actually putting them in separate cells, and we can do exactly the same thing for the yellow taxi data, and then we are going to merge them together. As an example, in your actual use cases and projects you might certainly have more than just one dataset or data store to grab the data from, merge them, and potentially do some featurization steps on top to come up with clean, unified features to fit the machine learning model. That's why here we have, as an example, more than just one dataset: the first one was the green taxi and the second one is the yellow taxi. Doing the same thing, I was able to create these two DataFrames.
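For reference, a minimal sketch of what those two cells might look like; the date range and sample size here are placeholders, not the exact values from the video:

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta
from azureml.opendatasets import NycTlcGreen, NycTlcYellow

# Placeholder time window and row cap to limit how much open data we pull
number_of_months = 1
sample_size = 5000
start_date = datetime(2016, 1, 1)
end_date = start_date + relativedelta(months=number_of_months)

# Green taxi trips as a pandas DataFrame
green_df_raw = NycTlcGreen(start_date=start_date, end_date=end_date) \
    .to_pandas_dataframe().sample(n=sample_size, random_state=0)

# Yellow taxi trips, fetched exactly the same way
yellow_df_raw = NycTlcYellow(start_date=start_date, end_date=end_date) \
    .to_pandas_dataframe().sample(n=sample_size, random_state=0)
```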
Now let's take a quick look at these DataFrames that we just created. Just waiting for it to finish... all done. I'm creating a new cell, pasting the code here again, and running it. You can see here I have some informational features: the number of passengers, the drop-off time, the pickup time and location, the distance of the journey, some latitude and longitude information about the location, and so on. One of these columns, if I'm not mistaken, is actually about the fare or cost of that taxi ride. Yes, here's the fare amount. What we are going to do is train a regression model to predict this value, so this will be our target.
Okay, now that we've taken a look at the data, it's time to download this data locally and then upload it to the blob storage that we have defined before as a data store. So here I'm just creating a directory locally based on the type of data you're gathering, green and yellow, writing it in Parquet format, and you're good to go. I'm executing the code, and you can see that the data is written to a local folder. Let's figure out where the data is: I have defined a folder called data, that's here, and inside it I should have green and yellow folders. There you go, they are here, and inside them you can see I have the Parquet files that I grabbed from the Azure Open Datasets.
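A minimal sketch of that local write, assuming the folder layout (data/green, data/yellow) and placeholder file names:

```python
import os

# One sub-folder per taxi colour, each holding one Parquet file (paths are assumptions)
for color, df in [("green", green_df_raw), ("yellow", yellow_df_raw)]:
    folder = os.path.join("data", color)
    os.makedirs(folder, exist_ok=True)
    df.to_parquet(os.path.join(folder, f"{color}_taxi_raw.parquet"))  # needs pyarrow or fastparquet
```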
If you remember, again from part one of the Azure ML video series, we talked about what datasets and data stores are in Azure Machine Learning, which are basically here under Datastores and Datasets. A data store is just a pointer to a location where you have your data; that can be a blob storage, a SQL database, a Postgres database, ADLS Gen 2 (Azure Data Lake Storage Gen 2), and so on. By default, Azure ML comes with a storage account on the back end, which is the default blob storage I have here, and I want to create a dataset out of what I downloaded and put into that default data store.
Let me paste the code and explain it better. What you can see here: simply with this line I'm connecting to my workspace; after doing so, I'm calling my default data store, which is a blob store, having that as a pointer here, and I'm just uploading these files that I downloaded to the target, which is my default data store. That's it. I'm going to upload them there, and it's going to take a couple of seconds. All done.
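Roughly, that cell does something like this; the target paths on the datastore are assumptions:

```python
from azureml.core import Workspace

# Connect to the workspace (expects a config.json next to the notebook)
ws = Workspace.from_config()

# The default blob datastore that comes with the workspace
default_store = ws.get_default_datastore()

# Upload the local Parquet folders; target paths are placeholders
default_store.upload(src_dir="data/green", target_path="green", overwrite=True)
default_store.upload(src_dir="data/yellow", target_path="yellow", overwrite=True)
```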
What am I going to do after this? Now I need to register this data as a dataset. Why? Again, this is certainly a best practice in Azure ML: any data that you're using to train models or do any preprocessing, make sure you register it as a dataset first. The reason is that with this defined concept of a dataset in Azure ML, you can version and visualize your dataset; you can later track and check in the logs which dataset was used to train your model and who created that dataset; and whenever we create a dataset, Azure ML automatically creates a dashboard that shows some statistical results for it, for example the distribution of each feature, the mean, max, median, standard deviation, and so on. That being said, these are just some of the very high-level benefits of the Azure ML dataset concept, but again, check out the video I mentioned earlier. Quickly recapping on why we are registering it here.
So the dataset that I'm going to create is a tabular dataset, and I have already pointed it to the path where I uploaded the data in the default data store, and we should be good to go. Now I can just register it by saying: this data that I defined here, register it within the workspace that we defined, which is my Azure Machine Learning workspace, and this is the name that I'm going to give this dataset.
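A sketch of the registration step, assuming the datastore paths used above and placeholder dataset names:

```python
from azureml.core import Dataset

# Tabular datasets built from the Parquet files we uploaded
green_ds = Dataset.Tabular.from_parquet_files(path=[(default_store, "green/*.parquet")])
yellow_ds = Dataset.Tabular.from_parquet_files(path=[(default_store, "yellow/*.parquet")])

# Register them in the workspace so they are versioned and traceable
green_ds = green_ds.register(workspace=ws, name="green_taxi_data", create_new_version=True)
yellow_ds = yellow_ds.register(workspace=ws, name="yellow_taxi_data", create_new_version=True)
```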
Actually, before I go further, let me show you what we have done by registering this as a dataset, to make it more tangible. If I go to Datasets, you can see that the yellow taxi data and green taxi data have been created for me, with the date I was recording the video, who created them, and so on. And as I told you, if you click on Explore it will show you a dashboard with some information about the data; if I click on Profile, you can see the distribution here, the mean, max, count, and so on. Whatever I do on top of this data will be tracked and logged under the definition of this dataset, which makes it fairly easy for me to follow up and have traceability on what I'm doing in this workspace within my machine learning process. So, going back to the notebook, now you have a better understanding of what this dataset is exactly and why you would use and register it.
The next thing I'm going to do: for the pipeline that we're going to execute, we certainly need a compute engine, so here I am creating a CPU cluster. You can create one using the UI in the Compute section, as I showed you in the previous video, but here you can just use code. I'm telling it that if that compute target already exists, skip the creation; if not, create it. This is the VM size and type that I want, with a maximum of four nodes, created inside this workspace. This is my Azure ML compute cluster name, and these are the configurations I defined above. It should be fine; I think I've created this before. Yes, it found the existing cluster, so it's not creating a new one for me.
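The create-or-reuse logic looks roughly like this; the cluster name and VM size are placeholders:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"  # assumed name

try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, using it.")
except ComputeTargetException:
    # Otherwise provision a small CPU cluster with at most four nodes
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
    aml_compute = ComputeTarget.create(ws, cluster_name, config)
    aml_compute.wait_for_completion(show_output=True)
```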
Perfect. Now we've got the data, we registered the data as datasets, and we have a compute cluster to execute our code and pipelines later. So what is next? I am going to define my run configuration. This is the code... oh no, I pasted the same code again; let me copy it again from the other window and paste it here. All right.
So what is a run configuration? Well, for executing any code, even a single Python script, you certainly need the Python libraries and packages that you used in your preprocessing or training steps, right? In Azure ML you can define them basically as an image, like in Docker, using a dependencies YAML file, and then whenever you execute this code, at inference time or any time you run this pipeline again, the compute engine knows what operating system requirements and packages are needed to execute it. That's why, regardless of which computer I used to develop and test this code, when I pass this solution through a DevOps process, say to a staging or production environment, I wouldn't have concerns about it failing because of missing packages. So again, it is pretty much a must-do best practice to define the environment. In Azure ML there are some predefined environments that you can use, with packages already installed; here I want to manage my own environment by adding the libraries I want, so that's why I'm setting user-managed dependencies to false, and I should be fine. When I execute this code, it tells me that my run configuration is created. That's the environment I have defined for running the pipeline that I'm going to execute later.
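A minimal sketch of such a run configuration; the package list is an assumption, not the exact environment from the video:

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Run configuration shared by the pipeline steps
aml_run_config = RunConfiguration()
aml_run_config.target = aml_compute

# Let Azure ML manage the environment rather than a user-managed one
aml_run_config.environment.python.user_managed_dependencies = False

# Add whatever your scripts actually import (placeholder packages)
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["pandas", "scikit-learn"],
    pip_packages=["azureml-sdk", "pyarrow"],
)
```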
So what's next? Now I'm going to do some very high-level preprocessing on top of the data. Again, I'm going to show you which columns we have, and I just want to select the ones I want to use for this use case. And here you go, I have selected the columns that I want.
And then the fun part gets started, so let me copy the code. What I'm doing here: first I'm defining the path where my scripts live. You might ask, what are those scripts? Remember I just told you that we do not create and develop all our code, all the preprocessing steps, in just one single notebook. What I have done before is grab these scripts from the Azure ML GitHub samples. Let me find them: if I go to scripts, this path, prep data, you can see I have different Python scripts; one of them cleans the data, one of them filters the data, others merge, normalize, and transform it. I have each of them as a separate Python script, and now I'm going to create a pipeline that calls these steps in the sequence I define and orchestrates all of them end to end, so I can automate it. So what I'm doing here at a high level is defining where these folders and Python files are, which is here, and I'm renaming some of the columns, just a simple renaming, and replacing these values here.
And then I am defining a PipelineData object, and the reason is that I want to define the output after I execute the first Python script, which is clean.py; this is the first step that I'm going to execute in my Azure ML pipeline. There is definitely an outcome, clean data, from the execution of this file, so I need to define the output as PipelineData because later I want to pass the output of this file to the next step, which is another script that does the filtering. After defining that, I am defining a step for my Azure Machine Learning pipeline. What is this step? It's a PythonScriptStep. Why? Because I want to execute a Python script; here is its name, and I want to have the output as I defined on top. These are some of the arguments I want to pass to this Python file: the useful columns I defined above, which I'll show you, meaning the names of the columns, and the output that we have defined here.
Let me actually open this file to show you how this should be defined inside clean.py. If I open this Python script, you can see that we have defined a function here doing some pre-processing, and then you can see that I have defined some arguments for this script, which is exactly the same thing you just saw in my main code. So I'm showing you that, within the definition of a step for your Azure ML pipeline, you can define parameters in your script, clean.py or any other Python script you're going to execute, and from your pipeline step pass in data as the values of the arguments you have defined.
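Inside clean.py the argument handling would look roughly like this; the argument names and the input name are assumptions for illustration:

```python
# clean.py (sketch)
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser("cleanse")
parser.add_argument("--useful_columns", type=str, help="columns to keep")
parser.add_argument("--columns", type=str, help="column rename mapping")
parser.add_argument("--output_cleanse", type=str, help="folder for the cleansed output")
args = parser.parse_args()

# The registered dataset passed in by the pipeline step is available via the run context
run = Run.get_context()
raw_df = run.input_datasets["raw_data"].to_pandas_dataframe()

# ... cleansing logic ..., then write the result under args.output_cleanse
```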
Then I'm saying that these are the input datasets; for the green taxi data I'm going to execute this clean script on top. And just before I forget, there's a very interesting parameter, allow_reuse, which is set to true. That means: if I execute my whole pipeline again, of course the first time this step will run this Python script, because that's what we've defined here, but if the data hasn't changed and this step is going to generate the same outcome, why should I execute it again? So here I'm telling my Azure ML pipeline that if nothing has changed when I execute the whole pipeline again, and the outcome is going to be the same, just don't re-execute it; let's save some cost and compute, skip this step, and go to the next one. The next step can check the same thing, and so on.
So that's the first step of my Azure ML pipeline: it simply calls this Python script, pushes some arguments, and defines the input and output. This is the compute target that I want to use to execute this step. What does that mean? It means that for each step of your Azure ML pipeline you can use a different compute type: for preprocessing I can use a CPU machine, which is the one we defined on top, but for training I could use, say, a GPU machine, another type of compute. That's how I can granularly scale each step of my Azure ML pipeline over different computes and different environments. And here I'm saying runconfig equals the ML run config we defined on top, if you remember; I could define different packages for, say, filter.py or merge.py with a different run config environment, so I have full flexibility on that.
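Putting that together, the first step might be defined like this; useful_columns and green_columns stand in for the column lists defined earlier in the notebook, and the folder name is an assumption:

```python
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Intermediate output of the cleansing step, handed to the next step
cleansed_green_data = PipelineData("cleansed_green_data", datastore=default_store)

cleanse_green_step = PythonScriptStep(
    name="cleanse_green_taxi_data",
    script_name="clean.py",
    source_directory="scripts/prepdata",            # assumed script folder
    arguments=["--useful_columns", useful_columns,   # placeholders defined earlier
               "--columns", green_columns,
               "--output_cleanse", cleansed_green_data],
    inputs=[green_ds.as_named_input("raw_data")],
    outputs=[cleansed_green_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    allow_reuse=True,   # skip re-running when code and inputs are unchanged
)
```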
I know I explained a lot here because it was very important, but the rest should go very fast. Very similarly, I'm doing the same thing again, this time for the yellow taxi data instead of the green one. Almost everything is exactly the same; only the DataFrame is different. Did I execute this one? No, how did I forget? Okay, this one is done. By the way, when I execute this cell it doesn't mean I'm running the Python script, not at all; I am just defining the steps of my Azure ML pipeline, and we haven't finished the pipeline definition yet.
So let's create another cell. Let's say that after cleaning these data, I now want to merge the green taxi data with the yellow taxi data; that's the next step I'm going to have in my pipeline. Here is the code. The same thing: I am defining a PipelineData object because I want the output of merge.py to be passed on to the next step. I'm saying that this is my Python script, under the same path I defined on top; I am passing some arguments, which include my merged data, sorry, as an output, and these are my inputs, which are the cleansed outputs from the previous steps. The same things I already explained, so again it's just a different step of the same type.
The next one is filter.py, which is here; you might guess that we need to use the same PythonScriptStep type, because yes, we are going to call another Python script in our pipeline. The same thing again: defining the output as PipelineData, these are my inputs and the arguments this script needs, so I'm passing them through here, allow_reuse is true, and execute. Again, I'm not running any code, I'm just defining my steps. The next one, pretty fast, is normalize: I'm going to call this Python script and define the outcome as PipelineData, I can just give it a name, and the rest is very similar; only the inputs and outputs are slightly different because they're coming from previous steps. And after normalizing the data, I have another script that transforms the data, which is transform.py, again a PythonScriptStep. Executing, done.
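The merge step, as a sketch, just consumes the two cleansed outputs (cleansed_yellow_data being the yellow counterpart defined the same way); filter, normalize and transform follow the same pattern:

```python
# Output of the merge step, passed on to filtering
merged_data = PipelineData("merged_data", datastore=default_store)

merge_step = PythonScriptStep(
    name="merge_taxi_data",
    script_name="merge.py",
    source_directory="scripts/prepdata",
    arguments=["--output_merge", merged_data],
    inputs=[cleansed_green_data, cleansed_yellow_data],  # outputs of the two cleansing steps
    outputs=[merged_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    allow_reuse=True,
)
# filter.py, normalize.py and transform.py are wired up the same way,
# each taking the previous step's PipelineData as its input.
```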
Now it's time to call another step; let me paste the code here. You can see that this time I am again calling another Python script as a step, but the location is different: it is under scripts, here, train model, which is here, and I have a script that splits the data into train and test. Even this I have defined as a separate script to call as a pipeline step. The rest is exactly the same, but here I have defined two PipelineData objects, because this script has two outcomes: one is the training data and the other is the test data, and I define them as datasets, and I told you why that's the best practice here.
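A sketch of that split step with two outputs; transformed_data stands for the transform step's output, and the names and folder are assumptions:

```python
# Two intermediate outputs, promoted to datasets so they can feed AutoML later
output_split_train = PipelineData("output_split_train", datastore=default_store).as_dataset()
output_split_test = PipelineData("output_split_test", datastore=default_store).as_dataset()

train_test_split_step = PythonScriptStep(
    name="train_test_split",
    script_name="train_test_split.py",
    source_directory="scripts/trainmodel",           # assumed folder
    arguments=["--output_split_train", output_split_train,
               "--output_split_test", output_split_test],
    inputs=[transformed_data],                        # output of the transform step
    outputs=[output_split_train, output_split_test],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    allow_reuse=True,
)
```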
So let me run the code, and as you might guess, it's potentially time to train the model; we're almost done with all the pre-processing steps. So what are we going to do? Any use case you're developing in Azure ML, any code you're executing for a certain use case, you can put under just one experiment and give it a name. Within that, you can isolate the different use cases you're working on and have better traceability. That's why I'm first creating an experiment and giving it a name, just anything as an example, because it's a pipeline I'm going to create.
And now this is the section where I'm going to train my model, but I'm using AutoML. Remember, I talked about AutoML in the first video of the video series; you don't need to go through the UI to generate one, you can do it in code as well. You just import AutoMLConfig and say: hey, I'm going to use AutoML for creating a regression model, with this number of iterations and this experiment timeout; my primary metric, to figure out which is the best model out of all the AutoML-generated models, is for example Spearman correlation, which can be changed to whichever metric you want depending on regression or classification; and the number of cross validations is here. Here I have defined the columns that I want to have, I'm saying this is a regression task, I'm telling AutoML to use this compute and to do the featurization automatically if needed, and this is what I'm going to predict: the cost of the journey. Actually, let me execute it and explain the step. Okay, there is something wrong here; I'm sorry, I forgot to copy part of the cell. Okay, so now we have created the configuration needed for AutoML.
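Together with the experiment, that configuration might look like this; the experiment name, iteration count, timeout and label column are placeholders:

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig

# One experiment to group every run of this use case
experiment = Experiment(ws, "nyc-taxi-regression-pipeline")

automl_config = AutoMLConfig(
    task="regression",
    iterations=25,
    experiment_timeout_hours=0.5,
    primary_metric="spearman_correlation",
    n_cross_validations=5,
    training_data=output_split_train.parse_parquet_files(),  # training split from the previous step
    label_column_name="cost",        # assumed name of the fare/cost column
    compute_target=aml_compute,
    featurization="auto",
)
```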
But the question is, how can we add this training as a step to our AML pipeline? Our pipeline now has all these steps, but not this one yet, sorry, and this one as well, but not the AutoML one yet. That's what I'm going to do now: instead of a PythonScriptStep, I am defining a step of a different type, an AutoMLStep. This is a step type defined in Azure ML to call an AutoML training run and execute the training there to come up with the best model. I'm giving it a name, and for the configuration of this AutoML step I'm referring to what I created on top, plus the same allow_reuse parameter here. Done, so now the step is created.
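As a sketch, the AutoML training step just wraps that configuration:

```python
from azureml.pipeline.steps import AutoMLStep

# Train with AutoML as a pipeline step, reusing the config defined above
train_step = AutoMLStep(
    name="automl_train_best_model",
    automl_config=automl_config,
    allow_reuse=True,
)
```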
Now what am I going to do? We're all done with our steps, so now we need to take those steps we created before, create a pipeline, and push those steps into it. With that, we will see the pipeline get built; again, nothing has been executed yet. When I do submit, and what I'm submitting to the experiment we defined on top is this pipeline with all these steps, that's the moment the cool part gets started and all the steps we defined on top will be executed in the sequence we have defined. So I run the cell, and let's see what happens. The pipeline is built, perfect. Now it should be executed with experiment.submit: pipeline submitted for execution.
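Building and submitting the pipeline boils down to something like this; the step variables are the ones sketched above (the filter, normalize and transform steps follow the same pattern):

```python
from azureml.pipeline.core import Pipeline

pipeline_steps = [cleanse_green_step, cleanse_yellow_step, merge_step,
                  filter_step, normalize_step, transform_step,
                  train_test_split_step, train_step]

# Assemble the graph; dependencies come from the PipelineData inputs/outputs
pipeline = Pipeline(workspace=ws, steps=pipeline_steps)
pipeline.validate()

# Submitting to the experiment actually kicks off the run
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```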
Now I can see the details of the pipeline, because any pipeline, depending on the workload, the compute engine, and the type of code, can take a while. I can actually see the details of the runs and the pipeline inside. You can see that it is beautifully showing all the steps we have defined: filter, normalize, transform, train/test split, which have not started yet, and AutoML; this one is from a previous run that I did to test before recording. And here, you can see it is beautifully showing me all the steps, automatically visualizing how I get the data, merge it, filter it, normalize it, transform it, all the way to the AutoML step, none of which have started yet, I think. If it's the first time you're doing this, it's going to take a little longer, because the compute cluster we defined on top is not running yet, so the first time it will spin up the nodes and then execute all the steps. I think I did this before, so I should be able to show you the outcome without waiting here. If I go to the top of the previous step, it gives me a link to that Azure Machine Learning pipeline run, and here you can see that it has beautifully created my pipeline and visualized it so I can track what is going on, what is happening; you can see all the steps completely, from when I tested the code to automatically running the pipeline.
And the cool thing here is that I can publish this pipeline. What does that mean? Next time, when I want to automatically retrain my model or trigger all these steps again, I don't need to open these notebooks and execute them again. If I publish this, it will give me an endpoint, and with an API call I can call that endpoint, which will trigger all these steps. If you click on Publish, it will do so; I already did one, so I can show you what's going to happen when you click on Publish. If I go to Pipelines, Pipeline endpoints, you can see there is a test one that I created as a published pipeline, and if I click on it, it gives me on the right side a REST endpoint that I can use to call and trigger all of this process again.
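Publishing and then calling the REST endpoint looks roughly like this; the pipeline name and experiment name are placeholders:

```python
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Publish the finished run as a reusable pipeline with its own REST endpoint
published_pipeline = pipeline_run.publish_pipeline(
    name="nyc-taxi-training-pipeline",
    description="Cleanse, merge, transform and AutoML-train the NYC taxi model",
    version="1.0",
)

# Trigger it from anywhere that can reach Azure, e.g. an Azure Function or your laptop
auth_header = InteractiveLoginAuthentication().get_authentication_header()
response = requests.post(
    published_pipeline.endpoint,
    headers=auth_header,
    json={"ExperimentName": "nyc-taxi-regression-pipeline"},
)
print(response.json().get("Id"))  # run id of the triggered pipeline run
```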
So the nice thing here is that with this endpoint, wherever you are, whether in an Azure Function or on your local machine, as long as that source can access the internet, you can just trigger this pipeline through the endpoint and you're good to go. But you can also schedule this pipeline, triggering it on a time-based schedule or a change-based schedule. To show you an example from the Azure ML documentation: here I can create a time-based schedule to trigger this pipeline, say every 15 minutes, based on this recurrence, the pipeline ID, which we can get from the Azure ML portal, and the experiment name that we defined on top, and then you're good to go. You're basically saying that all these steps should be triggered automatically every 15 minutes. Or, instead of time, based on a change: if, within a specific data store that we defined and within a given path, the input data has changed, say new data suddenly got ingested from a source, and now we want to do all that pre-processing and train a new model using AutoML because we just got new data, then as soon as there's a change in the path you have defined in your blob, for example, this pipeline will be triggered and all the steps executed again.
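Both flavours of schedule can be created from the SDK; the schedule names and the watched path here are assumptions:

```python
from azureml.pipeline.core import Schedule, ScheduleRecurrence

# Time-based: trigger the published pipeline every 15 minutes
recurrence = ScheduleRecurrence(frequency="Minute", interval=15)
Schedule.create(ws,
                name="nyc-taxi-every-15-min",
                pipeline_id=published_pipeline.id,
                experiment_name="nyc-taxi-regression-pipeline",
                recurrence=recurrence)

# Change-based: trigger whenever new data lands under a datastore path
Schedule.create(ws,
                name="nyc-taxi-on-new-data",
                pipeline_id=published_pipeline.id,
                experiment_name="nyc-taxi-regression-pipeline",
                datastore=default_store,
                path_on_datastore="green")  # assumed path to watch
```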
And the very last thing, I would say last but not least, is this question: okay, we talked about PythonScriptStep, which is a step type for calling different Python scripts, and we also had the AutoMLStep, which we were able to configure and use to execute an automated machine learning step within our pipeline. But is that all, in terms of the different types of steps we can define for our pipeline? The answer is no. If I go again to the Azure ML documentation, here are some examples of the step types you can define in an Azure ML pipeline; it is much more powerful than just calling a Python script or doing AutoML. You can see here that I can do a lot of different things as a step in my Azure ML pipeline. For example, one of my favorites is the DatabricksStep within your AML pipeline: let's say in this section, which is normalizing the data, I want to execute this step in a Databricks workspace, and I have a notebook in Databricks there that I want to run. By defining this step as a DatabricksStep, when the Azure ML pipeline reaches this step, it will trigger that Databricks notebook on, say, a Databricks cluster that is using Spark, and then it will execute the rest of the steps on, for example, Azure ML or whatever compute targets you have defined. So, just to tell you: check out the documentation here; based on your use case and the specific project you're working on, you can leverage all the different steps here. Another example is the ParallelRunStep, which allows you to parallelize your code execution over multiple nodes, which will for sure speed up the process and efficiency.
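As one illustration, a DatabricksStep might be declared like this; the attached Databricks compute, cluster id and notebook path are all placeholders:

```python
from azureml.pipeline.steps import DatabricksStep

normalize_on_databricks = DatabricksStep(
    name="normalize_on_databricks",
    compute_target=databricks_compute,            # DatabricksCompute attached to the workspace (assumed)
    existing_cluster_id="1234-567890-abcde123",   # placeholder Databricks cluster id
    notebook_path="/Shared/normalize_taxi_data",  # placeholder notebook path
    run_name="normalize_taxi_data",
    allow_reuse=True,
)
```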
All right, we just went through an end-to-end example of how we can utilize Azure Machine Learning pipelines to manage and orchestrate your ML steps, and now you can certainly think about how this tool can improve your ML journey.
Is this... well, you gave me an ML pipeline, and I'm giving you a pipeline to drink your water.
Taste it.
This is just water.
I know, but it's all about interpretation and explanations.
Yes. Wasn't that about our previous video, about the Responsible AI dashboard, that we recorded last time?
Can we clean the floor this time?
Sorry, I have to edit the videos. No.