This content demonstrates how to effectively manage and orchestrate machine learning workflows using Azure Machine Learning pipelines, emphasizing the benefits of breaking down complex processes into modular, reusable steps.
Within your machine learning process, if you don't have a proper pipeline to orchestrate your ML steps, from gathering your data, to massaging and pre-processing it, to feeding the machine learning model, then unfortunately this mess will happen to your ML Ops project and your data.
Wow, what a performance!
Wait, if you knew what I'm going to talk about today, you would prefer to make the mess here rather than in your machine learning project, which you have invested your time, budget, and resources in.
What do you mean?
Okay, are you still writing down all your machine learning steps in just one single notebook and executing them?
Um, almost.
Well, yes, I mean, yeah.
Then that's not best practice at all, because you cannot manage and scale your steps properly.
Oh, can you show me an example?
Of course.
Hello everyone, this is MG, and we're back with another video session in which we are going to show you how you can leverage Azure Machine Learning pipelines to manage and orchestrate your machine learning steps properly. Let's go for it.
All right, here is my Azure Machine Learning workspace, and welcome everyone to my notebooks. As you can see, I have already created a folder called the NYC taxi data regression model. As you might know, this is a pretty famous dataset that comes from the Azure Open Datasets. I'm going to fetch the data and train a regression model, applying all the pre-processing steps you need, by leveraging an Azure Machine Learning pipeline.
I have already created a notebook. I assume you're familiar with the Azure Machine Learning workspace in general; if you're not, make sure you check the first video, part one of the Azure ML video series on the YouTube channel. That playlist has 10 parts, and the first part starts with a walkthrough of the Azure Machine Learning workspace and its components. We talked about what compute is, how to create one, how to create notebooks, what the working directory is, and so on. So I'm assuming you're already familiar with it; if you haven't watched that video, watch it first and then resume this one here.
I have a compute running called mg01; I think that's just a CPU machine, yes. I already have the code, so I'll paste it here in my notebook and explain step by step what we are doing.
So what is the first step?
Certainly grabbing the data. What I am doing here is importing some of the libraries and packages I need, for example to work with a pandas DataFrame, and from Azure Open Datasets, as I mentioned, I am getting the NYC taxi data for the green and yellow taxis. I'll tell you shortly what we are going to predict using this dataset. Here I'm simply creating a DataFrame to push the data into, and then I have some datetime definitions, from what time to what time I'm going to grab the data from this open dataset, for this number of months, and this is the maximum sample size I want, just to limit the amount of data that I'm going to load. That's it, so I'm going to grab the data for the green taxi, call it green_df_raw, and I'm good to go. Let's execute that.
I'm actually putting them in separate cells, and we can do exactly the same thing for the yellow taxi data, and then we are going to merge them together. As an example, in your actual use cases and projects you might certainly have more than just one dataset or data store to grab the data from, merge them, and potentially do some featurization steps on top to come up with clean, unified features to fit the machine learning model. That's why here we have, as an example, more than just one dataset: the first one was the green taxi and the second one is the yellow taxi. Doing the same thing, I was able to create these two DataFrames.
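For reference, a minimal sketch of what those two cells might look like; the date range and sample size here are placeholders, not the exact values from the video:

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta
from azureml.opendatasets import NycTlcGreen, NycTlcYellow

# Placeholder time window and row cap to limit how much open data we pull
number_of_months = 1
sample_size = 5000
start_date = datetime(2016, 1, 1)
end_date = start_date + relativedelta(months=number_of_months)

# Green taxi trips as a pandas DataFrame
green_df_raw = NycTlcGreen(start_date=start_date, end_date=end_date) \
    .to_pandas_dataframe().sample(n=sample_size, random_state=0)

# Yellow taxi trips, fetched exactly the same way
yellow_df_raw = NycTlcYellow(start_date=start_date, end_date=end_date) \
    .to_pandas_dataframe().sample(n=sample_size, random_state=0)
```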
Now let's take a quick look at these DataFrames that we just created. Just waiting for it to finish... all done. I'm creating a new cell, pasting the code here again, and running it. You can see here I have some informational features: the number of passengers, the drop-off time, the pickup time and location, the distance of the journey, some latitude and longitude information about the location, and so on. One of these columns, if I'm not mistaken, is actually about the fare or cost of that taxi ride. Yes, here's the fare amount. What we are going to do is train a regression model to predict this value, so this will be our target.
Okay, now that we've taken a look at the data, it's time to download this data locally and then upload it to the blob storage that we have defined before as a data store. So here I'm just creating a directory locally based on the type of data you're gathering, green and yellow, writing it in Parquet format, and you're good to go. I'm executing the code, and you can see that the data is written to a local folder. Let's figure out where the data is: I have defined a folder called data, that's here, and inside it I should have green and yellow folders. There you go, they are here, and inside them you can see I have the Parquet files that I grabbed from the Azure Open Datasets.
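A minimal sketch of that local write, assuming the folder layout (data/green, data/yellow) and placeholder file names:

```python
import os

# One sub-folder per taxi colour, each holding one Parquet file (paths are assumptions)
for color, df in [("green", green_df_raw), ("yellow", yellow_df_raw)]:
    folder = os.path.join("data", color)
    os.makedirs(folder, exist_ok=True)
    df.to_parquet(os.path.join(folder, f"{color}_taxi_raw.parquet"))  # needs pyarrow or fastparquet
```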
If you remember, again from part one of the Azure ML video series, we talked about what datasets and data stores are in Azure Machine Learning, which are basically here under Datastores and Datasets. A data store is just a pointer to a location where you have your data; that can be a blob storage, a SQL database, a Postgres database, ADLS Gen 2 (Azure Data Lake Storage Gen 2), and so on. By default, Azure ML comes with a storage account on the back end, which is the default blob storage I have here, and I want to create a dataset out of what I downloaded and put into that default data store.
Let me paste the code and explain it better. What you can see here: simply with this line I'm connecting to my workspace; after doing so, I'm calling my default data store, which is a blob store, having that as a pointer here, and I'm just uploading these files that I downloaded to the target, which is my default data store. That's it. I'm going to upload them there, and it's going to take a couple of seconds. All done.
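Roughly, that cell does something like this; the target paths on the datastore are assumptions:

```python
from azureml.core import Workspace

# Connect to the workspace (expects a config.json next to the notebook)
ws = Workspace.from_config()

# The default blob datastore that comes with the workspace
default_store = ws.get_default_datastore()

# Upload the local Parquet folders; target paths are placeholders
default_store.upload(src_dir="data/green", target_path="green", overwrite=True)
default_store.upload(src_dir="data/yellow", target_path="yellow", overwrite=True)
```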
What am I going to do after this? Now I need to register this data as a dataset. Why? Again, this is certainly a best practice in Azure ML: any data that you're using to train models or do any preprocessing, make sure you register it as a dataset first. The reason is that with this defined concept of a dataset in Azure ML, you can version and visualize your dataset; you can later track and check in the logs which dataset was used to train your model and who created that dataset; and whenever we create a dataset, Azure ML automatically creates a dashboard that shows some statistical results for it, for example the distribution of each feature, the mean, max, median, standard deviation, and so on. That being said, these are just some of the very high-level benefits of the Azure ML dataset concept, but again, check out the video I mentioned earlier. Quickly recapping on why we are registering it here.
So the dataset that I'm going to create is a tabular dataset, and I have already pointed it to the path where I uploaded the data in the default data store, and we should be good to go. Now I can just register it by saying: this data that I defined here, register it within the workspace that we defined, which is my Azure Machine Learning workspace, and this is the name that I'm going to give this dataset.
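A sketch of the registration step, assuming the datastore paths used above and placeholder dataset names:

```python
from azureml.core import Dataset

# Tabular datasets built from the Parquet files we uploaded
green_ds = Dataset.Tabular.from_parquet_files(path=[(default_store, "green/*.parquet")])
yellow_ds = Dataset.Tabular.from_parquet_files(path=[(default_store, "yellow/*.parquet")])

# Register them in the workspace so they are versioned and traceable
green_ds = green_ds.register(workspace=ws, name="green_taxi_data", create_new_version=True)
yellow_ds = yellow_ds.register(workspace=ws, name="yellow_taxi_data", create_new_version=True)
```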
Actually, before I go further, let me show you what we have done by registering this as a dataset, to make it more tangible. If I go to Datasets, you can see that the yellow taxi data and green taxi data have been created for me, with the date I was recording the video, who created them, and so on. And as I told you, if you click on Explore it will show you a dashboard with some information about the data; if I click on Profile, you can see the distribution here, the mean, max, count, and so on. Whatever I do on top of this data will be tracked and logged under the definition of this dataset, which makes it fairly easy for me to follow up and have traceability on what I'm doing in this workspace within my machine learning process. So, going back to the notebook, now you have a better understanding of what this dataset is exactly and why you would use and register it.
The next thing I'm going to do: for the pipeline that we're going to execute, we certainly need a compute engine, so here I am creating a CPU cluster. You can create one using the UI in the Compute section, as I showed you in the previous video, but here you can just use code. I'm telling it that if that compute target already exists, skip the creation; if not, create it. This is the VM size and type that I want, with a maximum of four nodes, created inside this workspace. This is my Azure ML compute cluster name, and these are the configurations I defined above. It should be fine; I think I've created this before. Yes, it found the existing cluster, so it's not creating a new one for me.
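The create-or-reuse logic looks roughly like this; the cluster name and VM size are placeholders:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"  # assumed name

try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, using it.")
except ComputeTargetException:
    # Otherwise provision a small CPU cluster with at most four nodes
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
    aml_compute = ComputeTarget.create(ws, cluster_name, config)
    aml_compute.wait_for_completion(show_output=True)
```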
Perfect. Now we've got the data, we registered the data as datasets, and we have a compute cluster to execute our code and pipelines later. So what is next? I am going to define my run configuration. This is the code... oh no, I pasted the same code again; let me copy it again from the other window and paste it here. All right.
So what is a run configuration? Well, for executing any code, even a single Python script, you certainly need the Python libraries and packages that you used in your preprocessing or training steps, right? In Azure ML you can define them basically as an image, like in Docker, using a dependencies YAML file, and then whenever you execute this code, at inference time or any time you run this pipeline again, the compute engine knows what operating system requirements and packages are needed to execute it. That's why, regardless of which computer I used to develop and test this code, when I pass this solution through a DevOps process, say to a staging or production environment, I wouldn't have concerns about it failing because of missing packages. So again, it is pretty much a must-do best practice to define the environment. In Azure ML there are some predefined environments that you can use, with packages already installed; here I want to manage my own environment by adding the libraries I want, so that's why I'm setting user-managed dependencies to false, and I should be fine. When I execute this code, it tells me that my run configuration is created. That's the environment I have defined for running the pipeline that I'm going to execute later.
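A minimal sketch of such a run configuration; the package list is an assumption, not the exact environment from the video:

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Run configuration shared by the pipeline steps
aml_run_config = RunConfiguration()
aml_run_config.target = aml_compute

# Let Azure ML manage the environment rather than a user-managed one
aml_run_config.environment.python.user_managed_dependencies = False

# Add whatever your scripts actually import (placeholder packages)
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["pandas", "scikit-learn"],
    pip_packages=["azureml-sdk", "pyarrow"],
)
```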
So what's next? Now I'm going to do some very high-level preprocessing on top of the data. Again, I'm going to show you which columns we have, and I just want to select the ones I want to use for this use case. And here you go, I have selected the columns that I want.
And then the fun part gets started, so let me copy the code. What I'm doing here: first I'm defining the path where my scripts live. You might ask, what are those scripts? Remember I just told you that we do not create and develop all our code, all the preprocessing steps, in just one single notebook. What I have done before is grab these scripts from the Azure ML GitHub samples. Let me find them: if I go to scripts, this path, prep data, you can see I have different Python scripts; one of them cleans the data, one of them filters the data, others merge, normalize, and transform it. I have each of them as a separate Python script, and now I'm going to create a pipeline that calls these steps in the sequence I define and orchestrates all of them end to end, so I can automate it. So what I'm doing here at a high level is defining where these folders and Python files are, which is here, and I'm renaming some of the columns, just a simple renaming, and replacing these values here.
And then I am defining a PipelineData object, and the reason is that I want to define the output after I execute the first Python script, which is clean.py; this is the first step that I'm going to execute in my Azure ML pipeline. There is definitely an outcome, clean data, from the execution of this file, so I need to define the output as PipelineData because later I want to pass the output of this file to the next step, which is another script that does the filtering. After defining that, I am defining a step for my Azure Machine Learning pipeline. What is this step? It's a PythonScriptStep. Why? Because I want to execute a Python script; here is its name, and I want to have the output as I defined on top. These are some of the arguments I want to pass to this Python file: the useful columns I defined above, which I'll show you, meaning the names of the columns, and the output that we have defined here.
Let me actually open this file to show you how this should be defined inside clean.py. If I open this Python script, you can see that we have defined a function here doing some pre-processing, and then you can see that I have defined some arguments for this script, which is exactly the same thing you just saw in my main code. So I'm showing you that, within the definition of a step for your Azure ML pipeline, you can define parameters in your script, clean.py or any other Python script you're going to execute, and from your pipeline step pass in data as the values of the arguments you have defined.
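Inside clean.py the argument handling would look roughly like this; the argument names and the input name are assumptions for illustration:

```python
# clean.py (sketch)
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser("cleanse")
parser.add_argument("--useful_columns", type=str, help="columns to keep")
parser.add_argument("--columns", type=str, help="column rename mapping")
parser.add_argument("--output_cleanse", type=str, help="folder for the cleansed output")
args = parser.parse_args()

# The registered dataset passed in by the pipeline step is available via the run context
run = Run.get_context()
raw_df = run.input_datasets["raw_data"].to_pandas_dataframe()

# ... cleansing logic ..., then write the result under args.output_cleanse
```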
Then I'm saying that these are the input datasets; for the green taxi data I'm going to execute this clean script on top. And just before I forget, there's a very interesting parameter, allow_reuse, which is set to true. That means: if I execute my whole pipeline again, of course the first time this step will run this Python script, because that's what we've defined here, but if the data hasn't changed and this step is going to generate the same outcome, why should I execute it again? So here I'm telling my Azure ML pipeline that if nothing has changed when I execute the whole pipeline again, and the outcome is going to be the same, just don't re-execute it; let's save some cost and compute, skip this step, and go to the next one. The next step can check the same thing, and so on.
So that's the first step of my Azure ML pipeline: it simply calls this Python script, pushes some arguments, and defines the input and output. This is the compute target that I want to use to execute this step. What does that mean? It means that for each step of your Azure ML pipeline you can use a different compute type: for preprocessing I can use a CPU machine, which is the one we defined on top, but for training I could use, say, a GPU machine, another type of compute. That's how I can granularly scale each step of my Azure ML pipeline over different computes and different environments. And here I'm saying runconfig equals the ML run config we defined on top, if you remember; I could define different packages for, say, filter.py or merge.py with a different run config environment, so I have full flexibility on that.
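Putting that together, the first step might be defined like this; useful_columns and green_columns stand in for the column lists defined earlier in the notebook, and the folder name is an assumption:

```python
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Intermediate output of the cleansing step, handed to the next step
cleansed_green_data = PipelineData("cleansed_green_data", datastore=default_store)

cleanse_green_step = PythonScriptStep(
    name="cleanse_green_taxi_data",
    script_name="clean.py",
    source_directory="scripts/prepdata",            # assumed script folder
    arguments=["--useful_columns", useful_columns,   # placeholders defined earlier
               "--columns", green_columns,
               "--output_cleanse", cleansed_green_data],
    inputs=[green_ds.as_named_input("raw_data")],
    outputs=[cleansed_green_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    allow_reuse=True,   # skip re-running when code and inputs are unchanged
)
```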
I know I explained a lot here because it was very important, but the rest should go very fast. Very similarly, I'm doing the same thing again, this time for the yellow taxi data instead of the green one. Almost everything is exactly the same; only the DataFrame is different. Did I execute this one? No, how did I forget? Okay, this one is done. By the way, when I execute this cell it doesn't mean I'm running the Python script, not at all; I am just defining the steps of my Azure ML pipeline, and we haven't finished the pipeline definition yet.
So let's create another cell. Let's say that after cleaning these data, I now want to merge the green taxi data with the yellow taxi data; that's the next step I'm going to have in my pipeline. Here is the code. The same thing: I am defining a PipelineData object because I want the output of merge.py to be passed on to the next step. I'm saying that this is my Python script, under the same path I defined on top; I am passing some arguments, which include my merged data, sorry, as an output, and these are my inputs, which are the cleansed outputs from the previous steps. The same things I already explained, so again it's just a different step of the same type.
The next one is filter.py, which is here; you might guess that we need to use the same PythonScriptStep type, because yes, we are going to call another Python script in our pipeline. The same thing again: defining the output as PipelineData, these are my inputs and the arguments this script needs, so I'm passing them through here, allow_reuse is true, and execute. Again, I'm not running any code, I'm just defining my steps. The next one, pretty fast, is normalize: I'm going to call this Python script and define the outcome as PipelineData, I can just give it a name, and the rest is very similar; only the inputs and outputs are slightly different because they're coming from previous steps. And after normalizing the data, I have another script that transforms the data, which is transform.py, again a PythonScriptStep. Executing, done.
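The merge step, as a sketch, just consumes the two cleansed outputs (cleansed_yellow_data being the yellow counterpart defined the same way); filter, normalize and transform follow the same pattern:

```python
# Output of the merge step, passed on to filtering
merged_data = PipelineData("merged_data", datastore=default_store)

merge_step = PythonScriptStep(
    name="merge_taxi_data",
    script_name="merge.py",
    source_directory="scripts/prepdata",
    arguments=["--output_merge", merged_data],
    inputs=[cleansed_green_data, cleansed_yellow_data],  # outputs of the two cleansing steps
    outputs=[merged_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    allow_reuse=True,
)
# filter.py, normalize.py and transform.py are wired up the same way,
# each taking the previous step's PipelineData as its input.
```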
Now it's time to call another step; let me paste the code here. You can see that this time I am again calling another Python script as a step, but the location is different: it is under scripts, here, train model, which is here, and I have a script that splits the data into train and test. Even this I have defined as a separate script to call as a pipeline step. The rest is exactly the same, but here I have defined two PipelineData objects, because this script has two outcomes: one is the training data and the other is the test data, and I define them as datasets, and I told you why that's the best practice here.
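A sketch of that split step with two outputs; transformed_data stands for the transform step's output, and the names and folder are assumptions:

```python
# Two intermediate outputs, promoted to datasets so they can feed AutoML later
output_split_train = PipelineData("output_split_train", datastore=default_store).as_dataset()
output_split_test = PipelineData("output_split_test", datastore=default_store).as_dataset()

train_test_split_step = PythonScriptStep(
    name="train_test_split",
    script_name="train_test_split.py",
    source_directory="scripts/trainmodel",           # assumed folder
    arguments=["--output_split_train", output_split_train,
               "--output_split_test", output_split_test],
    inputs=[transformed_data],                        # output of the transform step
    outputs=[output_split_train, output_split_test],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    allow_reuse=True,
)
```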
So let me run the code, and as you might guess, it's potentially time to train the model; we're almost done with all the pre-processing steps. So what are we going to do? Any use case you're developing in Azure ML, any code you're executing for a certain use case, you can put under just one experiment and give it a name. Within that, you can isolate the different use cases you're working on and have better traceability. That's why I'm first creating an experiment and giving it a name, just anything as an example, because it's a pipeline I'm going to create.
And now this is the section where I'm going to train my model, but I'm using AutoML. Remember, I talked about AutoML in the first video of the video series; you don't need to go through the UI to generate one, you can do it in code as well. You just import AutoMLConfig and say: hey, I'm going to use AutoML for creating a regression model, with this number of iterations and this experiment timeout; my primary metric, to figure out which is the best model out of all the AutoML-generated models, is for example Spearman correlation, which can be changed to whichever metric you want depending on regression or classification; and the number of cross validations is here. Here I have defined the columns that I want to have, I'm saying this is a regression task, I'm telling AutoML to use this compute and to do the featurization automatically if needed, and this is what I'm going to predict: the cost of the journey. Actually, let me execute it and explain the step. Okay, there is something wrong here; I'm sorry, I forgot to copy part of the cell. Okay, so now we have created the configuration needed for AutoML.
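Together with the experiment, that configuration might look like this; the experiment name, iteration count, timeout and label column are placeholders:

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig

# One experiment to group every run of this use case
experiment = Experiment(ws, "nyc-taxi-regression-pipeline")

automl_config = AutoMLConfig(
    task="regression",
    iterations=25,
    experiment_timeout_hours=0.5,
    primary_metric="spearman_correlation",
    n_cross_validations=5,
    training_data=output_split_train.parse_parquet_files(),  # training split from the previous step
    label_column_name="cost",        # assumed name of the fare/cost column
    compute_target=aml_compute,
    featurization="auto",
)
```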
But the question is, how can we add this training as a step to our AML pipeline? Our pipeline now has all these steps, but not this one yet, sorry, and this one as well, but not the AutoML one yet. That's what I'm going to do now: instead of a PythonScriptStep, I am defining a step of a different type, an AutoMLStep. This is a step type defined in Azure ML to call an AutoML training run and execute the training there to come up with the best model. I'm giving it a name, and for the configuration of this AutoML step I'm referring to what I created on top, plus the same allow_reuse parameter here. Done, so now the step is created.
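As a sketch, the AutoML training step just wraps that configuration:

```python
from azureml.pipeline.steps import AutoMLStep

# Train with AutoML as a pipeline step, reusing the config defined above
train_step = AutoMLStep(
    name="automl_train_best_model",
    automl_config=automl_config,
    allow_reuse=True,
)
```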
Now what am I going to do? We're all done with our steps, so now we need to take those steps we created before, create a pipeline, and push those steps into it. With that, we will see the pipeline get built; again, nothing has been executed yet. When I do submit, and what I'm submitting to the experiment we defined on top is this pipeline with all these steps, that's the moment the cool part gets started and all the steps we defined on top will be executed in the sequence we have defined. So I run the cell, and let's see what happens. The pipeline is built, perfect. Now it should be executed with experiment.submit: pipeline submitted for execution.
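Building and submitting the pipeline boils down to something like this; the step variables are the ones sketched above (the filter, normalize and transform steps follow the same pattern):

```python
from azureml.pipeline.core import Pipeline

pipeline_steps = [cleanse_green_step, cleanse_yellow_step, merge_step,
                  filter_step, normalize_step, transform_step,
                  train_test_split_step, train_step]

# Assemble the graph; dependencies come from the PipelineData inputs/outputs
pipeline = Pipeline(workspace=ws, steps=pipeline_steps)
pipeline.validate()

# Submitting to the experiment actually kicks off the run
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```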
Now I can see the details of the pipeline, because any pipeline, depending on the workload, the compute engine, and the type of code, can take a while. I can actually see the details of the runs and the pipeline inside. You can see that it is beautifully showing all the steps we have defined: filter, normalize, transform, train/test split, which have not started yet, and AutoML; this one is from a previous run that I did to test before recording. And here, you can see it is beautifully showing me all the steps, automatically visualizing how I get the data, merge it, filter it, normalize it, transform it, all the way to the AutoML step, none of which have started yet, I think. If it's the first time you're doing this, it's going to take a little longer, because the compute cluster we defined on top is not running yet, so the first time it will spin up the nodes and then execute all the steps. I think I did this before, so I should be able to show you the outcome without waiting here. If I go to the top of the previous step, it gives me a link to that Azure Machine Learning pipeline run, and here you can see that it has beautifully created my pipeline and visualized it so I can track what is going on, what is happening; you can see all the steps completely, from when I tested the code to automatically running the pipeline.
And the cool thing here is that I can publish this pipeline. What does that mean? Next time, when I want to automatically retrain my model or trigger all these steps again, I don't need to open these notebooks and execute them again. If I publish this, it will give me an endpoint, and with an API call I can call that endpoint, which will trigger all these steps. If you click on Publish, it will do so; I already did one, so I can show you what's going to happen when you click on Publish. If I go to Pipelines, Pipeline endpoints, you can see there is a test one that I created as a published pipeline, and if I click on it, it gives me on the right side a REST endpoint that I can use to call and trigger all of this process again.
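Publishing and then calling the REST endpoint looks roughly like this; the pipeline name and experiment name are placeholders:

```python
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Publish the finished run as a reusable pipeline with its own REST endpoint
published_pipeline = pipeline_run.publish_pipeline(
    name="nyc-taxi-training-pipeline",
    description="Cleanse, merge, transform and AutoML-train the NYC taxi model",
    version="1.0",
)

# Trigger it from anywhere that can reach Azure, e.g. an Azure Function or your laptop
auth_header = InteractiveLoginAuthentication().get_authentication_header()
response = requests.post(
    published_pipeline.endpoint,
    headers=auth_header,
    json={"ExperimentName": "nyc-taxi-regression-pipeline"},
)
print(response.json().get("Id"))  # run id of the triggered pipeline run
```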
So the nice thing here is that with this endpoint, wherever you are, whether in an Azure Function or on your local machine, as long as that source can access the internet, you can just trigger this pipeline through the endpoint and you're good to go. But you can also schedule this pipeline, triggering it on a time-based schedule or a change-based schedule. To show you an example from the Azure ML documentation: here I can create a time-based schedule to trigger this pipeline, say every 15 minutes, based on this recurrence, the pipeline ID, which we can get from the Azure ML portal, and the experiment name that we defined on top, and then you're good to go. You're basically saying that all these steps should be triggered automatically every 15 minutes. Or, instead of time, based on a change: if, within a specific data store that we defined and within a given path, the input data has changed, say new data suddenly got ingested from a source, and now we want to do all that pre-processing and train a new model using AutoML because we just got new data, then as soon as there's a change in the path you have defined in your blob, for example, this pipeline will be triggered and all the steps executed again.
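Both flavours of schedule can be created from the SDK; the schedule names and the watched path here are assumptions:

```python
from azureml.pipeline.core import Schedule, ScheduleRecurrence

# Time-based: trigger the published pipeline every 15 minutes
recurrence = ScheduleRecurrence(frequency="Minute", interval=15)
Schedule.create(ws,
                name="nyc-taxi-every-15-min",
                pipeline_id=published_pipeline.id,
                experiment_name="nyc-taxi-regression-pipeline",
                recurrence=recurrence)

# Change-based: trigger whenever new data lands under a datastore path
Schedule.create(ws,
                name="nyc-taxi-on-new-data",
                pipeline_id=published_pipeline.id,
                experiment_name="nyc-taxi-regression-pipeline",
                datastore=default_store,
                path_on_datastore="green")  # assumed path to watch
```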
And the very last thing, I would say last but not least, is this question: okay, we talked about PythonScriptStep, which is a step type for calling different Python scripts, and we also had the AutoMLStep, which we were able to configure and use to execute an automated machine learning step within our pipeline. But is that all, in terms of the different types of steps we can define for our pipeline? The answer is no. If I go again to the Azure ML documentation, here are some examples of the step types you can define in an Azure ML pipeline; it is much more powerful than just calling a Python script or doing AutoML. You can see here that I can do a lot of different things as a step in my Azure ML pipeline. For example, one of my favorites is the DatabricksStep within your AML pipeline: let's say in this section, which is normalizing the data, I want to execute this step in a Databricks workspace, and I have a notebook in Databricks there that I want to run. By defining this step as a DatabricksStep, when the Azure ML pipeline reaches this step, it will trigger that Databricks notebook on, say, a Databricks cluster that is using Spark, and then it will execute the rest of the steps on, for example, Azure ML or whatever compute targets you have defined. So, just to tell you: check out the documentation here; based on your use case and the specific project you're working on, you can leverage all the different steps here. Another example is the ParallelRunStep, which allows you to parallelize your code execution over multiple nodes, which will for sure speed up the process and efficiency.
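As one illustration, a DatabricksStep might be declared like this; the attached Databricks compute, cluster id and notebook path are all placeholders:

```python
from azureml.pipeline.steps import DatabricksStep

normalize_on_databricks = DatabricksStep(
    name="normalize_on_databricks",
    compute_target=databricks_compute,            # DatabricksCompute attached to the workspace (assumed)
    existing_cluster_id="1234-567890-abcde123",   # placeholder Databricks cluster id
    notebook_path="/Shared/normalize_taxi_data",  # placeholder notebook path
    run_name="normalize_taxi_data",
    allow_reuse=True,
)
```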
All right, we just went through an end-to-end example of how we can utilize Azure Machine Learning pipelines to manage and orchestrate your ML steps, and now you can certainly think about how this tool can improve your ML journey.
Is this... well, you gave me an ML pipeline, and I'm giving you a pipeline to drink your water.
Taste it.
This is just water.
I know, but it's all about interpretation and explanations.
Yes. Wasn't that about our previous video, about the Responsible AI dashboard, that we recorded last time?
Can we clean the floor this time?
Sorry, I have to edit the videos. No.