YouTube Transcript:
Optimize Tensorflow Pipeline Performance: prefetch & cache | Deep Learning Tutorial 45 (Tensorflow)

Summary

Core Theme

This content explains how to optimize TensorFlow input data pipelines using prefetch and cache to improve training performance by enabling parallel CPU and GPU operations and avoiding redundant computations across epochs.

in the last video we looked at how you can

build a tensorflow input data pipeline

using the tf.data.Dataset

class in this video we will look into

how you can optimize the performance of

that

input pipeline using prefetch and

caching we'll just go over some theory

first

and then we'll write code alright so

let's get started

[Music]

what we discussed in the last video was

this usually when you have a small set of images

you load those images into ram from your

hard disk

in a numpy array or pandas data frame and you

can train your model easily

but when you have let's say 10 million

images you know when your data set is so

big

your computer might give you an

exception like this too much data i

can't handle it

therefore tf.data.Dataset is quite

useful

because it can load this data in

batches

and then train the model on those

batches

one by one okay so my batch one batch

two

and so on now if you look at the same

exact picture

in a cpu gpu kind of time view

where gpus are mainly used for the

training so when you're doing

forward pass backward pass doing all

those matrix manipulations

those are happening on your gpu so let's

say

loading my first batch so the reading from

disk is being done by the cpu so the cpu is

reading all these images

into my memory it takes let's say three

seconds

then it gives those images to gpu for

the training and this is batch one

similarly batch two takes the same time and

if i plot

a time view of this whole operation

you will get a graph like this here

first

you know it took three seconds to read

the first batch

during that time the gpu was sitting idle

by the way it was not doing anything

then it took two seconds to train on it

then you

read the second batch so the cpu is reading the

second batch

and it takes three seconds overall let's

say if there are three batches it will

take 15 seconds

but we can optimize the performance of

our data pipeline

by doing this so assume

that gpu is processing your batch

one and during that time

what if the cpu reads batch two

so both of these units are working in

parallel

when the gpu is training on my batch two

the cpu is getting the next batch

ready so every iteration

my cpu is keeping the next batch ready

okay
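The timing argument here can be checked with a little arithmetic. The sketch below is a simplified model, not TensorFlow code: the function names are made up for illustration, and it assumes every batch takes the same 3 seconds to read and 2 seconds to train, as in the example.

```python
def sequential_time(num_batches, read_s, train_s):
    """Total time when reading and training never overlap."""
    return num_batches * (read_s + train_s)


def overlapped_time(num_batches, read_s, train_s):
    """Total time when the CPU reads batch i+1 while the GPU trains on batch i."""
    # The first read and the last training step cannot overlap with anything.
    # In between, each step costs whichever of the two operations is slower.
    return read_s + (num_batches - 1) * max(read_s, train_s) + train_s


print(sequential_time(3, 3, 2))  # 15
print(overlapped_time(3, 3, 2))  # 11
```

With three batches the step-by-step schedule takes 15 seconds and the overlapped schedule takes 11 seconds, because only the first read and the last training step are left without a partner.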

and this approach will take you overall

i think 11 seconds so just compare 11

seconds versus 15 seconds so you

just saved time in your training this

can be done by the

prefetch api so all you have to do is

here

tf.data.Dataset.prefetch(1) where 1 means

how many batches i want to prefetch

so when i say 1 when the gpu is

training on my batch one it will pull one

extra batch into memory if i say

prefetch two

at this point it will prefetch

batch two and batch three okay normally

you will supply the

autotune argument so you will let the

tensorflow framework decide for itself

how many batches it wants to load in

advance

okay so you will see people using

autotune very often and if you look

at the whole pipeline you know

and this is what we looked into in our

last video as well so if you have not

seen last video

i highly suggest you guys all watch that

last video

so here you will see that in the majority of

the tensorflow code bases

you will see prefetch being used

at the very end so you are forming your

complete pipeline you are saying map

filter map whatever

and in the end you will do prefetch so

that

both gpu and cpu can work in parallel

you want to

make optimal use of your hardware

resources

and prefetch allows you to do this
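A pipeline of this shape might look like the sketch below. The toy numbers and the particular map and filter steps are placeholders chosen for illustration, not the pipeline from the previous video; the point is only that prefetch comes last.

```python
import tensorflow as tf

# Toy data standing in for images read from disk.
dataset = tf.data.Dataset.range(10)

dataset = (
    dataset
    .map(lambda x: x * 2)                  # some transformation
    .filter(lambda x: tf.equal(x % 3, 0))  # keep only some elements
    .batch(2)                              # group elements into batches
    .prefetch(tf.data.AUTOTUNE)            # last step: prepare upcoming batches in the background
)

batches = [batch.tolist() for batch in dataset.as_numpy_iterator()]
print(batches)  # [[0, 6], [12, 18]]
```

Replacing `tf.data.AUTOTUNE` with a number such as 1 or 2 fixes how many batches are prepared ahead of time instead of letting TensorFlow decide.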

now we are talking about this map

operation

so just think about this you are reading

all these images

you are converting them into numpy array

then you are doing filtering you are

doing mapping you're doing so much

processing okay

and when you're running deep learning

training

for multiple epochs you are doing the

same

operation multiple times so remember

one epoch means let's say if i have 10

million images

and perform the training on all of them that is one epoch

in the second epoch i will repeat the same

thing i will again go over those 10

million images

and i will be doing all these operations

map filter map

so do you see some redundancy here

you're doing

you're reading the same files and then

mapping and filtering again

so this issue can be addressed by cache

function

so this is a pictorial representation of

you know if you're not using any caching

what will happen is

you will see here x axis is the time

okay so you are spending some time

opening the file

some time reading it then mapping

filtering doing all this

transformation then you're training

again you are reading it mapping again

training okay

so up till now up till this vertical

arrow is good but then when the second

epoch starts

again you are opening the same set of

files so let's say you are

training some kind of text model where

you're opening one single file which is

huge which has let's say 10 million

lines in it

then you have to open the file then

you're reading the file in chunks

so let's say you read first 10 000 lines

then you do mapping you do some

transformation then you do training

then again you read next set of 10 000

lines

and so on and when the first epoch is

over

in the second epoch you again open the

same file same 10 million line file

then you take the first 10 000 line

chunk then you do all the transformation

training again next set of 10 000 lines

and so on

so you see some redundancy so these

operations open read map are redundant

now this is okay if you have a memory

problem but let's say if you can fit

something into memory

then you can use a cache operation and

what you can do is now watch carefully

okay

now

look at this particular image

you see so when i do

tf.data.Dataset.cache

what it is going to do is it will

do all this

open read map in the first epoch

but for the second epoch see this is the

second epoch okay

this one it will just

train the model so you are saving your

time in opening and

reading the file all right let's get

into coding now

here is the tensorflow official

documentation where they have explained

how

you can get better performance by using

prefetch

cache etc and in this example

they have created this artificial dummy

data set where you can mimic the

latencies

in opening the files reading the files

etc

so we're going to use the same example

here and i have a

jupyter notebook here and here

the tensorflow version is 2.5.0 which is

the latest as of this recording so make

sure you have a recent version because

some older versions have

backward incompatible

apis now i have modified

this example a little bit just to make

it a little simpler

so what we do is we are going to create

a class

with tf.data.Dataset as a base class

okay so when you supply this

as the argument it will derive

this file dataset class from this

dataset tf.data.Dataset

and again to remind you what we are

doing here

is we are measuring the performance

we will see how using prefetch you can

optimize the use of cpu and gpu and you

can get a better training performance

and to mimic the real life

you know latencies in reading files or

reading objects from the storage

we are creating this dummy class okay so

the purpose of this dummy class is to

mimic the real world scenario let's say

you are reading files from the disk

okay and i will say okay reading files

in batches

and here you supply number of samples

that you want to read

so when you read the file first thing is

open file okay so let's say open file is

taking

you know some time so i'm just mimicking

i'm just putting a dummy time.sleep

just to mimic the delay in opening the

file

then you start reading let's say a few

lines

chunk by chunk so let's say you have

a million lines in your file

you want to read first ten thousand

lines and so on

so i will say for sample

index in range

so i have in total let's say three samples i'm

just reading let's say i'm reading

each line one by one and the delay

to read each line is let's say

this much you know 0.015

second and

you are returning that particular sample

index

here again this is a dummy class okay in

real life you will be reading the file

you will be returning each line but here

i'm interested only in measuring

performance and not the actual content

yield

makes this a generator so if you're not aware

of generators

in python go check out my generator

video so in youtube you can search codebasics

python generator you'll get a fairly

good understanding

of what generator is then

we'll override the __new__ method so what this

__new__ method will do

is this

let me just show you __new__ called

okay so whenever you create

an object of this file data set

__new__ takes one positional argument okay

here you have to supply the class

see __new__ called so whenever

you create an object of this class this

particular __new__ method is called

okay and in this one what i want to do

is i want to do this

tf.data.Dataset

dot from_generator so in dataset there

is a method called

from_generator where

you can say okay class dot

so this class is the class reference and

that has this method so

this is your generator and

then the output signature

the output signature is like what does it return

well it returns an integer see

tuple of integer comma nothing so see

tuple so tensor specification

you will say integer 64 that's what it

returns

and the third argument is args is equal

to number of

samples so this is the argument

number of samples you supply into this

function okay

so don't worry about this too much if it

looks complex as we move ahead in the

code you will understand it better
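Put together, the dummy class described here could look roughly like the sketch below, which follows the artificial dataset from the TensorFlow performance guide. The spellings `FileDataset` and `_read_file_in_batches` are guesses at the names used on screen, and the 0.03 second delay for opening the file is an assumed value, since only the 0.015 second per-line delay is stated.

```python
import time

import tensorflow as tf


class FileDataset(tf.data.Dataset):
    """Dummy dataset that mimics the delays of opening and reading a file."""

    def _read_file_in_batches(num_samples):
        # Mimic the delay of opening the file (assumed value).
        time.sleep(0.03)
        for sample_idx in range(num_samples):
            # Mimic the delay of reading one line from the file.
            time.sleep(0.015)
            # Only the index is returned; the content does not matter here.
            yield (sample_idx,)

    def __new__(cls, num_samples=3):
        # Creating FileDataset(...) returns a dataset built from the generator.
        return tf.data.Dataset.from_generator(
            cls._read_file_in_batches,
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64),
            args=(num_samples,),
        )


samples = [int(sample[0]) for sample in FileDataset(num_samples=3).as_numpy_iterator()]
print(samples)  # [0, 1, 2]
```

Because `__new__` returns the result of `from_generator`, the object you get back is an ordinary tf.data dataset, which is why methods such as `prefetch`, `map` and `cache` can be called on it.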

okay now what happens is

um typically when you have a training

function

okay let's say whenever you have

training function

you will have

number of epochs let's say number of epochs

is two by

default okay so in a usual training loop

what you do

is for epoch num

in number of epochs you go through

each epoch and you will go through each

sample in a data set

okay and you will perform a training so

let's say the training time is

0.01

so this training time this time dot

sleep

okay is basically let me show you

here is basically this part

this time this yellow time slot

okay

and this particular time which is

reading

the file lines

is this and this diagram doesn't have

this particular

time but if you want to look at this

time to read the file

it is in the other diagram which is this

see this blue time slot

okay so i hope that part is clear

so now you're training and

i'm just introducing artificial delay

here so here

i'll just call this function benchmark

actually because we are benchmarking

everything

okay and now

file data set

is an object so when i do this it

creates an object of this dataset class

file dataset class and i want to

benchmark this

okay and the way you benchmark it is

by putting this timeit

line magic or cell magic okay

all right let's see number of samples is

not defined so

where is it not defined

let's see where is my number of samples

i think it's complaining about this

particular class not having

this particular method not having that

argument so by default let's say number of

samples is three

so i fix that value here okay getting

another error values

must be a signature okay i need to

return this actually

because when you do __new__ you're returning

the new object

still getting an error okay values must

be a signature what is it

okay here i need to pass a tuple so

maybe that's the problem let's see

now integer object is not iterable for

epoch number

in number of epochs the number of epochs

it has to be a range actually

ah getting so many errors today

all right it's gonna work this time

amazing
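With the fixes just made (the default number of samples, returning from `__new__`, the tuple for `args`, and `range` in the loop), the benchmark could look like the sketch below. It is an approximation of the notebook rather than a copy: the names are guessed, the file-open delay is assumed, and the function returns the elapsed time itself instead of relying on the timeit magic so that it also runs outside Jupyter.

```python
import time

import tensorflow as tf


class FileDataset(tf.data.Dataset):
    def _read_file_in_batches(num_samples):
        time.sleep(0.03)  # assumed delay for opening the file
        for sample_idx in range(num_samples):
            time.sleep(0.015)  # delay for reading one line
            yield (sample_idx,)

    def __new__(cls, num_samples=3):
        return tf.data.Dataset.from_generator(
            cls._read_file_in_batches,
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64),
            args=(num_samples,),  # must be a tuple
        )


def benchmark(dataset, num_epochs=2):
    """Loop over the dataset like a training loop and return the elapsed seconds."""
    start = time.perf_counter()
    for epoch_num in range(num_epochs):  # range(...), because an int is not iterable
        for sample in dataset:
            time.sleep(0.01)  # mimic one training step
    return time.perf_counter() - start


elapsed = benchmark(FileDataset())
print(f"{elapsed * 1000:.0f} ms")
```

The exact number depends on the machine, so no particular timing should be expected; the sleeps alone account for a bit over 200 milliseconds across the two epochs.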

so now it's benchmarking the performance

of file data set

as is and let's say this is 321 milliseconds

so what just happened is this

so you read those files in batches so

while the cpu was reading your gpu was

waiting

so you did everything sequentially so

the performance was

not that great okay now we are

going to use this

prefetch api and we'll see how that

improves the performance

so just copy paste the same thing here

and just append prefetch to this

and prefetch i'll say prefetch one batch

or one sample

now why can i call prefetch because

file data set is derived from

tf.data.dataset

and this has that prefetch method hence

i can call it from

a child class as well and when you

measure the performance

you see the improvement 253 milliseconds

close to 70 millisecond difference you

see here and if you run it for

more epochs you will see more difference

and the popular argument to prefetch is

autotune

so people usually supply the autotune

argument

actually it's tf.data.AUTOTUNE

okay and that will give you

similar performance but this

autotune will

figure out on its own how many batches

it wants to

prefetch while your gpu is training okay

so i hope

this is clear if you have any questions

you know please post in a video comment

below
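The three runs compared here (no prefetch, prefetch of one sample, and autotune) can be reproduced with a sketch like the following. As before, the class and method names are guesses, the file-open delay is assumed, and elapsed time is measured with `time.perf_counter` instead of the notebook's timeit magic.

```python
import time

import tensorflow as tf


class FileDataset(tf.data.Dataset):
    def _read_file_in_batches(num_samples):
        time.sleep(0.03)  # assumed delay for opening the file
        for sample_idx in range(num_samples):
            time.sleep(0.015)  # delay for reading one line
            yield (sample_idx,)

    def __new__(cls, num_samples=3):
        return tf.data.Dataset.from_generator(
            cls._read_file_in_batches,
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64),
            args=(num_samples,),
        )


def benchmark(dataset, num_epochs=2):
    start = time.perf_counter()
    for epoch_num in range(num_epochs):
        for sample in dataset:
            time.sleep(0.01)  # mimic one training step
    return time.perf_counter() - start


timings = {
    "no prefetch": benchmark(FileDataset()),
    "prefetch(1)": benchmark(FileDataset().prefetch(1)),
    "prefetch(AUTOTUNE)": benchmark(FileDataset().prefetch(tf.data.AUTOTUNE)),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.0f} ms")

# Prefetching changes only the timing, not the data that comes out.
prefetched = [int(s[0]) for s in FileDataset().prefetch(1).as_numpy_iterator()]
```

The absolute numbers will differ from machine to machine; what should carry over is that the prefetching runs finish sooner than the plain one, because reading the next sample overlaps with the simulated training step.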

but the idea is very very simple we are

just implementing

this particular diagram that you're

seeing here so previously

like in this line the operations were

happening in this order you know

step by step so cpu and gpu were not

utilized to their optimal level but then

by doing prefetch while the gpu is training

you are using the cpu to prefetch your

next batch

and since we have these artificial

delays introduced here

you can kind of compare the performance

of two apis

if you prefetch let's say two or three

samples performance is not gonna

change that much okay but

the majority of the time people use this

autotune so in our future deep learning

tutorials you will see

us using prefetch a lot okay now let's

talk about

the cache api okay so cache all right

what is cache so let's read the

documentation

cache api

caching here

so here i am reading some documentation

for

the cache api so cache

what it will do is i think we covered

this in the presentation as well

where if you're reading the file and

opening it and mapping it and if

you're running it across multiple epochs

see for the second epoch you don't need

to do all these operations so when you do

dot

cache you don't see these

blue and purple blocks here so you're

saving all that time

so here we are just going to use

official tensorflow documentation and

we'll

implement that so let's say you are

creating

a new data set here

okay and the data set is nothing but

just a bunch of numbers and then

let me do for

d in data set

okay print d dot numpy

see 0 to 4 numbers

and now let's say i want to compute the

square of that so how do you do that you

can do

map and you can say lambda x

and return x square correct

and that is my data set

and again if you print this we have

covered all this in previous videos so

should be pretty straightforward you're

just transforming it and you are just

computing a square of each of these

numbers

now if you do cache see

if i'm running multiple epochs on this

data set

then it will have to do this mapping

multiple times but

if i just do cache so if i do data set

is equal to data set

dot cache if i just do that

and now when i

iterate through that so see

you can iterate through this data set

using this particular iterator

see you can do this okay i think you

might know about this so if you do

this let me just quickly show you

so when you're doing this and the other

way of doing the same thing would be

if you just put it in a list you can do

the same thing in one line

so now when you do cache

it is reading this data from that cache

when i

execute it a second time it is reading

it from the cache

you know so it is not

it is not executing this map function

again
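The small experiment just described mirrors the example in the `cache` documentation and can be written out as below. The variable names `first_pass` and `second_pass` are only for illustration.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)       # the numbers 0 to 4
dataset = dataset.map(lambda x: x ** 2)  # square each number
dataset = dataset.cache()                # remember the mapped values

# The first full pass runs range and map and fills the cache.
first_pass = [int(v) for v in dataset.as_numpy_iterator()]
# Later passes read the stored values instead of running map again.
second_pass = [int(v) for v in dataset.as_numpy_iterator()]

print(first_pass)   # [0, 1, 4, 9, 16]
print(second_pass)  # [0, 1, 4, 9, 16]
```

The output is the same on both passes; the difference is only in how much work is done to produce it, which matters once the map step is expensive.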

if you had not put this in

cache then every time you do this it

will be

computing this map function again so

that's the benefit now let's apply

this map function to our original

file data set

this guy here okay so

first i'm going to create some dummy map

so i will create a dummy map function

again with some type of

time delay let's say time dot sleep

0.03 now if you're using this in a

tensorflow

map api you see eventually my goal

is to use this in i want to create an

object of file data set

and then i want to use this map you know

this map function

but when you pass this here

what happens is let me run it

you get some error because this

function

needs some special processing so you

need to wrap this

in tf dot py_function

and say lambda x

or lambda even if you don't supply x

okay so

you're supplying you are

saying okay this is the sleep and then

these are the arguments

okay

and then you are returning that

so the same thing as it is so the whole

purpose is basically

if you don't want to worry too much

about it is see it's introducing some

kind of delay

okay so when you do this

see now this is working now we will

benchmark this function
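One way to write such a delayed map function is sketched below. It is a variation on what is described, not a copy: the element is passed through `tf.py_function` and returned from it, which makes it explicit that the Python code runs once per element. The helper name `slow_identity` is made up, and `Dataset.range(3)` stands in for the file dataset.

```python
import time

import tensorflow as tf


def slow_identity(x):
    # Plain Python code: mimic 0.03 s of preprocessing, then return the input unchanged.
    time.sleep(0.03)
    return x


def mapped_function(x):
    # Dataset.map traces its function into a graph, so Python-only code such as
    # time.sleep has to be wrapped in tf.py_function to run for every element.
    return tf.py_function(slow_identity, [x], tf.int64)


dataset = tf.data.Dataset.range(3).map(mapped_function)

start = time.perf_counter()
values = [int(v) for v in dataset.as_numpy_iterator()]
elapsed = time.perf_counter() - start

print(values)  # [0, 1, 2]
print(f"{elapsed * 1000:.0f} ms")
```

The values come out unchanged, and the pass takes at least three times the 0.03 second delay, which shows that the sleep really runs for each element.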

we'll benchmark it let's say run this

for five epochs okay

and i want to time it so this will

measure the time of this particular cell

the whole cell

and -n1 -r1 is just run the

one loop basically

okay so file data set dot

map let's see what is wrong here

benchmark is not defined

okay i have a typo here

so 1.27 seconds that's what you see here

and now we'll see how performance can be

improved

by using cache so i'm copy pasting the same

code okay

but after map i'm doing cache

and when you do that see it takes half the

time

actually it's less than half the

time
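The effect of adding cache after map can also be seen by counting how often the map function runs, instead of timing it. The sketch below does that; the counter `map_calls` and the helper `count_map_calls` are additions for illustration only, and `Dataset.range(3)` stands in for the three-sample file dataset.

```python
import time

import tensorflow as tf

map_calls = {"count": 0}  # hypothetical counter, added only for illustration


def slow_identity(x):
    map_calls["count"] += 1
    time.sleep(0.03)  # mimic expensive preprocessing
    return x


def mapped_function(x):
    return tf.py_function(slow_identity, [x], tf.int64)


def benchmark(dataset, num_epochs=5):
    for epoch_num in range(num_epochs):
        for sample in dataset:
            time.sleep(0.01)  # mimic one training step


def count_map_calls(dataset):
    """Run five epochs over the dataset and report how often the map function ran."""
    map_calls["count"] = 0
    benchmark(dataset)
    return map_calls["count"]


source = tf.data.Dataset.range(3)  # stand-in for the three-sample file dataset

uncached_calls = count_map_calls(source.map(mapped_function))
cached_calls = count_map_calls(source.map(mapped_function).cache())

print(uncached_calls)  # 15
print(cached_calls)    # 3
```

Without cache the map function runs for every sample in every epoch, 15 times in total; with cache it runs only during the first epoch, 3 times, and the remaining four epochs are served from the cache.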

you know because what this cache would

have done is see i'm running it for five

epochs right

so in the first epoch

okay when i call the map function it will

introduce a delay but the second time

the data is cached so on the

second third

fourth and fifth epoch it is not calling

this map function

it is using the mapped data from the cache

itself all right so i hope this gave you

some idea on prefetch and cache prefetch

and cache are

something we'll be using in our future

videos for

training tensorflow models using

tf.data.dataset

if you need more information i'm going

to provide links to

all these awesome tensorflow

documentation pages so go check out the

video description

and also the link of this notebook is in

video description

please practice this code practice makes

a man

or woman perfect friends so you've got

to practice this so whatever

code we went through just practice it type it

try to change all these parameters try

to get a sense and digest

what you learned today and if you like

this video please give it a thumbs up

your single thumbs up is like the

fees of

this free class okay so don't forget to

give that and

share it with your friends that's also

important all right

thank you very much for watching bye

https://www.youtube.com/watch?v=UF8uR6Z6KLc
