0:00 last video we looked at how you can
0:02 build a tensorflow input data pipeline
0:04 using tf.data.dataset
0:07 class in this video we will look into
0:09 how you can optimize the performance of
0:11 that
0:12 input pipeline using prefetch and
0:14 caching we'll just go over some theory
0:16 first
0:16 and then we'll write code alright so
0:18 let's get started
0:26 what we discussed in the last video was
0:28 this usually when you have small images
0:31 you load those images into ram from your
0:34 hard disk
0:35 into a numpy array or a pandas dataframe and you
0:37 can train your model easily
0:39 but when you have let's say 10 million
0:41 images you know when your data set is so
0:43 big
0:43 your computer might give you an
0:45 exception like this too much data i
0:47 can't handle it
0:49 therefore the tf.data.dataset is quite
0:53 useful
0:54 because it can load this data into
0:56 batches
0:57 and then train the model on those
1:00 batches
1:01 one by one okay so my batch one batch
1:04 two
1:04 and so on now if you look at the same
1:06 exact picture
1:08 in a cpu gpu kind of time view
1:12 where gpus are mainly used for the
1:15 training so when you're doing
1:16 forward pass backward pass doing all
1:19 those matrix manipulations
1:21 those are happening on your gpu so let's
1:24 say my
1:25 loading of the first batch that is the reading from
1:29 disk is being done by cpu so cpu is
1:32 reading all these images
1:33 into my memory it takes let's say three
1:35 seconds
1:36 then it gives those images to gpu for
1:39 the training and this is batch one
1:41 similarly batch two takes the same time and
1:44 if i plot
1:45 a time view of this whole operation
1:48 you will get a graph like this here the
1:51 first
1:52 you know it took three seconds to read
1:54 batch first
1:55 during that time gpu was sitting idle
1:58 by the way it was not doing anything
2:00 then it took two seconds to train it
2:03 then you
2:04 read the second batch so cpu is reading the
2:06 second batch
2:08 and it takes three seconds overall let's
2:10 say if there are three batches it will
2:12 take 15 seconds
2:14 but we can optimize the performance of
2:17 our data pipeline
2:20 by doing this so assume
2:23 that gpu is processing your batch
2:26 one and during that time
2:30 what if cpu reads batch two
2:34 so both of these units are working in
2:36 parallel
2:38 when gpu is training my batch two
2:41 cpu is getting the next batch
2:45 ready so every iteration
2:46 my cpu is keeping the next batch ready
2:49 okay
2:51 and this approach will take overall
2:54 i think 11 seconds so just compare 11
2:57 seconds versus 15 seconds so you
2:59 just saved time in your training this
3:04 can be done by
3:05 prefetch api so all you have to do is
3:07 here
3:09 tf.data.Dataset.prefetch(1) where 1 means
3:12 how many batches i want to prefetch
3:15 so when i say 1 when the gpu is
3:18 training my batch one it will pull one
3:22 extra batch in the memory if i say
3:24 prefetch two
3:26 then at this point it will prefetch
3:29 batch two and batch three okay normally
3:32 you will supply
3:33 auto-tune argument so you will let the
3:36 tensorflow framework decide for itself
3:38 how many batches it wants to load uh in
3:41 advance
3:42 okay so you will see people using
3:45 autotune very often and if you look
3:48 at the whole pipeline you know
3:50 and this is what we looked into in our
3:52 last video as well so if you have not
3:53 seen last video
3:55 i highly suggest you guys all watch that
3:58 last video
3:59 so here you will see that in a majority of
4:03 the tensorflow code bases
4:04 you will see prefetch being used
4:08 at the very end so you are forming your
4:11 complete pipeline you are saying map
4:13 filter map whatever
4:14 and in the end you will do prefetch so
4:17 that
4:18 both gpu and cpu can work in parallel
4:21 you want to
4:22 make optimal use of your hardware
4:24 resources
4:25 and prefetch allows you to do this
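to make this concrete, here is a minimal sketch of such a pipeline with prefetch placed at the very end; the transforms and values are illustrative, not taken from the video:

```python
import tensorflow as tf

# Illustrative pipeline: map/filter do the per-element work, batch groups
# elements, and prefetch goes at the very end so the next batch is
# prepared on the CPU while the current one is being consumed downstream.
dataset = (
    tf.data.Dataset.range(10)
    .map(lambda x: x * 2)                 # some transformation
    .filter(lambda x: x < 10)             # some filtering
    .batch(2)                             # form batches
    .prefetch(tf.data.AUTOTUNE)           # overlap producer and consumer
)
```

with prefetch last, iterating this dataset overlaps the map/filter work for batch n+1 with whatever the training loop does with batch n.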
4:29 now we are talking about this map
4:31 operation
4:32 so just think about this you are reading
4:34 all these images
4:36 you are converting them into numpy array
4:39 then you are doing filtering you are
4:40 doing mapping you're doing so much
4:41 processing okay
4:43 and when you're running deep learning
4:46 training
4:47 for multiple epochs you are doing the
4:51 same
4:51 operation multiple times so remember
4:55 one epoch means let's say if i have 10
4:57 million images i will go over all of them
5:02 and perform the training that is one epoch in
5:05 the second epoch i will repeat the same
5:07 thing i will again go over those 10
5:09 million images
5:11 and i will be doing all these operations
5:13 map filter map
5:14 so do you see some redundancy here
5:17 you're doing
5:18 you're reading the same files and then
5:22 mapping and filtering again
5:26 so this issue can be addressed by cache
5:28 function
5:30 so this is a pictorial representation of
5:33 you know if you're not using any caching
5:35 what will happen is
5:37 you will see here x axis is the time
5:39 okay so you are spending some time
5:41 opening the file
5:45 some time reading it then mapping
5:47 filtering all doing all this
5:49 transformation then you're training
5:51 again you are reading it mapping again
5:53 training okay
5:54 so up till now up till this vertical
5:56 arrow is good but then when the second
5:58 epoch starts
6:00 again you are opening the same set of
6:03 files so let's say you are
6:05 training some kind of text model where
6:07 you're opening one single file which is
6:10 huge which has let's say 10 million
6:11 lines in it
6:13 then you have to open the file then
6:15 you're reading the file in chunks
6:17 so let's say you read first 10 000 lines
6:20 then you do mapping you do some
6:22 transformation then you do training
6:23 then again you read next set of 10 000
6:27 lines
6:27 and so on and when the first epoch is
6:30 over
6:30 in the second epoch you again open the
6:34 same file same 10 million line file
6:37 then you take a 10 000 line
6:40 chunk then you do all the transformation
6:43 training again next set of 10 000 lines
6:46 and so on
6:47 so you see some redundancy so these
6:49 operations open read map are redundant
6:52 now this is okay if you have a memory
6:54 problem but let's say if you can fit
6:56 something into memory
6:57 then you can use a cache operation and
6:59 what you can do is now watch carefully
7:01 okay
7:04 now
7:05 look at this particular image
7:08 you see so when i do
7:13 tf.data.Dataset.cache()
7:14 what it is going to do is it will
7:17 do all this
7:18 open read map in the first epoch
7:21 but for the second epoch see this is the
7:23 second epoch okay
7:24 this one it will just
7:29 train the model so you are saving your
7:31 time in opening and
7:33 reading the file all right let's get
7:34 into coding now
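before jumping into the video's notebook, here is a tiny demo of that idea; the call counter is my own addition, not from the video, and exists only to show that the mapped function runs during the first epoch only once the dataset is cached:

```python
import tensorflow as tf

# Counter to track how often the "expensive" map work actually runs.
calls = {"map": 0}

def slow_map(x):
    calls["map"] += 1      # pretend this is the expensive open/read/map work
    return x * x

dataset = (
    tf.data.Dataset.range(3)
    .map(lambda x: tf.py_function(slow_map, [x], tf.int64))
    .cache()               # results are stored after the first full pass
)

for epoch in range(3):     # three "epochs" over the same dataset
    list(dataset)

# slow_map ran once per element in epoch 1 only; epochs 2 and 3 hit the cache
```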
7:36 here is the tensorflow official
7:38 documentation where they have explained
7:40 how you get
7:41 you can get better performance by using
7:45 prefetch
7:45 cache etc and in this example
7:49 they have created this artificial dummy
7:52 data set where you can mimic the
7:55 latencies
7:56 in opening the files reading the files
7:59 etc
8:00 so we're going to use the same example
8:01 here and i have a
8:04 jupyter notebook here and here
8:07 the tensorflow version is 2.5.0 which is
8:11 the latest as of this recording so make
8:13 sure you have a latest version because
8:15 some older versions uh have incompatible
8:19 backward incompatible
8:21 apis now i have modified
8:24 this example a little bit just to make
8:26 it little simple
8:28 so what we do is we are going to create
8:31 a class
8:32 with our tf.data.dataset as a base class
8:38 okay so when you supply this
8:42 as the base class it will derive
8:45 this file dataset class from
8:48 tf.data.Dataset
8:51 and again to remind you what we are
8:53 doing here
8:54 is we are measuring the performance
8:58 we will see how using prefetch you can
9:02 optimize the use of cpu and gpu and you
9:05 can get a better training performance
9:07 and to mimic the real life
9:10 you know latencies in reading files or
9:13 reading objects from the storage
9:15 we are creating this dummy class okay so
9:18 the purpose of this dummy class is to
9:20 mimic the real world scenario let's say
9:22 you are reading files from the disk
9:24 okay and i will say okay reading files
9:27 and matches
9:28 and here you supply number of samples
9:31 that you want to read
9:33 so when you read the file first thing is
9:38 open file okay so let's say open file is
9:41 taking
9:42 you know some time so i'm just mimicking
9:46 i'm just putting a dummy time dot sleep
9:48 just to mimic the delay in opening the
9:50 file
9:51 then you start reading let's say few
9:55 lines
9:56 chunk by chunk so let's say you have
9:58 million lines in your file
9:59 you want to read first ten thousand
10:02 lines and so on
10:03 so i will say for sample
10:07 index in range
10:16 so in total let's say there are three samples i'm
10:18 just reading let's say i'm reading
10:20 each line one by one and the delay
10:23 to read each line is let's say
10:26 this much you know 0.015
10:30 seconds and
10:33 you are returning that particular sample
10:36 index
10:38 here again this is a dummy class okay in
10:41 real life you will be reading the file
10:43 you will be returning each line so here
10:46 since i'm interested only in measuring
10:48 performance and not the actual content
10:53 i just yield the sample index and
10:54 yield makes this a generator so if you're not aware
10:57 about generator
10:58 in python go check out my generator
11:01 video so on youtube you can search code
11:03 basics
11:04 python generator you'll get a fairly
11:06 good understanding
11:07 of what generator is then
11:12 we'll override the __new__ method so what this
11:14 __new__ method will do
11:16 is this
11:19 let me just show you __new__ is called
11:23 okay so whenever you create
11:27 an object of this file data set
11:32 __new__ takes one positional argument okay
11:37 here you have to supply the class
11:42 see __new__ is called so whenever
11:45 you create an object of this class this
11:48 particular __new__ method is called
11:50 okay and in this one what i want to do
11:54 is i want to call
12:00 tf.data.Dataset
12:05 .from_generator so in Dataset there
12:08 is a method called
12:10 from_generator where
12:13 you can say okay class dot
12:16 so this class is the class reference and
12:19 that has this method so
12:21 this is your generator and
12:24 you supply output signature
12:28 the output signature describes what it
12:31 returns well it returns an integer so
12:35 a tuple with one integer see
12:38 so a tensor specification tf.TensorSpec
12:42 where you say int64 that's what it
12:44 returns
12:46 and the third argument is args is equal
12:49 to number of
12:50 samples so this is the argument
12:53 number of samples you supply into this
12:55 function okay
12:57 so don't worry about this too much if it
12:59 looks complex as we move ahead in the
13:02 code you will understand it better
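putting those pieces together, the class looks roughly like this, a sketch following the TensorFlow data-performance guide; the sleep durations just mimic file-open and per-record read latency:

```python
import time
import tensorflow as tf

class FileDataset(tf.data.Dataset):
    """Dummy dataset that mimics file-read latency (a sketch based on the
    TensorFlow data-performance guide; delay values are arbitrary)."""

    @staticmethod
    def read_files_in_batches(num_samples):
        # mimic the delay of opening a file
        time.sleep(0.03)
        for sample_idx in range(num_samples):
            # mimic the delay of reading one line/record
            time.sleep(0.015)
            yield (sample_idx,)

    def __new__(cls, num_samples=3):
        # __new__ returns a Dataset built from the generator above
        return tf.data.Dataset.from_generator(
            cls.read_files_in_batches,
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64),
            args=(num_samples,),
        )
```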
13:05 okay now what happens is
13:09 um typically when you have a training
13:11 function
13:12 okay let's say whenever you have
13:14 training function
13:18 you will have
13:19 number of epochs let's say number of epochs
13:22 is 2 by
13:25 default okay so in usual training loop
13:28 what you do
13:29 is for epoch num
13:32 in number of epochs you go through
13:35 each epoch and you will go through each
13:38 sample in a data set
13:41 okay and you will perform a training so
13:44 let's say the training time per sample is
13:45 0.01
13:47 seconds so this training time this time dot
13:49 sleep
13:50 okay is basically let me show you
13:54 here is basically this part
13:57 this time this yellow times time slot
13:59 okay
14:00 and this particular time which is
14:03 reading
14:04 the file lines
14:08 is this and this diagram doesn't have
14:12 this particular
14:13 time but if you want to look at this
14:14 time to read the file
14:16 it is in the other diagram which is this
14:18 see this blue time slot
14:20 okay so i hope that part is clear
14:23 so now your training and
14:27 i'm just introducing artificial delay
14:29 here so here
14:31 i'll just call this function benchmark
14:33 actually because we are benchmarking
14:35 everything
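here is a sketch of that benchmark function, again following the TF guide; the 0.01s sleep stands in for the per-sample training step:

```python
import time
import tensorflow as tf

def benchmark(dataset, num_epochs=2):
    """Iterate a dataset for a few epochs, sleeping 0.01s per sample
    to stand in for the training step (sketch based on the TF guide)."""
    start = time.perf_counter()
    for epoch_num in range(num_epochs):
        for sample in dataset:
            time.sleep(0.01)   # pretend this is the forward/backward pass
    print("Execution time:", time.perf_counter() - start)
```

usage would look like `benchmark(tf.data.Dataset.range(5))`, which iterates the dataset twice and prints the elapsed time.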
14:36 okay and now file data
14:42 set
14:43 is an object so when i do this it
14:45 creates an object of this dataset class
14:48 file dataset class and i want to
14:50 benchmark this
14:54 okay and the way you benchmark it is
14:57 by putting this %%timeit
15:01 cell magic okay
15:05 all right let's see number of samples is
15:09 not defined so
15:12 where is it not defined
15:15 let's see where is my number of samples
15:20 i think it's complaining about this
15:22 particular class not having
15:24 this particular method not having that
15:26 argument so by default lesson number of
15:28 sample is three
15:29 so i fix that value here okay getting
15:33 another error values
15:34 must be a signature okay i need to
15:39 return this actually
15:42 because when you do new you're returning
15:44 a class
15:46 still getting an error okay values must
15:48 be a signature what is it
15:51 okay here i need to pass a tuple so
15:53 maybe that's the problem let's see
15:57 now int object is not iterable for
16:00 epoch num
16:02 in number of epochs the number of epochs
16:04 it has to be a range actually
16:07 ah i'm getting so many errors today
16:10 all right it's gonna work this time
16:13 amazing
16:14 so now it's benchmarking the performance
16:17 of file data set
16:18 as is and let's see this is 321 milliseconds
16:22 so what just happened is this
16:26 so you read those files in batches and
16:29 while cpu was reading your gpu was
16:32 idle
16:32 you read everything sequentially so
16:34 the performance was
16:36 not that great okay now we are you're
16:39 going to use this
16:41 prefetch api and we'll see how that
16:43 improves the performance
16:45 so just copy paste the same thing here
16:49 and just append this with pre fetch
16:54 and in prefetch i'll say prefetch one batch
16:57 or one sample
16:59 now why can i call prefetch because
17:02 file data set is derived from
17:05 tf.data.dataset
17:06 and this has that prefetch method hence
17:09 i can call it from
17:10 a child class as well and when you
17:13 measure the performance
17:15 you see the improvement 253 milliseconds
17:18 close to 70 millisecond difference you
17:20 see here and if you run it for
17:22 more epochs you will see more difference
17:27 and the popular argument to prefetch is
17:29 auto tune
17:30 so people usually supply auto-tune
17:32 argument
17:34 actually it's tf.data.AUTOTUNE
17:38 okay and that will give you
17:42 similar performance but this
17:44 autotune will
17:46 figure out on its own how many batches
17:48 it wants to
17:50 prefetch while your gpu is training okay
17:52 so i hope
17:53 this is clear if you have any questions
17:55 you know please post in a video comment
17:57 below
17:57 but the idea is very very simple we are
17:59 just implementing
18:01 this particular diagram that you're
18:03 seeing here so previously
18:05 like in this line the operations were
18:08 happening in this order you know
18:10 step by step so cpu and gpu was not
18:13 utilized to its optimal level but then
18:17 by doing pre-fetch while gpu is training
18:20 you are using cpu to prefetch your
18:22 next batch
18:23 and since we have these artificial
18:25 delays introduced here
18:27 you can kind of compare the performance
18:30 of two apis
18:31 if you prefetch let's say two or three
18:33 samples the performance is not going to
18:35 change that much okay but
18:39 majority of the time people use this
18:41 auto tune so in our future deep learning
18:43 tutorials you will see
18:44 us using prefetch a lot okay now let's
18:48 talk about
18:49 the cache api okay so cache all right
18:52 what is cache so let's read the
18:54 documentation
18:57 cache api
18:58 caching here
19:03 so here i am reading some documentation
19:05 for
19:06 the cache api so cache
19:10 what it will do is i think we covered
19:12 this in presentation as well
19:14 where if you're reading the file and
19:17 opening it and mapping it and if
19:20 you're running it across multiple epochs
19:23 see for the second epoch you don't need
19:24 to do all these operations so when you do
19:26 dot
19:26 cache you don't see these
19:29 blue and purple blocks here so you're
19:31 saving all the time
19:33 so here we are just going to use
19:35 official tensorflow documentation and
19:38 we'll
19:38 implement that so let's say you are
19:40 creating
19:42 a new data set here
19:45 okay and the dataset is nothing but
19:48 just a bunch of numbers and then
19:53 let me do for
19:57 d in data set
20:00 okay print d dot numpy
20:05 see 0 to 4
20:08 and now let's say i want to compute the
20:10 square of that so how do you do that you
20:12 can do
20:13 map and you can say lambda x
20:17 and return x squared correct
20:21 and that is my data set
20:25 and again if you print this we have
20:27 covered all this in previous videos so
20:29 should be pretty straightforward you're
20:31 just transforming it and you are just
20:33 computing a square of each of these
20:34 numbers
20:36 now if you do cache see
20:39 if i'm running multiple epochs on this
20:42 data set
20:43 then it will have to do this mapping
20:46 multiple times but
20:48 if i just do cache so if i do data set
20:52 is equal to data set
20:53 dot cache if i just do that
20:57 and now when i
21:00 iterate through that so see
21:03 you can iterate through this data set
21:07 using this particular iterator
21:11 see you can do this okay i think you
21:14 might know about this so if you do
21:16 this let me just quickly show you
21:20 so when you're doing this the other
21:23 way of doing the same thing would be
21:27 if you just put it in a list you can do
21:29 the same thing in one line
21:32 so now when you do cache
21:35 it is reading this data from that cache
21:38 when i
21:39 execute it a second time it is reading
21:41 it from the cache
21:43 you know so it is not
21:46 executing this map function
21:49 again
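here is the squares example collected into one runnable snippet; the 0 to 4 values are the same toy numbers used above:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4])
dataset = dataset.map(lambda x: x ** 2)   # in real life this is expensive
dataset = dataset.cache()                 # store results after first pass

first_pass = list(dataset.as_numpy_iterator())    # map actually runs here
second_pass = list(dataset.as_numpy_iterator())   # served from the cache
```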
21:51 if you had not put this in
21:55 cache then every time you do this it
21:58 will be
22:00 computing this map function again so
22:01 that's the benefit now let's apply
22:03 this map function to our original
22:06 file data set
22:10 this guy here okay so
22:13 first i'm going to create some dummy map
22:18 so i will create a dummy map function
22:20 again with some
22:25 time delay let's say time dot sleep
22:29 0.03 now if you're using this in a
22:32 tensorflow
22:33 map api you see eventually my goal
22:37 is to use this in i want to create an
22:40 object of file data set
22:43 and then i want to use this map you know
22:46 this map function
22:48 but when you pass this here
22:51 what happens is let me run it
22:55 you get some error because this
22:57 function
22:59 needs some special processing so you
23:00 need to wrap this
23:02 in tf.py_function
23:06 and say lambda x
23:10 okay so
23:14 you're supplying you are
23:17 saying okay this is the function and then
23:20 these are the arguments
23:22 okay
23:25 and then you are returning that
23:30 same value as it is so the whole
23:33 purpose is basically
23:34 if you don't want to worry too much
23:36 about it see it's introducing some
23:38 kind of delay
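that wrapping step can be sketched like this; the function name and the 0.03s delay are illustrative:

```python
import time
import tensorflow as tf

def mapped_function(s):
    # a plain python function with an artificial 0.03s delay
    time.sleep(0.03)
    return s

dataset = tf.data.Dataset.range(3)
# a plain python function can't be traced directly by map, so wrap it
# in tf.py_function, passing the argument list and the output dtype
dataset = dataset.map(
    lambda s: tf.py_function(mapped_function, [s], tf.int64)
)
```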
23:39 okay so when you do this
23:43 see now this is working now we will
23:45 benchmark this function
23:49 we'll benchmark it let's say run this
23:51 for five epochs okay
23:54 and i want to time it so this will
23:59 measure the time of this particular cell
24:02 the whole cell
24:03 and -n 1 -r 1 just runs the
24:06 one loop basically
24:10 okay so file data set dot
24:13 map let's see what is wrong here
24:17 benchmark is not defined
24:21 okay i have a typo here
24:25 so 1.27 seconds that's what you see here
24:29 and now we'll see how performance can be
24:32 improved
24:33 by using cache so i'm copy pasting same
24:36 code okay
24:37 but after map i'm doing cache
24:41 and when you do that see it takes half
24:43 time
24:44 because it's actually less than half
24:47 time
24:48 you know because what this cache
24:50 has done is see i'm running it for five
24:52 epochs right
24:53 so uh first epoch
24:57 okay when i call map function it will
25:01 introduce a delay but second time
25:04 the data is cached so second time on our
25:07 second third
25:08 fourth and fifth epoch it is not calling
25:11 this map function
25:13 it is using the mapped data from the cache
25:16 itself all right so i hope this gave you
25:19 some idea on prefetch and cache prefetch
25:22 and cache is
25:23 something we'll be using in our future
25:25 videos for
25:26 training tensorflow models using
25:29 tf.data.dataset
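for reference, the full recipe from this video chains all three calls together; the element values here are just for illustration:

```python
import tensorflow as tf

dataset = (
    tf.data.Dataset.range(5)
    .map(lambda x: x * x)          # the expensive transformation
    .cache()                       # skip re-mapping on later epochs
    .prefetch(tf.data.AUTOTUNE)    # keep the next batch ready during training
)
```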
25:31 if you need more information i'm going
25:33 to provide a link of
25:35 all these awesome tensorflow
25:36 documentation pages so go check out the
25:38 video description
25:40 and also the link of this notebook is in
25:42 video description
25:44 please practice this code practice makes
25:47 a man
25:47 or woman perfect friends so you've got
25:50 to practice this so whatever
25:52 code we went through just practice it
25:54 try to change all these parameters try
25:56 to get a sense of it or digest
25:58 what you learn today and if you like
26:00 this video please give it a thumbs up
26:03 your single thumbs up is like the
26:05 fees for
26:06 this free class okay so don't forget to
26:08 give that and
26:10 share it with your friends that's also
26:12 important all right
26:13 thank you very much for watching bye