0:00 On many occasions we need to deploy a
0:02 machine learning model on a cell phone, a
0:05 microcontroller, or a wearable device
0:08 like a Fitbit. Usually machine learning
0:11 models are large in size. So if they're
0:13 running in the cloud on a big machine, that's
0:15 okay. But if you want to deploy them on
0:18 edge devices (by edge devices I mean all
0:20 the devices which I just mentioned),
0:22 then we need to optimize the model and
0:25 reduce the model size. When you reduce
0:28 the model size, it fits the constraints
0:31 of a microcontroller. A microcontroller
0:32 might have only a few megabytes of memory,
0:35 so it meets the requirement of limited
0:37 resources, and inference is also much
0:40 faster. In this video we will look into a
0:42 technique called quantization, which is
0:45 used to make a big model
0:48 smaller so that you can deploy it on
0:50 edge devices. We'll go through some
0:52 theory and then we'll
0:54 do coding as well, as usual. We will
0:57 convert a TensorFlow model into a TF
1:00 Lite model and apply quantization.
1:02 Let's begin!
1:04 Devices like microcontrollers and wearable
1:06 devices have less memory compared to
1:08 your regular computer, and quantization
1:11 is the process of converting a big TF
1:14 model
1:15 into a smaller one so that you can
1:17 deploy it on edge devices. By edge devices I
1:20 mean all these small devices; small
1:23 devices are called edge devices. And if you look at
1:26 your
1:27 neural network model,
1:29 when you save this model to disk you
1:31 are essentially saving all the weights.
1:34 These weights are floats.
1:36 Sometimes they use float64 precision,
1:38 which is eight bytes, so to store one
1:40 number you are using eight bytes;
1:43 sometimes you might be using four bytes.
1:45 So let's say you're using four bytes to
1:47 store one weight. And by the
1:50 way, I have shown you a very simple
1:52 neural network. Actual neural networks
1:54 are much bigger, with many layers and many
1:57 neurons.
1:59 Now if you convert
2:01 this weight into an integer,
2:04 let's say you are just approximating
2:06 this number from 3.72 to 3, then you can
2:10 reduce your storage from 4 bytes
2:13 to 1 byte. This is int8, by the way. And
2:16 if you're using 8 bytes and you go
2:18 from 8 bytes to
2:19 1 byte, that is
2:21 a huge saving in terms of memory.
2:24 So quantization is basically converting
2:27 all these
2:28 numbers, which require more bytes to
2:31 store each individual number,
2:33 into, let's say, int. It's not always int:
2:36 sometimes you are converting from
2:38 float64, which is 8 bytes, to float16,
2:42 which is 2 bytes. In that case too you
2:44 are reducing the memory size. So that is
2:46 basically quantization. It's a simple
2:48 approach. Now, you're not blindly
2:50 converting these
2:52 weights into integers. For example, here
2:55 you have 3.23; you might not save
2:58 it as three, maybe you save it as
3:00 four. There is an algorithm that you have
3:03 to apply, and I'm not going to cover that;
3:05 you can read the research paper online
3:08 on how exactly quantization works. In
3:10 this video I will keep it
3:12 at a very high
3:14 level: you are basically reducing your
3:17 precision,
3:18 and for each individual weight that you want
3:21 to store you are using maybe int8 or
3:23 float16, so that the overall size of the
3:26 model can be reduced. The
3:29 obvious benefits are that you can deploy your
3:33 model on a microcontroller, which might
3:35 have only a few megabytes of memory, and
3:37 even the prediction time is much faster.
3:40 So the performance when you're
3:42 actually making predictions is much
3:44 faster if your model is, let's say,
3:47 int8.
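The precision-reduction idea above can be sketched in plain Python. This is a toy affine (scale and zero-point) quantizer for illustration only; it is not the exact algorithm TF Lite uses internally, which applies per-tensor or per-channel schemes derived from calibration:

```python
# Toy affine int8 quantization sketch (illustrative, not TF Lite's
# internal algorithm): map floats onto integers via a scale and a
# zero point, so that e.g. 3.72 stored in 4 bytes fits in 1 byte.

def quantize(weights, num_bits=8):
    """Map float weights onto signed ints in [-2^(b-1), 2^(b-1)-1]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin)          # float value per int step
    zero_point = round(qmin - lo / scale)      # int that represents 0.0
    q = [max(qmin, min(qmax, round(w / scale) + zero_point))
         for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [3.72, -1.05, 0.0, 2.31]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
# Each restored value is within one quantization step of the original.
```

Notice how 3.23 might round to a neighbor you would not expect: the zero point shifts every value before rounding, which is why quantization is more than naive truncation.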
3:49 There are two ways to perform
3:50 quantization in TensorFlow: post-training
3:53 quantization and quantization aware
3:55 training. In post-training quantization
3:57 you take your trained TF model
4:00 and you use the TF Lite converter. By the way,
4:03 if you don't know TF Lite: TF Lite is
4:05 used to convert
4:07 these models into smaller ones so that
4:09 you can deploy them on edge devices.
4:11 Now when you do this conversion,
4:14 you can see this is a bigger circle and this
4:16 is a little smaller circle. So it will
4:18 already reduce the size,
4:21 because the memory format that it is
4:22 using is different.
4:24 But if you apply quantization at the
4:27 time of conversion, it will make it even
4:30 smaller. You see the smaller circle
4:32 here on the right-hand side?
4:34 Previously it was bigger, but when you
4:36 apply quantization the model size is
4:39 much
4:40 smaller.
4:42 Now, this is a quick approach, but the
4:44 accuracy might suffer. So the
4:46 better approach is quantization
4:48 aware training. In this case you take the TF
4:51 model, then you apply the
4:53 quantize model function on it and you
4:56 get a quantized model. (In TensorFlow; we are
4:58 talking about TensorFlow.)
5:00 And then
5:01 you do
5:02 training again. So this is more like
5:05 transfer learning; you are doing
5:08 fine-tuning here. You're taking your
5:10 model,
5:12 you're doing quantization, and on the
5:14 quantized model
5:16 you're fine-tuning: you are running
5:18 the training again, maybe for fewer
5:20 epochs. And you get a fine-tuned quantized
5:23 model. And that
5:25 you convert again using TF Lite. See, if
5:28 you want to deploy a TensorFlow model on
5:31 edge devices, you have to use TF Lite.
5:33 You have to do the TF Lite conversion; that
5:36 step cannot be avoided.
5:39 This approach is a little more work, but it
5:42 gives you better accuracy. Now let's do
5:44 some coding so that you get a concrete
5:46 idea.
5:47 I'm going to use a notebook which I
5:50 created in one of my deep learning
5:52 videos. So if you go to
5:53 YouTube and search for codebasics deep
5:55 learning, you'll find my tutorial
5:57 playlist. There I made a video on digits
6:00 classification, so I have taken the
6:03 notebook from there. If you don't know the
6:04 fundamentals, I highly recommend you
6:06 watch that video first and then
6:08 continue with this particular video. So
6:11 here,
6:11 as you can see, I have trained a
6:14 handwritten digit classification model
6:16 in TensorFlow, and then I have exported
6:20 that into a saved model. See: model dot
6:23 save, saved model,
6:25 and that created this saved model
6:27 directory, and the size
6:29 of this directory is
6:31 around one megabyte. I have a very simple
6:33 model, but in reality, if you're using a
6:36 big, complex model, the size might even go
6:38 into gigabytes.
6:40 The first approach we're going to
6:42 explore is
6:43 post-training quantization. For that you
6:47 will use the
6:48 tf.lite module. So TensorFlow has
6:51 this tf.lite module, which allows you to
6:54 convert your model into TF Lite format.
6:57 You will use the TFLiteConverter class,
7:00 and
7:01 the method that you're going to use is
7:03 from
7:05 saved model. So here you can supply
7:08 the directory where you have your saved
7:11 model,
7:13 and this will return you a converter, and
7:16 you can simply call converter.convert
7:20 and that will
7:22 return you
7:23 a TF Lite model. So this approach is what
7:28 we discussed during our presentation,
7:30 which is without quantization.
7:32 So even if you directly convert to a TF
7:34 Lite model, your model size will be a
7:36 little less, but if you use quantization
7:38 it will be even smaller.
7:40 So this is without quantization,
7:44 and if you look at
7:46 the size
7:48 (by the way,
7:49 this is just in bytes, okay), you can
7:53 get a rough idea: it is around
7:56 312 kilobytes.
7:58 Now
8:00 I will
8:01 use quantization. For quantization, just
8:04 copy-paste this code and add only one
8:08 line, and that line is
8:11 optimizations:
8:13 you set the converter's optimizations to
8:19 this value,
8:21 and now you've got
8:24 your quantized model,
8:27 and the size
8:29 of
8:30 this quantized model
8:32 is much less. It is almost one-fourth. So
8:35 by doing this
8:36 you converted the model into
8:39 integers. You converted all the weights
8:42 to integers.
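The steps above can be put together in a minimal sketch. The tiny untrained one-layer model here is a stand-in for the notebook's digits classifier, and I'm using from_keras_model on the in-memory model rather than from_saved_model on a directory; both converters accept the same optimizations flag:

```python
import tensorflow as tf

# Post-training (dynamic-range) quantization sketch. The tiny untrained
# model is a placeholder for the notebook's digits classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Plain TF Lite conversion, no quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Same conversion with the one extra line that enables quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# Both are bytes objects you can write straight to .tflite files;
# the quantized one is noticeably smaller.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```

The roughly 4x shrink the video reports (312 KB down to 82 KB) comes from storing float32 weights as int8.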
8:43 Okay, and if you want to read more about
8:46 this API and what other options you have,
8:49 I'm going to link an article in the
8:51 video description below.
8:56 Here we have used this method, which
8:58 is just quantizing the weights.
9:00 You can also quantize the activations too;
9:03 that will be even better. That's
9:05 called full integer quantization, and you
9:08 have to use this particular code.
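For full integer quantization the converter also needs a representative dataset so it can calibrate activation ranges. A sketch, where the tiny model and the random inputs are stand-ins for the real classifier and a few hundred real training samples:

```python
import numpy as np
import tensorflow as tf

# Full integer quantization sketch: weights AND activations are
# quantized, which requires a representative dataset so the converter
# can calibrate activation ranges. Random inputs are a placeholder
# for a few hundred real training samples.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 784).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
tflite_full_int_model = converter.convert()
```

With calibration data, arithmetic at inference time can stay in the integer domain, which is what makes this variant even faster on microcontrollers.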
9:12 Okay.
9:12 Now let me save
9:15 these
9:16 two models
9:17 into files. So I'm going to
9:21 just write this model into a file. So I
9:23 will call it.
9:25 I'll first save the non-quantized model, and
9:28 the extension is .tflite. Since it's
9:31 bytes data, I will use write-binary
9:34 mode, "wb",
9:35 as f,
9:36 and then
9:38 f dot write,
9:39 well,
9:41 this particular one.
9:43 And I can copy-paste this
9:45 and do the same thing for the quantized
9:46 model.
9:47 So,
9:49 here,
9:50 and execute it. Both the files are
9:52 written here, see:
9:55 this model is how much? 312 kilobytes
9:57 without quantization;
9:59 with quantization, 82 kilobytes. Hooray!
10:03 A one-fourth size reduction.
10:06 Now let's talk about quantization aware
10:09 training.
10:10 Post-training quantization is quick, but
10:12 the accuracy might suffer.
10:15 With quantization aware training you can
10:18 get better accuracy. You need to first
10:21 import
10:22 the model optimization module, tfmot,
10:25 for TensorFlow, and
10:28 I will use a method
10:30 called quantize model. Okay, so I'm going
10:33 to use this method called quantize model
10:36 here,
10:37 and let me just save it in a variable so
10:39 that I don't have to write this whole
10:41 thing all the time.
10:43 And
10:45 this is basically a function which I am
10:47 going to call on my regular model. My
10:49 regular TensorFlow model is this, you see,
10:51 the model variable.
10:54 I am applying that quantize model
10:57 method on it,
10:59 and I get
11:01 my
11:01 quantization aware model.
11:04 So if you go to
11:05 my presentation, see, this is the first
11:08 step:
11:09 on your regular TF model, apply the quantize
11:12 function and you get a quantized model. Then
11:15 you have to fine-tune; this is like
11:16 transfer learning. You have to
11:18 run
11:19 training on that model again, maybe with
11:22 fewer epochs.
11:24 So I'm going to
11:26 compile this particular model,
11:29 okay, and for compile I have used the same
11:31 parameters as I used here originally,
11:37 and I'll quickly display the summary.
11:41 Before fine-tuning you need to
11:43 compile, and then
11:44 the summary just shows you how many
11:47 parameters are non-trainable, trainable, and
11:50 so on,
11:52 and I will run
11:53 the training
11:54 for only one epoch. Okay, I think one
11:57 epoch is good; you're already getting 98%
12:00 accuracy.
12:01 And let's measure that on my
12:06 test dataset. Test dataset accuracy is
12:08 also around 97 percent, so my accuracy
12:10 looks
12:12 beautiful.
12:13 And now
12:14 I'm going to use
12:17 the same converter again, okay, but for this
12:20 converter, previously we used from
12:23 saved model, because we were
12:26 loading from the disk.
12:28 Here
12:29 I will use a different API:
12:31 from keras model. So you use from keras
12:34 model if you are converting an in-memory
12:37 model, okay.
12:40 That will get you a converter, and then you
12:43 use the same
12:44 technique. See: converter
12:47 optimizations.
12:50 Let me do this,
12:53 so, optimizations. So here you are
12:56 applying quantization, and then you
12:58 are actually running quantization aware
13:01 training. So it is two steps: you first run
13:03 quantization aware training, and then
13:06 during the TF Lite conversion you apply
13:10 the
13:12 quantization. Okay. And I will save it in
13:14 a different variable.
13:18 Okay?
13:19 And let's write
13:21 this as well to a file, because these are
13:24 just the bytes that you got.
13:26 You need to write them to a file with
13:28 the extension .tflite.
13:31 So now, what have I got? If you go back to
13:34 my
13:35 diagram:
13:37 you quantize, then you do fit for fine-
13:39 tuning, then you do your ultimate TF
13:41 Lite conversion,
13:42 okay,
13:43 and
13:45 the size of this model
13:48 is 80 kilobytes.
13:51 Without quantization aware training it
13:53 was 82 kilobytes. So now
13:55 we are reducing it even further, and the
13:58 main benefit of this model is that the
14:00 accuracy is a little better compared to
14:02 the other approach that we took. So just
14:04 to quickly summarize:
14:06 in this notebook we trained our model
14:09 in the usual way, we saved it to our hard
14:11 disk, we saw the size was one megabyte,
14:14 then we did post-training quantization.
14:17 Without quantization our TF Lite model
14:19 was around
14:21 312 kilobytes;
14:24 with quantization we got an 82 kilobyte
14:26 model. And then, when
14:29 we ran quantization aware training,
14:32 we got
14:33 an 80 kilobyte model. But the main
14:35 benefit of this model was that the
14:38 accuracy is a bit better. I'm going to
14:40 link a few articles in the video
14:42 description below so you can read
14:44 through those
14:45 articles. The purpose of this video was
14:48 just to give you an overview of
14:49 quantization.
14:51 This notebook is available in the video
14:52 description below. So friends, please try
14:54 it out; just by watching the video you're not
14:57 going to learn much unless you practice
14:59 on your own. If you like this video,
15:01 please give it a thumbs up; that is the
15:03 session fee. You know, that is this
15:05 training session's fee; you can do at
15:08 least that much. If you don't like it,
15:10 give it a thumbs down, I'm fine. But
15:11 leave me a comment so that I can improve
15:13 in the future. And share it with your
15:16 friends. I have a complete
15:18 deep learning tutorial series, by the way.
15:20 You see, a complete deep learning tutorial
15:23 series which you can benefit from. There
15:25 are so many exercises as well,
15:27 and I try to explain things in a simple
15:29 way, so share it with your friends who
15:31 want to learn deep learning. Thank you.