Deep learning project end to end | Potato Disease Classification - 2: Data collection, preprocessing | codebasics
Summary
Core Theme
This content outlines the initial steps of a data science project focused on potato disease classification, detailing the process of acquiring, preparing, and structuring image data for model training using TensorFlow.
Any data science project starts with the data collection process. AtliQ Agriculture has three options for collecting data. First, we can use ready-made data: we can either buy it from a third-party vendor or get it from Kaggle, etc. The second option is to have a team of data annotators whose job is to collect these images from farmers and annotate each image as either a healthy potato leaf or one with early or late blight disease. This team of annotators can work with farmers: they can go to the farmers' fields and ask the farmers to take pictures, or take the pictures themselves, and then classify them with the help of the farmer or by domain knowledge, so the diseased potato plants are separated from the healthy ones. So they can collect the data manually. This option is expensive; it requires budget, so you have to work with your stakeholders and get the budget approved, and it might be time consuming as well.
The third option is that data scientists can write web scraping scripts that go through different websites with potato images and collect them, and then use tools like Doccano; there are many tools available that can help you annotate the data. So you either annotate the data yourself or get annotated images by using those web scraping tools.
In this project we are going to use ready-made data from Kaggle. We will be using this Kaggle dataset for our model training; you can click on the download button. It's around 326 megabytes of data, and it has not only the images for potato disease classification but tomato and pepper disease classification as well. We are going to ignore all of that and focus only on these three potato directories. I had already downloaded this zip file previously; when I right-click and do Extract All, I get this folder. The folder also had the tomato and pepper directories, but I deleted those manually, and I ask you to do the same thing: go here, delete all the directories except these three, and then copy-paste this directory into your project directory. For my project directory, I have a code folder on my C drive, and in it I am going to create a new folder called potato-disease.
I want all of you to practice this code along with me. If you just watch my video, it's a waste of your time; practice as you watch, and only then is it useful. This is the best advice that someone can give you, okay? I have this folder ready for my project, and in it I'm going to create a new folder called training. Then I'm going to launch Git Bash, which allows me to run all the Unix commands; you can use the Windows command prompt as well.
I will run python -m notebook, which is going to launch my Jupyter notebook, and in it I will locate my potato-disease folder, go into training, and create a new Python 3 file; this will be my model notebook. You can name it "training" or whatever, just give this particular notebook some name. Then we are going to import some essential libraries. The purpose of this video is to load the dataset into a tf.data input pipeline, do some data cleaning, and make our dataset ready for model training. That's the purpose of this video.
So here, let me import some essential modules.
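The exact import list isn't shown in the transcript, so take this as a rough sketch of what the rest of the walkthrough relies on:

```python
# Assumed imports for this notebook (not shown verbatim in the video)
import tensorflow as tf
from tensorflow.keras import models, layers
import matplotlib.pyplot as plt
```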
The first thing I'm going to do: in my Downloads folder somewhere I had this PlantVillage directory, right? So I'm going to Ctrl+C that PlantVillage directory and Ctrl+V it here; I copy all those images into the same folder where I'm running this notebook, my .ipynb notebook. So you see, now I have this directory.

If you look inside, this is early blight: there are a thousand images here, and if you look at these images you see these black spots, which show that the potato plant has some kind of disease. If you look at the healthy plants, the healthy leaves look healthy: there are no black spots and they look pretty good. The other class, late blight, is a little more deteriorated; look at these leaves, they look pretty horrible. So we have all this data here in our directory.
Now I'm going to use TensorFlow's dataset API to load these images into a tf.data.Dataset. If you don't know about tf.data.Dataset, you need to pause this video right now, go to YouTube, search for "tensorflow data input pipeline", and you will see my video; you need to watch it, it will clarify your concepts. Basically, what's the purpose of tf.data.Dataset? Let's say you have all these images on your hard disk; you can read these images in batches, because there could be so many images, right? If you read these images in batches into this tf.data.Dataset structure, then you can do things like .filter and .map; you can do amazing things. So please watch that video. I will now assume that your concepts around tf.data datasets are clear and we can load the data.
We load it using this particular API: tf.keras.preprocessing.image_dataset_from_directory. What does this do? You can search for "tensorflow image_dataset_from_directory" and it will show you the API documentation. You specify a directory first: let's say you have a main directory, inside it you have your classes, and under those are all the images; this one call will load all the images into a tensor, basically into your dataset. So the first argument is the directory. What is our directory? Let me write it here: our directory name is PlantVillage, correct? See, PlantVillage, that's our data directory.
Then I will say shuffle=True so that it randomly shuffles the images as it loads them, and then I will specify the image size. What is my image size? Let me go here and open these directories; if you look at any image, you see 256 by 256 - all of these images are 256 by 256, you can verify that. So I will say 256 by 256, but I will create a couple of constants, because I need to refer to them later; so 256 is my image size. For my batch size, 32 is kind of a standard batch size; I will again store that into a constant and initialize it here. And that's pretty much it; I will just store this into a dataset variable. Okay, I did not run this, so let me run it.
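Put together, the loading step looks roughly like this (a sketch; the constant names are assumed from how they're referred to later):

```python
# Constants referenced later in the walkthrough
IMAGE_SIZE = 256
BATCH_SIZE = 32

# Load all images under the PlantVillage folder into a tf.data.Dataset,
# shuffled and batched, with one class per subdirectory.
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "PlantVillage",
    shuffle=True,
    image_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
)
```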
It loaded 2152 files belonging to three classes. Which three classes? You can just do dataset.class_names; I will store that into a variable so that I can refer to it later, and these are the class names. Basically your folder names are your class names - see, these are the three folder names. If you look at the folders, the first has a thousand images, the second has 152, and the third has a thousand, so that's two thousand one hundred fifty-two.
Now look: if I take the length of the dataset, do you have any clue why it is showing 68? Just pause the video and think about it. It's because every element in the dataset is actually a batch of 32 images, so if you do 68 times 32 - see, the last batch is not full, so 68 times 32 comes out a little more than 2152 images - but you get the idea of why this is 68.
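A quick sketch of those two checks (the variable name for the class names is my assumption):

```python
class_names = dataset.class_names
print(class_names)    # the three folder names: early blight, late blight, healthy
print(len(dataset))   # 68 batches: 2152 images in batches of 32, last batch partially filled
```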
Let's just explore this dataset. I will say: for image_batch, label_batch in dataset.take(1). When you do this, it gives you one batch. One batch is how many images? 32 images, okay? So I will print just the shape of the image batch, and for the labels I will do .numpy(), because every element you get is a tensor, so you need to convert it to numpy; again, if you don't know this concept, refer to the video I mentioned earlier. You find that there are 32 images, each image is 256 by 256 - and do you know what the last dimension is? You guys are smart: it's RGB, it's the channels, basically you have three RGB channels, and I'm going to initialize that as a constant as well so I can refer to it a little later. And the label batch, as you already realize, has zeros, ones, and twos - so there are three classes, three labels.
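Roughly, that exploration loop looks like this (variable names follow the narration; the CHANNELS constant name is assumed):

```python
CHANNELS = 3  # RGB

# Take one batch (32 images) and inspect its shapes and labels
for image_batch, label_batch in dataset.take(1):
    print(image_batch.shape)    # (32, 256, 256, 3): batch, height, width, channels
    print(label_batch.numpy())  # 32 integer labels in {0, 1, 2}, one per class
```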
If you want to print, let's say, an individual image - forget the whole batch of 32, I will just print the first image - you see it's a tensor. If you want to convert a tensor to numpy you call .numpy(), and you find all these numbers, a 3D array where every number is between 0 and 255; color is represented with values from 0 to 255, so that's what this is. And again, if you take the shape of this, you'll find 256 by 256 by 3 for the first image. Got it? All right.
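A minimal sketch of that single-image inspection, reusing the same assumed names:

```python
for image_batch, label_batch in dataset.take(1):
    first_image = image_batch[0].numpy()  # convert the tensor to a numpy array
    print(first_image)                    # 3D array of pixel values between 0 and 255
    print(first_image.shape)              # (256, 256, 3)
```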
Now let's try to visualize these images. Say I want to visualize this image; I can use plt.imshow - this is Matplotlib. When you do imshow it expects a 3D array, so what is my 3D array? It's this first image that I'm printing. With the raw numpy there is some problem because it is float, so I convert it to int, and now you should see it working. I don't care about all the numbers printed below, so I will just hide them. By the way, every time you run this it shuffles, so that's why you see a different image each time; it has shuffle randomness to it.
The axis is off; now I want to display the label, like what image this is. How do I display that label? Well, you can do plt.title. And what is my title? My title is the label from label_batch, okay? But this will give you the number zero, one, or two - how can you get the actual class name? Well, we have class_names, so you supply the label as an index into it; I hope you are getting the point. See: potato early blight.
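Sketched out, visualizing one image with its class name as the title looks something like this (assuming the names introduced above):

```python
for image_batch, label_batch in dataset.take(1):
    # imshow expects integer (or 0-1 float) pixel values, so cast the float tensor
    plt.imshow(image_batch[0].numpy().astype("uint8"))
    plt.axis("off")  # hide the axis ticks and numbers
    # label_batch[0] is an integer class index; look up the readable class name
    plt.title(class_names[label_batch[0].numpy()])
```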
I want to display a couple of these images, so I will just run a for loop. Out of the first batch of 32, let's say I want to display 12 images, and instead of index 0 I will use i. I hope that is clear. If you run this, it shows just one image - why? Because you need to make a subplot. A subplot of three by four is almost like a matrix, and if you do this, it shows all the images, but the dimensions are kind of messed up, so I will just increase the figure size to 10 by 10, and look, wonderful - it shows me all the images beautifully. This is a healthy leaf, this is early blight, late blight, and so on.
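A rough version of that 12-image grid (the figure size and loop structure follow the narration; the rest is assumed):

```python
plt.figure(figsize=(10, 10))  # make the grid large enough to read
for image_batch, label_batch in dataset.take(1):
    for i in range(12):
        plt.subplot(3, 4, i + 1)  # 3 rows x 4 columns grid
        plt.imshow(image_batch[i].numpy().astype("uint8"))
        plt.title(class_names[label_batch[i].numpy()])
        plt.axis("off")
```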
Now we are going to split our dataset into train and test splits, okay? The dataset length is 68 - the actual length, by the way, is 68 times 32, because each element is a batch of 32. What we will do is keep eighty percent of the data as training data; then from the remaining twenty percent we will do two splits: one ten percent split for validation, and the remaining ten percent for test. The validation set is used during the training process: after each epoch you do validation on this ten percent. So let me define the epochs - I am going to run 50 epochs; this is trial and error, it could be 20 or 30. We'll run, say, 50 epochs, and at the end of every epoch we use this validation dataset to do the validation. Once we are done with the 50 epochs and have the final model, we use the remaining ten percent, called the test dataset, to measure the accuracy of our model: before we deploy the model into the wild, we use this test dataset to test its performance.
How do you get this split? In sklearn we have the train_test_split method; if you use statistical machine learning with sklearn, we have that, but we don't have it in TensorFlow. We are going to use dataset.take: when you do dataset.take(10), it takes the first 10 elements. What is our train size? The train split is 0.8, because it is 80 percent, and the length of our dataset is 68. So what is 80 percent of 68? Well, 54 (after truncating to an integer). So I can now take the first 54 samples - the first 54 batches, actually; each batch is 32 images, so it's much simpler to work in batches - and call it the train dataset.
Okay, so that's my train dataset, and if you take its length - I hope you're practicing along with me - you find 54. And if you do dataset.skip(54), it means you are skipping the first 54 batches and getting the remaining 14. This is like using the slicing operator on a Python list: skip(54) is like [54:] onwards, and take(54) is like the first 54, [:54]. So if you know Python a little bit, this should be clear.

Temporarily I will save this as the test dataset, but this is not actually the test dataset; it is the remaining 20 percent, and in that you need to again split into validation and test, correct? So I have 14 batches, and my validation size is ten percent of my actual dataset, so I need six batches basically; when I take those from this temporary test dataset, I get my validation dataset, which has six batches. Then you do skip on the same thing, and that gives you your actual test dataset. So we just split our dataset into train, validation, and test datasets.
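Strung together, the manual split looks roughly like this for a 68-batch dataset (a sketch; EPOCHS is the constant defined a moment ago):

```python
EPOCHS = 50  # chosen by trial and error, per the narration

train_size = int(len(dataset) * 0.8)   # 0.8 * 68 -> 54 batches
train_ds = dataset.take(train_size)    # first 54 batches

remaining = dataset.skip(train_size)   # remaining 14 batches (roughly 20%)
val_size = int(len(dataset) * 0.1)     # 0.1 * 68 -> 6 batches
val_ds = remaining.take(val_size)      # 6 batches for validation
test_ds = remaining.skip(val_size)     # the rest (8 batches here) for testing
```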
Now, the code I wrote was using hard-coded numbers; it's just a prototype. So let's wrap all of this into a nice-looking Python function. The goal of this function is to take the TensorFlow dataset, and it should also take the split ratios: I'm saying that if you don't supply anything, by default it will use 80 percent train, 10 percent validation, and 10 percent test. I'm also going to add a shuffle argument - I'll explain why - and a shuffle size of 10,000. If you don't know about shuffle size, again, watch the other video I referred to; it's very important that you watch it. In the end, the function will return the train, validation, and test splits. Whatever code we are writing, we are just wrapping it into a nice-looking Python function, that's it.
So, what is my dataset size first of all? The dataset size is the length of the dataset. Then my train size is the train split - 80 percent of that - and I convert it into an integer because I don't want these float numbers. That's my train size, and my validation size is computed the same way. Now my train dataset is basically what we did previously, which is ds.take(train_size), and when you do ds.skip(train_size) you get the remaining 20 percent of the samples; from that you again take the validation size, and that's where you get your validation dataset, and if you do the same thing and just do skip there, you get your test dataset. I hope that is clear. Now, we have the shuffle argument: if shuffle is set, I shuffle the dataset before we split it into train and test, and the seed is just for reproducibility - if you use the same seed, it will give you the same result every time; it is just a seed number, it can be anything, 5, 7, anything. Okay, my function is ready.
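A sketch of that helper, assembled from the narration (the function name is my placeholder, and the seed value is arbitrary, as the video says):

```python
def get_dataset_partitions_tf(ds, train_split=0.8, val_split=0.1, test_split=0.1,
                              shuffle=True, shuffle_size=10000):
    """Split a batched tf.data.Dataset into train, validation and test subsets."""
    ds_size = len(ds)

    if shuffle:
        # Shuffle before splitting; a fixed seed keeps the split reproducible
        ds = ds.shuffle(shuffle_size, seed=12)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)                # first ~80% of the batches
    val_ds = ds.skip(train_size).take(val_size)   # next ~10%
    test_ds = ds.skip(train_size).skip(val_size)  # remaining ~10%

    return train_ds, val_ds, test_ds
```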
Now I can call my function on my dataset. What is the name of my dataset? Here it is - you see, dataset - so we read all the images into this dataset and now we are doing the train-test split. Okay, see, this ran really fast, and I will just confirm the sizes of my train, validation, and test sets, and they come out to be what we expect them to be.
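Calling it and sanity-checking the sizes might look like this:

```python
train_ds, val_ds, test_ds = get_dataset_partitions_tf(dataset)
print(len(train_ds), len(val_ds), len(test_ds))  # roughly 54, 6, 8 batches
```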
Now, once again, if you have seen my video on the TensorFlow data input pipeline, you would have understood the concepts behind caching, prefetching, etc. That's what we are going to do here. For the training dataset that we have, we will first do caching: it reads the image from disk, and then for the next iteration, when you need the same image, it keeps that image in memory, so this improves the performance of your pipeline; again, watch that video to get a good understanding. Shuffle - how shuffle(1000) works - again, you need to watch that video; shuffle(1000) will reshuffle the images, and yes, the buffer can be less than a thousand as well. Then prefetch: if you're using a GPU and a CPU, while the GPU is busy training, prefetch will load the next batch from your disk, and that improves the performance.
Actually, if you look at my deep learning playlist, I have a prefetch-and-cache video there, and I can quickly show you. Usually, when you are loading batches - say 32 images at a time - and a GPU (a Titan RTX) is training, you are not using the CPU while the GPU is training: the CPU is sitting idle, and when the GPU is done, the CPU reads the next batch while the GPU sits idle. Let's say, in this example, that takes around 12 seconds. But if you use prefetch and caching, then while the GPU is training batch one, the CPU is already loading the next batch - that's your prefetch, basically. And cache is this: if you have already read an image - see, the blue dot means an image was read - then during the second epoch you would be reading the same images again, but if you use cache, you don't see that blue block, so you save the time spent reading those images. I will link these videos, by the way, but if you search for "codebasics deep learning tutorials", these are the two videos I am referring to. So, back to the tutorial.
That's what I'm doing here, and I'm letting TensorFlow determine how many batches to prefetch while the GPU is training, and then you assign this back. My validation and test datasets will use the same pattern, and now these datasets are optimized for training performance, so my training will run fast.
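The optimization step described here would look roughly like this (AUTOTUNE lets TensorFlow pick the prefetch buffer size; older versions expose it as tf.data.experimental.AUTOTUNE):

```python
# Cache decoded images in memory, re-shuffle each epoch, and overlap
# data loading with training by prefetching upcoming batches.
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
test_ds = test_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
```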
Now we need to do some preprocessing. If you have worked on any image processing, you know the first thing we do is scale: the numpy array we saw previously had values between 0 and 255 - it's an RGB scale - and you want to divide by 255 so that you get numbers between 0 and 1. The way you do that is by creating a tf.keras.Sequential and supplying my preprocessing pipeline to it, and the way you do rescaling is with the Rescaling API. Don't worry about the "experimental" in the name, by the way; it is stable - I actually had a conversation with the TensorFlow folks about this and they say it is stable. So 1.0 divided by 255 will just scale the image values, and we will supply this layer when we actually build our model.
We need to do one more thing, which is resizing: we will resize every image to 256 by 256. Now you will immediately ask me: our images are already 256 by 256, why do we need to resize them? But this layer that we are creating - let me create it - this resize-and-rescale layer will eventually go into our ultimate model, and when we have a trained model and it starts predicting, if during prediction you supply an image which is not 256 by 256, some different dimension, this layer will take care of resizing it. That's essentially the idea here.
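As a sketch, the combined preprocessing layer might look like this (using the experimental preprocessing path the narration mentions; newer TensorFlow versions expose the same layers directly as tf.keras.layers.Resizing and tf.keras.layers.Rescaling):

```python
resize_and_rescale = tf.keras.Sequential([
    # Resize any incoming image to 256x256, even at prediction time
    layers.experimental.preprocessing.Resizing(IMAGE_SIZE, IMAGE_SIZE),
    # Scale pixel values from [0, 255] down to [0, 1]
    layers.experimental.preprocessing.Rescaling(1.0 / 255),
])
```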
Once we have created this layer, one more thing we are going to do in terms of preprocessing is use data augmentation to make our model robust. Let's say you train a model on some images, and then when you try predicting, you supply an image which is rotated, or which has different contrast; your model will not perform well. For that we use the concept of data augmentation. On YouTube, search for "tensorflow data augmentation" and you will find my video; you must watch it. What we do there is this: say you have one original image in your training dataset; you create four new training samples out of it by applying different transformations - a horizontal flip, contrast (you see, the contrast is increased in this image), and so on. You take the same image, apply some filter, some contrast, some transformation, and generate new training samples. Here, see, I rotated the images, and I will now use all five images for my training. So from one image I create four extra images and use all five for training, so that my model is robust: tomorrow, when I start predicting in the wild, if someone gives me a rotated image, my model knows how to predict it.
So that's the idea behind data augmentation, and as you have seen in that video, TensorFlow provides beautiful APIs for it. Again, you are doing the same thing: you create a couple of layers - I'm going to apply a random flip and some rotation; if you watch that video or the other one, you will get a clear understanding. So that's my data augmentation layer, which I'm going to store here. By the way, the resize-and-rescale and data augmentation layers will ultimately be used in my actual model.
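A minimal sketch of that augmentation layer, with the flip mode and rotation factor as assumptions (the narration only says "a random flip and some rotation"):

```python
data_augmentation = tf.keras.Sequential([
    # Randomly flip images horizontally and vertically
    layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
    # Randomly rotate images by up to ~20% of a full turn
    layers.experimental.preprocessing.RandomRotation(0.2),
])
```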
That's all I had for this video; in the next video we are going to build a model and train it. Just to summarize: in this video we loaded our data into a TensorFlow dataset, did some visualization, then did the train-test split, and did some preprocessing. We have not completed the preprocessing - we just created the layers for it - and we will use these layers in our actual model.
I hope you're liking it, and I hope you are excited to see the next video, where we'll actually be training the model - it's going to be a lot of fun. If you're liking this series, please share it with your friends and give it a thumbs up; it helps me with YouTube ranking, and this project can reach more people who are trying to learn. The thing about YouTube is that the learning is free, so if you are doing free learning, at least you can give it a thumbs up. I mean, give it a thumbs down if you don't like it - I don't mind - but if you do, please leave a comment so that I can improve. Thank you for watching.