I’ve always been fascinated by how a machine can detect known objects in an image… Yes, it’s weird… I know…

But if we imagine the range of applications for this kind of algorithm, we may be surprised… From autonomous vehicles to automatic defect detection, crowd counting, etc., they are everywhere…
So I decided to give it a shot and build several models with different techniques:
Classification -> 1 image = 1 label
Segmentation -> 1 pixel = 1 label
Localisation / Detection -> object detection and localisation (bounding boxes)
For this purpose, I used Keras and re-implemented the models with it.
It’s been challenging, especially for YOLO (localisation), which needs a custom loss function because it predicts bounding boxes instead of a single label…
You can find the models on my GitHub.
Classification is the simplest task and uses only simple convolutional layers before some Dense (fully connected) layers. The number of layers depends on your problem’s complexity…
You can also output more than one label per image (imagine a self-driving car that decides to turn left and to speed up from the same picture… like I did before).
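To make the "one label" versus "several labels" idea concrete, here is a minimal sketch in plain Python (no Keras involved). It assumes the last Dense layer produces raw scores: a softmax picks exactly one label, while independent sigmoids with a threshold can switch several labels on at once. The label names, the scores and the 0.5 threshold are made up for the illustration, they are not taken from my models.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Squash one raw score into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-v))

def single_label(logits, names):
    """1 image = 1 label: keep the most probable class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best]

def multi_label(logits, names, threshold=0.5):
    """1 image = several labels: keep every class above the threshold."""
    return [n for n, v in zip(names, logits) if sigmoid(v) >= threshold]

# Hypothetical driving commands and raw network scores
names = ["left", "right", "speed_up", "brake"]
logits = [2.1, -3.0, 1.2, -1.5]

print(single_label(logits, names))
print(multi_label(logits, names))
```

In Keras terms this would roughly correspond to choosing a softmax or a sigmoid activation on the last Dense layer, but that mapping is my shorthand here, not a description of the exact models.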
Segmentation is as simple as the previous task, but this time we classify the pixels themselves (or small groups of pixels)… So the network should output W x H x C elements, with W the image’s width, H the image’s height and C the number of classes you want to predict. I used a U-Net Keras implementation for this purpose.
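To show what "1 pixel = 1 label" means on the output side, here is a small sketch in plain Python. This is not the U-Net itself, only the last step: it assumes the network has already produced C scores per pixel, stored rows first (so H x W x C), and keeps the best class for each pixel. The tiny 2 x 3 "image" and its scores are invented for the example.

```python
def label_map(scores):
    """scores[y][x] is a list of C class scores.

    Returns an H x W grid holding the index of the best class per pixel.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]

# Hypothetical output for a 2 x 3 image (H=2, W=3) with C=2 classes
scores = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    [[0.3, 0.7], [0.55, 0.45], [0.1, 0.9]],
]

mask = label_map(scores)
for row in mask:
    print(row)
```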

Last but not least, maybe the most challenging and interesting one in my opinion… YOLO!
You Only Look Once is an algorithm which splits the image into S x S cells (7 x 7 here) and predicts, for each cell, the probability of an object having its center in that cell, the width and height of its bounding box, along with a class label and a confidence…
So the output of the model should be S x S x (5 + C) if we detect only one object per cell, with C the number of classes… (the “5” represents the 4 coordinates (x, y, w, h) of a bounding box plus the model’s confidence in its prediction).
Example on 3 x 3 grid:

All you have to do is put your data into the right shape and you’re good to go!
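Getting the data into that shape can be sketched like this in plain Python. It is a minimal illustration under a few assumptions of mine: boxes are given as (x_center, y_center, w, h, class_id) with coordinates normalised to [0, 1] over the whole image, each cell stores [x, y, w, h, confidence, one-hot classes], and x, y are stored relative to the cell. The function name and this exact layout are hypothetical, your own data may be organised differently.

```python
def encode_targets(boxes, S=3, C=2):
    """Build an S x S x (5 + C) target grid, one object per cell.

    boxes: list of (x_center, y_center, w, h, class_id), normalised to [0, 1].
    """
    target = [[[0.0] * (5 + C) for _ in range(S)] for _ in range(S)]
    for xc, yc, w, h, cls in boxes:
        # Find the cell that contains the box center
        col = min(int(xc * S), S - 1)
        row = min(int(yc * S), S - 1)
        one_hot = [1.0 if i == cls else 0.0 for i in range(C)]
        # x, y are offsets inside the cell; confidence is 1 for a real object.
        # If two boxes fall in the same cell, the last one wins.
        target[row][col] = [xc * S - col, yc * S - row, w, h, 1.0] + one_hot
    return target

# One hypothetical object of class 1, centered in the middle of the image
target = encode_targets([(0.5, 0.5, 0.4, 0.2, 1)], S=3, C=2)
print(target[1][1])
```

With a 3 x 3 grid the object lands in the middle cell, and every other cell stays at zero.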
Then you need to write the loss by hand because Keras can’t do it for you…
Basically, it computes the squared difference between the predicted x, y, w, h values, classes and confidence and the ground truth. It also uses “Intersection over Union” (IoU) to keep only one box for a given object…
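The two ingredients mentioned above can be sketched in plain Python. This is a simplified illustration, not the loss I actually trained with: it assumes boxes are (x_center, y_center, w, h) in the same units, and it sums the squared differences without any weighting, whereas a real YOLO loss usually weights the terms differently. Both function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_center, y_center, w, h) boxes."""
    # Convert center/size to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is zero when the boxes do not touch
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def cell_squared_error(pred, truth):
    """Unweighted sum of squared differences for one grid cell."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

# Two 2 x 2 boxes shifted by 1 along x: they share half of their area
print(iou((1, 1, 2, 2), (2, 1, 2, 2)))

# Hypothetical prediction vs ground truth for one cell: [x, y, w, h, conf, classes...]
pred = [0.5, 0.5, 0.4, 0.2, 0.9, 0.1, 0.8]
truth = [0.5, 0.5, 0.4, 0.2, 1.0, 0.0, 1.0]
print(cell_squared_error(pred, truth))
```

When several predicted boxes overlap the same object, the one with the highest IoU against the ground truth is the natural candidate to keep.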
To be honest, I took the loss function from here (and made some modifications to make it fit my data):
https://blog.emmanuelcaradec.com/humble-yolo-implementation-in-keras/
This is an awesome starting point!