Instance segmentation is an enhanced type of object detection that generates a segmentation map for each detected instance of an object. Instance segmentation treats individual objects as distinct entities, regardless of the class of the objects. In contrast, semantic segmentation considers all objects of the same class as belonging to a single entity.
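The difference between the two tasks can be sketched with a small synthetic example. The 4-by-6 image size and the two mask variables below are illustrative assumptions, not data from this topic. The instance representation keeps one mask per object, while the semantic representation merges all objects of a class into one mask.

```matlab
% Two hypothetical instances of the same class in a 4-by-6 image.
maskA = false(4,6); maskA(1:2,1:2) = true;   % first object
maskB = false(4,6); maskB(3:4,5:6) = true;   % second object

% Instance segmentation: one mask per object, stacked along dimension 3.
instanceMasks = cat(3,maskA,maskB);          % 4-by-6-by-2 logical array
numInstances = size(instanceMasks,3);

% Semantic segmentation: a single mask for the whole class.
semanticMask = any(instanceMasks,3);         % 4-by-6 logical array
```

The semantic mask covers the same pixels but no longer records which pixels belong to which object.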
Several deep learning algorithms exist to perform instance segmentation. One popular algorithm is Mask R-CNN, which expands on the Faster R-CNN network to perform pixel-level segmentation on the detected objects [1]. The Mask R-CNN algorithm can accommodate multiple classes and overlapping objects.
For an example that shows how to train a Mask R-CNN using Computer Vision Toolbox™, see Multiclass Instance Segmentation using Mask R-CNN.
To train a Mask R-CNN, you need the following data.
Data | Description |
---|---|
RGB image | RGB images that serve as network inputs, specified as H-by-W-by-3 numeric arrays. The sample RGB image used in this topic is a modified image from the CamVid data set [2] that has been edited to remove personally identifiable information. |
Ground-truth bounding boxes | Bounding boxes for objects in the RGB images, specified as a NumObjects-by-4 matrix, with rows in the format [x y w h]. For example, the bboxes variable holds the bounding boxes of the six objects in the sample RGB image: bboxes = [394 442 36 101; 436 457 32 88; 619 293 209 281; 460 441 210 234; 862 375 190 314; 816 271 235 305] |
Instance labels | Label of each instance, specified as a NumObjects-by-1 string vector or a NumObjects-by-1 cell array of character vectors. For example, the labels variable holds the labels of the six objects in the sample RGB image: labels = {'Person'; 'Person'; 'Vehicle'; 'Vehicle'; 'Vehicle'; 'Vehicle'} (a 6×1 cell array) |
Instance masks | Masks for instances of objects. Mask data comes in two formats, one of which is a set of binary masks. For example, the sample RGB image has six binary masks, one per object. |
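As a quick consistency check on the data described in the table, this sketch uses the sample bboxes and labels values and base MATLAB functions only. The variable names areas, classNames, and classCounts are illustrative assumptions.

```matlab
% Sample ground truth from the table: one [x y w h] row per object.
bboxes = [394 442  36 101
          436 457  32  88
          619 293 209 281
          460 441 210 234
          862 375 190 314
          816 271 235 305];
labels = {'Person';'Person';'Vehicle';'Vehicle';'Vehicle';'Vehicle'};

% Each object needs exactly one box and one label.
numObjects = size(bboxes,1);
assert(size(bboxes,2) == 4,'Each box must have the form [x y w h].');
assert(numel(labels) == numObjects,'Need one label per bounding box.');

% Box area in pixels, computed from the width and height columns.
areas = bboxes(:,3) .* bboxes(:,4);

% Count the instances of each class.
[classNames,~,classIdx] = unique(labels);
classCounts = accumarray(classIdx,1);
```

A check like this can catch mismatched annotation counts before the data reaches the training loop.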
To display the instance masks over the image, use the insertObjectMask function. You can specify a colormap so that each instance appears in a different color. This sample code shows how to display the instance masks in the masks variable over the RGB image in the im variable using the lines colormap.
% Assumes masks is an H-by-W-by-NumObjects logical array.
numObjects = size(masks,3);
imOverlay = insertObjectMask(im,masks,'Color',lines(numObjects));
imshow(imOverlay);
To show the bounding boxes with labels over the image, use the showShape function. This sample code draws labeled rectangles using the position and size data in the bboxes variable and the label data in the labels variable.
imshow(imOverlay)
showShape("rectangle",bboxes,"Label",labels,"Color","red");
Use a datastore to read data. The datastore must return data as a 1-by-4 cell array in the format {RGB images, bounding boxes, labels, masks}. The size of the images, bounding boxes, and masks must match the input size of the network. If you need to resize the data, use the imresize function to resize the RGB images and masks, and the bboxresize function to resize the bounding boxes.
For more information, see Datastores for Deep Learning (Deep Learning Toolbox).
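The resizing step can be sketched as a helper that operates on one 1-by-4 cell array. The helper name resizeSample, the synthetic sample, and the target size are illustrative assumptions. The sketch also assumes that the masks are binary masks stacked along the third dimension, and it requires Image Processing Toolbox and Computer Vision Toolbox.

```matlab
% Synthetic stand-in for one datastore read: {image, boxes, labels, masks}.
im = zeros(100,200,3,'uint8');
bboxes = [20 10 40 30];                      % one box in [x y w h] format
labels = {'Vehicle'};
masks = false(100,200,1);
masks(10:39,20:59,1) = true;                 % mask filling the box region
sample = {im,bboxes,labels,masks};

resized = resizeSample(sample,[50 100]);

% With a real datastore (ds is a hypothetical name), apply the helper on read:
% dsResized = transform(ds,@(data) resizeSample(data,[50 100]));

function out = resizeSample(data,targetSize)
% Resize the image, masks, and boxes of one sample to targetSize = [H W].
im = data{1};
bboxes = data{2};
labels = data{3};
masks = data{4};

scale = targetSize ./ size(im,[1 2]);        % [sy sx] scale factors
im = imresize(im,targetSize);
masks = imresize(masks,targetSize,'nearest'); % nearest keeps masks binary
bboxes = bboxresize(bboxes,scale);

out = {im,bboxes,labels,masks};
end
```

Nearest-neighbor interpolation is used for the masks so that the resized values remain logical rather than blended.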
Training a Mask R-CNN network requires a custom training loop. To manage the mini-batching of observations in a custom training loop, create a minibatchqueue (Deep Learning Toolbox) object from the datastore. The minibatchqueue object casts data to a dlarray (Deep Learning Toolbox) object that enables automatic differentiation in deep learning applications. If you have a supported GPU, then a minibatchqueue object also moves data to the GPU.
The next (Deep Learning Toolbox) function yields the next mini-batch of data from the minibatchqueue object.
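A minimal sketch of the mini-batching step follows. It uses synthetic images in an arrayDatastore as a stand-in for real training data, and the sizes and variable names are illustrative assumptions. A datastore for Mask R-CNN returns four outputs per read, so a real queue would typically also need a custom mini-batch function to combine the boxes, labels, and masks.

```matlab
% Synthetic stand-in data: ten 8-by-8 RGB images.
images = rand(8,8,3,10,"single");
ds = arrayDatastore(images,"IterationDimension",4);

% Queue that returns formatted dlarray mini-batches of up to 4 images.
mbq = minibatchqueue(ds, ...
    "MiniBatchSize",4, ...
    "MiniBatchFormat","SSCB");

% Read mini-batches until the queue is exhausted.
while hasdata(mbq)
    X = next(mbq);
    disp(size(X))
end
```

Calling reset on the queue returns it to the start of the data, which is how a new epoch typically begins.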
The Mask R-CNN network consists of two stages. The first is a region proposal network (RPN), which predicts object proposal bounding boxes based on anchor boxes. The second stage is an R-CNN detector that refines these proposals, classifies them, and computes the pixel-level segmentation for these proposals.
The Mask R-CNN model builds on the Faster R-CNN model, which you can create using fasterRCNNLayers. Replace the ROI max pooling layer with an roiAlignLayer that provides more accurate sub-pixel level ROI pooling. The Mask R-CNN network also adds a mask branch for pixel-level object segmentation. For more information about the Faster R-CNN network, see Getting Started with R-CNN, Fast R-CNN, and Faster R-CNN.
This diagram shows a modified Faster R-CNN network on the left and a mask branch on the right.
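As a small illustration of the ROI align step, this sketch creates the layer on its own. The 14-by-14 output size and the layer name are illustrative assumptions, not values prescribed by this topic, and the code requires Computer Vision Toolbox. Swapping the layer into a Faster R-CNN layer graph depends on the layer names in that graph, so that step is not shown.

```matlab
% Create an ROI align layer that pools each region to a fixed-size feature map.
% The output size and name below are assumptions chosen for illustration.
roiAlign = roiAlignLayer([14 14],"Name","roiAlign");

% Inspect the configured output size.
disp(roiAlign.OutputSize)
```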
Train the model in a custom training loop. For each iteration:
1. Read the data for the current mini-batch using the next (Deep Learning Toolbox) function.
2. Evaluate the model gradients using the dlfeval (Deep Learning Toolbox) function and a custom helper function that calculates the gradients and overall loss for batches of training data.
3. Update the network learnable parameters using a function such as adamupdate (Deep Learning Toolbox) or sgdmupdate (Deep Learning Toolbox).
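The three steps above can be sketched with a deliberately tiny stand-in network in place of Mask R-CNN. The network, the synthetic data, the mean squared error loss, and the helper names modelLoss and preprocessMiniBatch are all assumptions made for illustration. A real Mask R-CNN helper would combine the RPN, classification, box regression, and mask losses.

```matlab
% Synthetic stand-in data: 16 observations with 3 features and 1 target each.
XData = rand(3,16,"single");
TData = rand(1,16,"single");
ds = combine(arrayDatastore(XData,"IterationDimension",2), ...
             arrayDatastore(TData,"IterationDimension",2));
mbq = minibatchqueue(ds, ...
    "MiniBatchSize",8, ...
    "MiniBatchFcn",@preprocessMiniBatch, ...
    "MiniBatchFormat",["CB" "CB"]);

% Tiny stand-in network (not Mask R-CNN).
net = dlnetwork([featureInputLayer(3) fullyConnectedLayer(1)]);

avgGrad = [];      % Adam state: running average of gradients
avgSqGrad = [];    % Adam state: running average of squared gradients
iteration = 0;
while hasdata(mbq)
    iteration = iteration + 1;
    [X,T] = next(mbq);                                    % step 1: read mini-batch
    [loss,gradients] = dlfeval(@modelLoss,net,X,T);       % step 2: loss and gradients
    [net,avgGrad,avgSqGrad] = adamupdate(net,gradients, ...
        avgGrad,avgSqGrad,iteration);                     % step 3: update learnables
end

function [loss,gradients] = modelLoss(net,X,T)
% Forward pass, loss, and gradients with respect to the learnable parameters.
Y = forward(net,X);
loss = mse(Y,T);
gradients = dlgradient(loss,net.Learnables);
end

function [X,T] = preprocessMiniBatch(dataX,dataT)
% Concatenate the per-observation cell contents along the batch dimension.
X = cat(2,dataX{:});
T = cat(2,dataT{:});
end
```

To train for several epochs, wrap the while loop in an outer loop and call reset or shuffle on the queue at the start of each epoch.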
[1] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask R-CNN.” ArXiv:1703.06870 [Cs], January 24, 2018. https://arxiv.org/pdf/1703.06870.
[2] Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. “Semantic Object Classes in Video: A High-Definition Ground Truth Database.” Pattern Recognition Letters 30, no. 2 (January 2009): 88–97. https://doi.org/10.1016/j.patrec.2008.04.005.
See also: roiAlignLayer | minibatchqueue (Deep Learning Toolbox)