
Machine Learning and Computer Vision
A mini tour among concepts and applications

Felice Andrea Pellegrino (1), Walter Vanzella (2)

December 13, 2017

(1) University of Trieste, Trieste

(2) Glance Vision Technologies Srl, Trieste


Sources

• Course: Li Fei-Fei, Andrej Karpathy, Justin Johnson, and Serena Yeung. CS231n: Convolutional Neural Networks for Visual Recognition. Stanford University, 2017. http://cs231n.stanford.edu/

• Book: Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org [freely available online]


Introduction


Definitions

• Machine Learning: the science of getting computers to act without being explicitly programmed.
  • Learn from data
  • Make predictions from “similar” data

• Computer Vision: the science of getting machines to “see”. Computer Vision is concerned with the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images.


Examples: Detection

Face detection, (Viola and Jones, 2001)


Examples: Recognition

SIFT and object recognition, (Lowe, 1999)

Right whale recognition, Kaggle Challenge, https://www.kaggle.com/c/noaa-right-whale-recognition


Examples: Segmentation

(Mnih and Hinton, 2010)

(Ronneberger et al., 2015)


Examples: Classification

ImageNet (www.image-net.org)
• 14M images
• 22K categories
  • Animal (fish, bird, mammal, invertebrate)
  • Plant (tree, flower, vegetable)
  • Instrumentation (utensil, appliance, tool, musical instrument)
  • …
• annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• object detection within an image may be obtained by applying image classification to many sub-images.


Image Classification: the semantic gap

What the computer sees

Class?
• dog
• hammer
• guitar
• cat
• dolphin
• …


Image Classification: the challenges

• Illumination
• Occlusion
• Deformation
• Background
• Intraclass variation


A summer project


Solutions i

(Marr, 1982)

(Marr, 1982)


Solutions ii

“Pictorial structure” (Fischler and Elschlager, 1973)

• ad-hoc detectors, tailored for the different parts
• ad-hoc relation between parts
• detection is formulated as an optimization problem:
  • maximize the local matching
  • minimize the springs’ stretching


A data-driven solution: nearest neighbor

A simple idea:

• store all the N pairs of training images and labels $(I_j, y_j)$, $j = 1, \dots, N$
• define a distance between images $d(I_1, I_2) : \mathcal{I} \times \mathcal{I} \to \mathbb{R}_+$
• when presented a test image $I_{\text{test}}$, output the label $y_{j^*}$ where

$$j^* = \arg\min_{j = 1, \dots, N} d(I_{\text{test}}, I_j)$$
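The nearest-neighbor rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the toy 2×2 "images" and the choice of the L1 distance are assumptions made for the example.

```python
import numpy as np

def nearest_neighbor_predict(train_images, train_labels, test_image):
    """Return the label of the training image closest to test_image (L1 distance)."""
    # Flatten each image to a vector so the distance is a sum over pixels.
    train = train_images.reshape(len(train_images), -1).astype(np.float64)
    test = test_image.reshape(-1).astype(np.float64)
    distances = np.abs(train - test).sum(axis=1)  # d(I_test, I_j) for every j
    j_star = int(np.argmin(distances))            # index of the nearest neighbor
    return train_labels[j_star]

# Toy data (hypothetical values): three 2x2 "images" with labels 0, 1, 2.
train_images = np.array([[[0, 0], [0, 0]],
                         [[10, 10], [10, 10]],
                         [[5, 5], [5, 5]]])
train_labels = np.array([0, 1, 2])
test_image = np.array([[9, 9], [8, 10]])
predicted = nearest_neighbor_predict(train_images, train_labels, test_image)
```

Note that "training" here is only storing the arrays, while every prediction scans the whole training set.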


Distances

Examples of distances ($I_j^p$ is the p-th pixel of image $I_j$)

• L1 distance (sum of absolute differences of corresponding pixels)

$$d(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$

• L2 distance (Euclidean distance)

$$d(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$
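As a small worked example, the two distances can be computed as follows. The function names and pixel values are illustrative assumptions; the cast to floating point is there because subtracting unsigned 8-bit images directly would wrap around.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute differences of corresponding pixels."""
    return float(np.abs(a.astype(np.float64) - b.astype(np.float64)).sum())

def l2_distance(a, b):
    """Euclidean distance between the two images seen as vectors."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sqrt((diff ** 2).sum()))

# Two hypothetical 2x2 grayscale images.
I1 = np.array([[1, 2], [3, 4]], dtype=np.uint8)
I2 = np.array([[4, 2], [3, 0]], dtype=np.uint8)
d1 = l1_distance(I1, I2)  # |1-4| + 0 + 0 + |4-0| = 7
d2 = l2_distance(I1, I2)  # sqrt(9 + 16) = 5
```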


Training and predicting

training ≡ collecting examples

• prediction is computationally demanding
  • need to compare the test instance to all the collected examples


Example of results

CIFAR-10 dataset
• 10 labels
• 50K training images
• 10K test images

First 10 neighbors of the test sample (leftmost column)
• perceptually similar
• class is often wrong


Problems may arise

• pixel-wise distances usually perform poorly
• perceptually similar images may represent very different objects
• perceptually different images may represent the same class of objects
• the distance itself may not explain perceptual differences:

Figure panels: original, translated, messed up, darkened. The three altered images have the same L2 distance from the original image.


Linear classifier

• N images, each containing D pixels
• K classes
• the training set is

$$T = \left\{ (x_i, y_i),\; x_i \in \mathbb{R}^D,\; y_i \in \{1, \dots, K\},\; i = 1, \dots, N \right\}$$

• define a function $f : \mathbb{R}^D \to \mathbb{R}^K$

$$f(x_i, W, b) = W x_i + b$$

  • W: weights matrix
  • b: bias
  • $f_j$ (the j-th component of f) represents the score of class j

• predicted label for test image $x_{\text{test}}$ is

$$\arg\max_j f_j(x_{\text{test}}, W, b)$$
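A minimal sketch of the score computation and prediction, assuming D = 4 pixels and K = 3 classes. The numerical values of W, b and x are made up for illustration, and classes are indexed from 0 in the code, whereas the slide counts from 1.

```python
import numpy as np

def predict(x, W, b):
    """Return the index of the highest-scoring class and the score vector."""
    scores = W @ x + b  # f(x, W, b): one score per class, shape (K,)
    return int(np.argmax(scores)), scores

# Hypothetical parameters: K = 3 rows (classes), D = 4 columns (pixels).
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])
x = np.array([56.0, 231.0, 24.0, 2.0])  # a flattened 4-pixel image
label, scores = predict(x, W, b)
```

Each row of W acts as a template for one class: the score is the inner product between that template and the image, shifted by the bias.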


Interpretation


Training a linear classifier

• need a principled way to select θ = (W, b)
• choose θ in order to minimize a properly defined loss:
  • define the loss as

$$J(\theta) = \underbrace{\frac{1}{N} \sum_i L(f(x_i, \theta), y_i)}_{\text{empirical loss}} + \underbrace{\lambda R(\theta)}_{\text{regularization loss}}$$

  • many possible choices for L and R, e.g.

$$L(f(x_i, \theta), y_i) = \sum_{j \neq y_i} \overbrace{\max\bigl(0,\; \underbrace{f_j(x_i, \theta)}_{\text{score of wrong class}} - \underbrace{f_{y_i}(x_i, \theta)}_{\text{score of true class}}\bigr)}^{\text{positive if wrong class has higher score than true class}}$$

$$R(\theta) = \sum_k \sum_l W_{kl}^2$$

• training becomes an optimization problem:

$$\min_\theta J(\theta)$$
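The loss on this slide can be evaluated as in the sketch below. It follows the formulas as written here (no margin term); the function name `loss_J`, the toy data and the value of λ are assumptions, and classes are indexed from 0.

```python
import numpy as np

def loss_J(W, b, X, y, lam):
    """J(theta) = mean_i L_i + lam * R, with the L and R defined on the slide."""
    scores = X @ W.T + b                               # shape (N, K)
    true_scores = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0.0, scores - true_scores)    # max(0, f_j - f_{y_i})
    margins[np.arange(len(y)), y] = 0.0                # the sum skips j = y_i
    empirical = margins.sum(axis=1).mean()
    regularization = lam * np.sum(W ** 2)              # sum of squared weights
    return empirical + regularization

# Toy problem: N = 2 instances, D = 2, K = 2 (hypothetical values).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.zeros(2)
J = loss_J(W, b, X, y, lam=0.1)  # empirical 0.5 + regularization 0.6
```

Only the second instance contributes to the empirical term: its wrong class scores 2 against 1 for the true class.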


A step forward: features

• So far:

input image → Classifier → label

• Feature-based approach:

input image → Feature extractor → (feature vector) → Classifier → label


Histogram of oriented gradients

Histogram of oriented gradients (HoG) (Dalal and Triggs, 2005)

• Outline:
  • image divided into a grid of cells
  • HoG computed for each cell
  • the descriptor is the input of a linear classifier (SVM)
• Many parameters to be tuned:
  • spatial band of filters
  • # of angular bins
  • # of spatial bins
  • hyperparameters of the SVM
• Some heuristics:
  • histogram smoothing
  • local histogram normalization


Deformable part model

Deformable part model (Felzenszwalb et al., 2008)

• a data-driven version of (Fischler and Elschlager, 1973)
• objects, parts and relationships between parts are learned from data
• multiscale:
  • “root object” at coarse scale
  • parts at fine scale
• each part contributes to the final score
• many parameters to be tuned


Bag of words

Extraction of the visual vocabulary

Training/inference

• aim: find a “standardized” description of the image content
• local descriptors are extracted from the training set:
  • local patches
  • SIFT (Lowe, 2004)
  • …
• clustering produces “visual words”
• each class is characterized by a histogram of visual words
• classification becomes a “comparison of histograms”


Convolution

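• most local image features, for instance the directional derivatives, can be detected by linear spatial filtering.
• convolution operator:

$$g(i, j) = \sum_{k, l} f(i + k, j + l)\, h(k, l)$$

Figure panels:
• original
• vertical Prewitt filtered: $\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$
• horizontal Prewitt filtered: $\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$
• gradient magnitude: $\sqrt{\left( \frac{\partial I}{\partial x} \right)^2 + \left( \frac{\partial I}{\partial y} \right)^2}$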
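The filtering formula and the two Prewitt kernels can be tried on a toy image as sketched below. The function name `filter2d`, the synthetic image, and the convention that k and l start at 0 (kernel anchored at its top-left corner, output only where the kernel fits) are assumptions of this example.

```python
import numpy as np

def filter2d(f, h):
    """g(i, j) = sum_{k,l} f(i + k, j + l) h(k, l), evaluated where the kernel fits."""
    kh, kw = h.shape
    out_h, out_w = f.shape[0] - kh + 1, f.shape[1] - kw + 1
    g = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            g[i, j] = np.sum(f[i:i + kh, j:j + kw] * h)
    return g

prewitt_vertical = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
prewitt_horizontal = prewitt_vertical.T  # [[-1,-1,-1],[0,0,0],[1,1,1]]

# Toy image: dark left half, bright right half (a single vertical edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0
gx = filter2d(image, prewitt_vertical)
gy = filter2d(image, prewitt_horizontal)
magnitude = np.sqrt(gx ** 2 + gy ** 2)
```

The vertical kernel responds only around the edge, while the horizontal kernel gives zero everywhere because all rows of the image are identical.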


A bunch of keywords


Convolutional Neural Networks

• In 2012, a Convolutional Neural Network (CNN) won the ImageNet challenge by a wide margin:

Imagenet classification with deep convolutional neural networks, (Krizhevsky et al., 2012)

• a CNN is an Artificial Neural Network (ANN) with a particular architecture, developed specifically for dealing with images
• nowadays, CNNs are also employed for information other than images
• in the last few years, the term “Deep Learning” has become popular to identify the approach to machine learning based on ANNs having many layers (“deep”)


Artificial Neural Networks


A step backward: Artificial Neural Networks I

• a network of artificial neurons (McCulloch and Pitts, 1943)
• biological analogy (with caution)
• the output of a neuron is obtained by applying an activation function to the weighted sum of its inputs


A step backward: Artificial Neural Networks II

Figures: the perceptron; Frank Rosenblatt (1928-1971)

• first trainable Artificial Neural Network: Rosenblatt’s Perceptron (Rosenblatt, 1958)
• basic procedure:
  • design the architecture (# layers, topology of connections)
  • choose the (possibly different) activation functions
  • learn the weights (based on the provided examples)


Architectures

Figure panels: fully connected, 1 hidden layer (left); fully connected, 2 hidden layers (right)

• left:
  • 4 + 2 = 6 neurons (input layer is disregarded)
  • 3 × 4 + 4 × 2 weights (w) and 4 + 2 biases (b) ⇒ 26 parameters
• right:
  • 4 + 4 + 1 = 9 neurons
  • 3 × 4 + 4 × 4 + 4 × 1 weights (w) and 4 + 4 + 1 biases (b) ⇒ 41 parameters
• a typical CNN has millions of parameters (e.g. VGG has 138M parameters)
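The parameter counts above can be checked with a short helper. The function name is hypothetical; the layer widths (3 inputs, then 4 and 2, or 4, 4 and 1) are read off the arithmetic on the slide.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected net; layer_sizes lists input first."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights + biases

left = count_parameters([3, 4, 2])      # 3*4 + 4*2 weights, 4 + 2 biases
right = count_parameters([3, 4, 4, 1])  # 3*4 + 4*4 + 4*1 weights, 4 + 4 + 1 biases
```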


Activation functions

• output of a neuron

$$f\left( \sum_i w_i x_i + b \right)$$

• $f(\cdot) : \mathbb{R} \to \mathbb{R}$ is the activation function
• many possible choices, e.g.
  • sigmoid: $f(x) = \dfrac{1}{1 + e^{-x}}$
  • hyperbolic tangent: $f(x) = \tanh(x)$
  • rectified linear unit (ReLU): $f(x) = \max\{0, x\}$
• ReLU is currently preferred because
  • it is fast to compute
  • it typically leads to faster convergence (see the comparison with tanh in Krizhevsky et al., 2012)

Figure panels: sigmoid; hyperbolic tangent; ReLU; ReLU vs hyperbolic tangent (Krizhevsky et al., 2012)
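A minimal sketch of a single neuron with the three activation functions listed above. The helper names and the numerical values of w, x and b are assumptions chosen so that the weighted sum equals 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def neuron_output(w, x, b, activation):
    """Apply the activation function to the weighted sum of the inputs plus bias."""
    return activation(np.dot(w, x) + b)

# Hypothetical weights, inputs and bias: w.x + b = 1.0 - 1.0 + 1.0 = 1.0
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
b = 1.0
out_sigmoid = neuron_output(w, x, b, sigmoid)
out_tanh = neuron_output(w, x, b, np.tanh)
out_relu = neuron_output(w, x, b, relu)
```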


Training as an optimization problem

• N instances, each of dimension D, and K classes
• the training set is

$$T = \left\{ (x_i, y_i),\; x_i \in \mathbb{R}^D,\; y_i \in \{1, \dots, K\},\; i = 1, \dots, N \right\}$$

• let the output of the network be

$$f(x, \theta) : \mathbb{R}^D \to \mathbb{R}^K$$

• define the loss as

$$J(\theta) = \frac{1}{N} \sum_i L(f(x_i, \theta), y_i) + \lambda R(\theta)$$

• many possible choices for L and R, e.g.

$$L(f(x_i, \theta), y_i) = \sum_{j \neq y_i} \max\bigl(0,\; f_j(x_i, \theta) - f_{y_i}(x_i, \theta) + \Delta\bigr)$$

$$R(\theta) = \sum_k \sum_l W_{kl}^2$$

• minimize the loss by a proper choice of θ = (W, b)
• iterative minimization
  • the instances are presented repeatedly to the network
  • each cycle of presentation is called an “epoch”
  • the weights and biases are adapted so as to move the actual output toward the desired output
• the most popular training methods are based on the so-called back-propagation


Weight update by gradient descent

• at each step of iteration:
  • compute an estimate g of the gradient $\nabla J(\theta)$
  • update the weight vector:

$$\theta \leftarrow \theta - \epsilon g$$

• $\epsilon$ is called learning rate and is one of the most important learning parameters
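The update rule can be illustrated on a toy problem. The sketch below assumes a simple quadratic loss J(θ) = ‖θ − c‖², whose gradient 2(θ − c) is known in closed form; the function name, the target c and the learning rate are illustrative choices.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, learning_rate, num_steps):
    """Repeat theta <- theta - learning_rate * g for a fixed number of steps."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        g = grad_fn(theta)               # here the exact gradient, not an estimate
        theta = theta - learning_rate * g
    return theta

# Toy loss J(theta) = ||theta - target||^2, minimized at theta = target.
target = np.array([3.0, -1.0])

def grad_J(theta):
    return 2.0 * (theta - target)

theta = gradient_descent(grad_J, [0.0, 0.0], learning_rate=0.1, num_steps=100)
```

With this loss each step shrinks the error by a factor 1 − 2ϵ; for this particular quadratic a learning rate above 1 would make the iteration diverge, which hints at why ϵ matters so much.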


Stochastic Gradient Descent

• the gradient of the loss function is

$$\nabla J(\theta) = \frac{1}{N} \nabla \sum_i L(f(x_i, \theta), y_i)$$

• at each iteration, computing $\nabla J(\theta)$ requires evaluating the whole training set → very slow for large N
• Stochastic Gradient Descent (SGD): at each iteration
  • sample a minibatch of m examples $\{x_1, \dots, x_m\}$ and the corresponding $y_i$
  • estimate the gradient as

$$g = \frac{1}{m} \nabla \sum_i L(f(x_i, \theta), y_i)$$

  • apply the update $\theta \leftarrow \theta - \epsilon g$
• Size of minibatch:
  • m = 1: gradient direction is unreliable (based on a single example)
  • m = N: precise gradient direction, but slow computation
  • typical choices for m are 32, 64, 128, to exploit the GPU
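A minimal sketch of minibatch SGD, under assumptions that are not in the slides: the model is a linear regression with squared loss (so the gradient has a closed form), the data are synthetic and noise-free, and minibatches are drawn by shuffling the data once per epoch.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.1, batch_size=32, epochs=50, seed=0):
    """Fit theta in y ~ X @ theta with minibatch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                    # one epoch = one shuffled pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # minibatch of m examples
            residual = X[idx] @ theta - y[idx]
            g = 2.0 * X[idx].T @ residual / len(idx)  # gradient estimate on the minibatch
            theta -= learning_rate * g                # theta <- theta - eps * g
    return theta

# Synthetic data (hypothetical): 256 instances, 2 features, known true parameters.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta
theta = sgd_linear_regression(X, y)
```

Each update touches only 32 instances, so its cost does not grow with N.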


Improved weight update

• simple update: $\theta \leftarrow \theta - \epsilon g$
• momentum:

$$\begin{cases} v \leftarrow \alpha v - \epsilon g \\ \theta \leftarrow \theta + v \end{cases}$$

• global adaptation of learning rate (“annealing”):
  • step decay: reduce the learning rate by some factor every few epochs.
  • exponential decay: $\epsilon = \epsilon_0 e^{-kt}$
• per-parameter learning rate adaptation:
  • AdaGrad
  • RMSProp
  • Adam (the most recommended, to date)
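The momentum update and the two annealing schedules can be written as small functions. This is a sketch: the function names, the default decay factor and interval, and the toy loss J(θ) = θ² are assumptions made for illustration.

```python
import math
import numpy as np

def momentum_step(theta, v, g, learning_rate, alpha):
    """v <- alpha * v - eps * g, then theta <- theta + v."""
    v = alpha * v - learning_rate * g
    theta = theta + v
    return theta, v

def step_decay(eps0, epoch, factor=0.5, every=10):
    """Multiply the learning rate by `factor` once every `every` epochs."""
    return eps0 * factor ** (epoch // every)

def exponential_decay(eps0, k, t):
    """eps = eps0 * exp(-k * t)."""
    return eps0 * math.exp(-k * t)

# Toy loss J(theta) = theta^2 with gradient 2 * theta (hypothetical example).
theta = np.array([1.0])
v = np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, 2.0 * theta, learning_rate=0.05, alpha=0.9)

eps_at_epoch_25 = step_decay(0.1, epoch=25)  # two halvings: 0.1 * 0.25
```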


Computing the gradient ∇J(θ)

• how can we actually compute the gradient $\nabla J(\theta)$?
• the ANN can be thought of as the composition of many functions:

$$\text{output} = H(\text{input}) = [h_n \circ h_{n-1} \circ \cdots \circ h_0](\text{input})$$

• apply the chain rule:

$$(f \circ g)'(x) = f'(g(x))\, g'(x)$$

“the derivative of a composition of functions is the product of the derivatives”


Backpropagation

Example of backpropagation for a neuron with sigmoid activation function:

$$\sigma = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

• forward step (from input to output, green): sets the “state” and output of the network corresponding to a particular input
• backward step (reverse order, red): by applying the chain rule, computes the derivative of the output w.r.t. the input for the given state

• $s = 2 \times (-1) + (-3) \times (-2) + (-3) = 1 \Rightarrow (1 + e^{-s})^{-1} = 0.731$
• $s = (2 - 0.2) \times (-1) + (-3 - 0.39) \times (-2) + (-3 + 0.2) = 2.18 \Rightarrow (1 + e^{-s})^{-1} = 0.898$ ← increased
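The numbers in this example can be reproduced with a few lines of Python. The assignment w = (2, −3, −3) and x = (−1, −2) is inferred from the arithmetic above, and moving the weights by exactly one gradient (step size 1, in the direction that increases the output) is likewise read off the second bullet; both are assumptions of this sketch.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward_backward(w, x):
    """Forward pass, then chain rule: returns the output and its gradient w.r.t. w."""
    s = w[0] * x[0] + w[1] * x[1] + w[2]       # forward: weighted sum
    out = sigmoid(s)                           # forward: activation
    d_s = out * (1.0 - out)                    # backward: d(sigmoid)/ds
    grad_w = (d_s * x[0], d_s * x[1], d_s)     # backward: ds/dw0 = x0, ds/dw1 = x1, ds/dw2 = 1
    return out, grad_w

w = (2.0, -3.0, -3.0)
x = (-1.0, -2.0)
out_before, grad_w = forward_backward(w, x)
w_new = tuple(wi + gi for wi, gi in zip(w, grad_w))  # unit step along the gradient
out_after, _ = forward_backward(w_new, x)
```

The gradient comes out close to (−0.2, −0.39, 0.2), which matches the weight changes used in the second bullet up to rounding.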


Convolutional Neural Networks


Convolutional Neural Networks

Example of CNN: a sequence of convolutional layers, pool layers and a final fully connected layer


Convolution Layers i

Figure: slide the filter over all spatial locations.

• the 28 × 28 × 1 “image” on the right is called activation map
• each element of the activation map is obtained by computing $w^T x + b$, where w and b are the weights and bias of the filter, and x is a 5 × 5 × 3 chunk of the image


Convolution Layers ii

Figure: slide 6 filters over all spatial locations.

• typically, a bank of filters is employed
• using e.g. 6 filters leads to a 28 × 28 × 6 activation map
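The two slides on convolution layers can be sketched as follows. The 32 × 32 × 3 input size is an assumption (it is the size for which a 5 × 5 × 3 filter with no padding and stride 1 yields a 28 × 28 map); the random image and filters and the function name are illustrative.

```python
import numpy as np

def activation_map(image, w, b):
    """Slide one filter over all spatial locations, computing w^T x + b at each."""
    fh, fw, _ = w.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            chunk = image[i:i + fh, j:j + fw, :]  # 5 x 5 x 3 chunk of the image
            out[i, j] = np.sum(w * chunk) + b     # w^T x + b on the flattened chunk
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                   # assumed input size
w = rng.normal(size=(5, 5, 3))
b = 0.1
single_map = activation_map(image, w, b)          # one filter: 28 x 28

# A bank of 6 filters gives 6 maps, stacked along the depth axis.
bank = [rng.normal(size=(5, 5, 3)) for _ in range(6)]
volume = np.stack([activation_map(image, f, 0.0) for f in bank], axis=-1)
```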


Convolution Layers iii

CONV, ReLU → CONV, ReLU → CONV, ReLU → …

• a ConvNet is a sequence of convolution layers, interspersed with activation functions


Convolution Layers iv

Benefits due to convolutional layers:
• sparse connectivity
• parameter sharing
• equivariance to translation

$$f(T(x)) = T(f(x))$$

“activation maps of a translated image are translated activation maps of the original image”

Figures: sparse connectivity (top) vs full connectivity (bottom); parameter sharing (top) vs individual parameters (bottom)
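Equivariance to translation can be checked numerically. The sketch below assumes periodic (wrap-around) boundaries so that the identity holds exactly at every pixel; with ordinary borders it holds away from the image edges. The function names, image size and shift are arbitrary choices.

```python
import numpy as np

def circular_filter(f, h):
    """g(i, j) = sum_{k,l} f(i + k, j + l) h(k, l), with indices wrapping around."""
    g = np.zeros_like(f, dtype=float)
    for k in range(h.shape[0]):
        for l in range(h.shape[1]):
            # np.roll by (-k, -l) places f[i + k, j + l] at position (i, j).
            g += h[k, l] * np.roll(f, shift=(-k, -l), axis=(0, 1))
    return g

def translate(x, dy=2, dx=3):
    """T(x): shift the image by (dy, dx) pixels with wrap-around."""
    return np.roll(x, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.normal(size=(3, 3))
filtered_then_shifted = translate(circular_filter(image, kernel))  # T(f(x))
shifted_then_filtered = circular_filter(translate(image), kernel)  # f(T(x))
```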


Pooling layer

Figure panels: pooling principle; max pooling

• the pooling layer downsamples the volume spatially, independently in each depth slice of the input volume
• the most common pooling operator is max
• also used: average pooling
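A minimal sketch of max pooling, assuming a 2 × 2 window with stride 2 (a common choice, not stated on this slide) and an input whose height and width are even. The function name and the 4 × 4 example values are illustrative.

```python
import numpy as np

def max_pool_2x2(volume):
    """Downsample an (H, W, D) volume by taking the max over 2x2 blocks, per depth slice."""
    h, w, d = volume.shape
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    # Index [a, i, b, j, c] of the reshaped array is volume[2a + i, 2b + j, c].
    blocks = volume.reshape(h // 2, 2, w // 2, 2, d)
    return blocks.max(axis=(1, 3))

# One depth slice with hypothetical values.
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)[:, :, None]
pooled = max_pool_2x2(x)[:, :, 0]
```

Replacing `max` with `mean` in the last line of the function gives average pooling.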


AlexNet

Imagenet classification with deep convolutional neural networks, (Krizhevsky et al., 2012)

• ILSVRC 2012 winner (15.3% top five error)

Convolutional kernels learned from the first convolutional layer of AlexNet


Feature visualization i

• aim: understanding how a ConvNet works
  • are the activation maps and layers “tuned” to specific patterns?
  • which input pattern activates a specific activation map of a specific layer?
• “Deconvnet” (Zeiler and Fergus, 2014) maps features to pixels (while a convolutional network does the opposite)

A deconvnet layer (left) attached to a convnet layer (right), (Zeiler and Fergus, 2014)


Feature visualization ii

Layers 1-3 Layers 4-5

Random subsets of feature maps and corresponding image patches

• layer 2: low level features
• layer 3: textures (mainly)
• layer 4: more class-specific
• layer 5: entire objects


Feature visualization iii

input image → Low-level features → Mid-level features → High-level features → Classifier → label


Case studies


Case study: VGGNet

• systematic evaluation of many structures
• winner:
  • only 3 × 3 convolution kernels
  • stride 1
  • pad 1
  • only 2 × 2 max pool layers

VGGNet (Simonyan and Zisserman, 2014)


Case study: GoogLeNet

GoogLeNet (Szegedy et al., 2015)

• ILSVRC 2014 winner (6.7% top five error)
• inception modules:
  • “network within a network”
  • filters with different spatial support in the same layer
• auxiliary intermediate classifiers (ignored at inference time)

Figure: Inception module


Case study: ResNet

ResNet (He et al., 2016)

ILSVRC 2015 winner (3.6% top five error)


Training deep networks

• network depth is of crucial importance
• results on the ImageNet competition seem to suggest that “the deeper, the better” BUT
  • when the net is “too deep” a degradation phenomenon occurs
  • the degradation is NOT caused by overfitting
• deeper structures are inherently more difficult to train
• “proof”:
  • take a shallow network
  • add some layers having an equal amount of inputs and outputs (i.e. layers that could implement an identity mapping)
  • train both the networks on the same data
  • the deeper network should produce no higher training error than its shallower counterpart, but experiments with current solvers show that this is not the case


Learning the residuals

• let H(x) be the whole transformation from top to bottom
• instead of learning H(x), write

$$H(x) = x + F(x)$$

and learn the residuals F(x)

• ResNet blocks learn variations from the identity

Figure: typical blocks. With a single layer the authors did not observe advantages.
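The idea H(x) = x + F(x) can be sketched with a toy block. This is a simplification made for brevity: F is built here from two fully connected layers with a ReLU in between, whereas the blocks in (He et al., 2016) use convolutional layers; all names and values are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    """H(x) = x + F(x), with F a small two-layer transformation."""
    F = W2 @ relu(W1 @ x + b1) + b2  # the residual to be learned
    return x + F                     # skip connection adds the input back

x = np.array([1.0, -2.0, 0.5])
zero_W = np.zeros((3, 3))
zero_b = np.zeros(3)
identity_out = residual_block(x, zero_W, zero_b, zero_W, zero_b)
```

With all weights at zero the block reduces to the identity, so an added block does not have to learn the identity mapping from scratch; it only has to learn a departure from it.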


Results

• thin curves: training error, bold curves: validation error.
• for ResNet, deeper networks show better performance as desired
• ResNet-34 has 3.6 billion FLOPs
• VGG-19 has 19.6 billion FLOPs


Transfer learning


Transfer learning I

input image → Low-level features → Mid-level features → High-level features → Classifier → label

• features may be significant per se (especially features at lower levels)
• idea: take a pre-trained network and retrain the classifier only


Transfer learning II

1. train on ImageNet
2. for a small dataset: treat the CNN as a fixed feature extractor and train the classifier only
3. for a medium-sized dataset: use the old weights as initialization and retrain a bigger portion of the net


Example: transfer learning with VGG

• transfer learning based on VGG net
• took the output of the last hidden layer (4096 elements)
• trained a linear Support Vector Machine


CVPR course

• Course on Computer Vision and Pattern Recognition, starting September, 2018

SEE YOU IN SEPTEMBER!


References i

Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, pages 886–893. IEEE, 2005.

Li Fei-Fei, Andrej Karpathy, Justin Johnson, and Serena Yeung. CS231n: Convolutional Neural Networks for Visual Recognition. Stanford University, 2017. http://cs231n.stanford.edu/.

Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

Martin Fischler and Robert Elschlager. The representation and matching of pictorial structure. IEEE Trans. Comput., 1:67–92, 1973.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.


References ii

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

David G Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The proceedings of the seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.

David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.

David Marr. Vision: A computational investigation into the human representation and processing of visual information. San Francisco: W. H. Freeman and Company, 1982.


References iii

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

Volodymyr Mnih and Geoffrey E Hinton. Learning to detect roads in high-resolution aerial images. In European Conference on Computer Vision, pages 210–223. Springer, 2010.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.


References iv

Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
