machi e leaØ i g a d c¡ ÍċöeØ...

Machine Learning and Computer VisionA mini tour among concepts and applications

Felice Andrea Pellegrino1 Walter Vanzella2

December 13, 2017

1University of Trieste, Trieste

2Glance Vision Technologies Srl, Trieste

Sources

• Course: Li Fei-Fei, Andrej Karpathy, Justin Johnson, and Serena Yeung. CS n: ConvolutionalNeural Networks for Visual Recognition.Stanford University, 2017.http://cs231n.stanford.edu/

• Book: Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.MIT Press, 2016.http://www.deeplearningbook.org [freely available online]

2

http://cs231n.stanford.edu/

http://www.deeplearningbook.org

Introduction

Definitions

• Machine Learning: the science of getting computers to act without being explicitlyprogrammed.

• Learn from data• Make predictions from “similar” data

• Computer Vision: the science of getting the machines “see”. Computer Vision is concerned withthe automatic extraction, analysis and understanding of useful information from a singleimage or a sequence of images.

3

Examples: Detection

Face detection, (Viola and Jones, 2001)

4

Examples: Recognition

SIFT and object recognition, (Lowe, 1999)

Right whale recognition, Kaggle Challenge,https://www.kaggle.com/c/noaa-right-whale-recognition

5

https://www.kaggle.com/c/noaa-right-whale-recognition

https://www.kaggle.com/c/noaa-right-whale-recognition

Examples: Segmentation

(Mnih and Hinton, 2010)

(Ronneberger et al., 2015)

6

Examples: Classification

ImageNet (www.image-net.org)• 14M images• 22K categories

• Animal (fish, bird, mammal, invertebrate)• Plant (tree, flower, vegetable)• Instrumentation (utensile, appliance, tool, musical

instrument)• …

• annual ImageNet Large Scale Visual RecognitionChallenge (ILSVRC)

• object detection within an image may be obtained byapplying image classification to many sub-images.

7

www.image-net.org

Image Classification: the semantic gap

What the computer sees

Class?• dog• hammer• guitar• cat• dolphin• …

8

Image Classification: the challenges

Illumination Occlusion

Deformation

Background Intraclass variation

9

A summer project

10

Solutions i

(Marr, 1982)

(Marr, 1982)

11

Solutions ii

“Pictorial structure” - (Fischer and Elschlager, 1973)

• ad-hoc detectors, tailored for the different parts• ad-hoc relation between parts• detection is formulated as an optimization problem:

• maximize the local matching• minimize the springs’ stretching

12

A data-driven solution: nearest neighbor

A simple idea:

• store all the N pairs of training images and labels (Ij, yj), j = 1 . . .N• define a distance between images d(I1, I2) : I × I −→ R+

• when presented a test image Itest, output label yj where

j = argminj=1...N

d(Itest, Ij)

13

Distances

Examples of distances (Ipj is the p-th pixel of image Ij)

• L1 distance (sum of absolute difference of corresponding pixels)

d(I1, I2) =∑

p|Ip1 − Ip2 |

• L2 distance (Euclidean distance)

d(I1, I2) =

√

∑

p(Ip1 − Ip2)2

14

Training and predicting

training ≡ collecting examples

• prediction is computationally demanding• need to compare the test instance to all the

collected examples

15

Example of results

CIFAR-10 dataset• 10 labels• 50K training images• 10K test images

First 10 neighbors of the test sample (leftmostcolumn)

• perceptually similar• class is often wrong

16

Problems may arise

• pixel-wise distances usually perform poorly• perceptually similar images may represent very different objects• perceptually different images may represent the same class of objects• the distance itself may not explain perceptual differences:

original translated messed up darkened

Same L2 distance from the original image.

17

Linear classifier

• N images, each containing D pixels• K classes• the training set is

T ={

(xi, yi), xi ∈ RD, yi ∈ {1 . . . K}, i = 1 . . .N

}

• define a function f : RD −→ RK

f(xi,W, b) = Wxi + b

• W: weights matrix• b: bias• fj (the j-th component of f), represents the score of class j

• predicted label for test image xtest is

argmaxj

fj(xtest,W, b)

18

Interpretation

19

Training a linear classifier

• need a principled way to select θ = (W, b)• choose θ in order to minimize a properly defined loss:

• define the loss as

J(θ) = 1N∑

iL(f(xi, θ), yi)

︸︷︷︸

empirical loss

+ λR(θ)︸︷︷︸

regulatization loss

• many possible choices for L and R, e.g.

L(f(xi, θ), yi) =∑

j =yi

positive if wrong class hashigher score than true class

︷︸︸︷

max(0, fj(xi, θ)︸︷︷︸

score ofwrong class

− fyi (xi, θ)︸︷︷︸

score oftrue class

)

R(θ) =∑

k

∑

lW2

kl

• training becomes anoptimization problem:

minθ

J(θ)

20

A step forward: features

• So far:

inputimage Classifier label

• Feature-based approach:

inputimage Feature extractor Classifier label

feat

ure

vect

or21

Histogram of oriented gradients

Histogram of oriented gradients(HoG) (Dalal and Triggs, 2005)

• Outline:• image divided in a grid of cells• HoG computed for each cell• the descriptor is the input of a linear classifier (SVM)

• Many parameters to be tuned:• spatial band of filters• # of angular bins• # of spatial bins• hyperparameters of the SVM

• Some heuristics:• histogram smoothing• local histogram normalization

22

Deformable part model

Deformable part model (Felzenszwalb et al., 2008)

• a data-driven version of (Fischer and Elschlager, 1973)• objects, parts and relationships between parts are learned from data• multiscale:

• “root object” at coarse scale• parts at fine scale

• each part contributes to the final score• many parameters to be tuned 23

Bag of words

Extraction of the visual vocabulary

Training/inference

• aim: find a “standardized”description of the image content

• local descriptors are extractedfrom the training set:

• local patches• SIFT (Lowe, 2004)• …

• clustering produces “visual words”• each class is characterized by a

histogram of visual words• classification becomes a

“comparison of histograms”

24

Convolution

• most local image features, for instance thedirectional derivatives, can be detected bylinear spatial filtering.

• convolution operator:

g(i, j) =∑

k,l

f(i + k, j + l)h(k, l)

original vertical Prewitt filtered:[

−1 0 1−1 0 1−1 0 1

]

horizontal Prewitt filtered:[

−1 −1 −10 0 01 1 1

]

gradient magnitude√

(

∂I∂x)2

+(

∂I∂y

)2

25

A bunch of keywords

26

Convolutional Neural Networks

• In 2012, a Convolutional Neural Network (CNN) won the ImageNet challenge by a wide margin:

Imagenet classification with deep convolutional neural networks, (Krizhevsky et al., 2012)

• a CNN is an Artificial Neural Networks (ANN) having a particular architecture, developedspecifically for dealing with images

• nowadays, CNNs are employed even for dealing with information other than images• in the last few years, the term “Deep Learning”” has become popular to identify the approach

to machine learning based on ANN having many layers (“deep”)

27

Artificial Neural Networks

A step backward: Artificial Neural Networks I

• a network of artificial neurons (McCulloch and Pitts, 1943)• biological analogy (with caution)• the output of a neuron is obtained by applying an activation function to the weighted sum of

its inputs

28

A step backward: Artificial Neural Networks II

The perceptron Frank Rosenblatt (1928-1971)

• first trainable Artificial Neural Network: Rosenblatt’s Perceptron (Rosenblatt, 1958)• basic procedure:

• design the architecture (# layers, topology of connections)• choose the (possibly different) activation functions• learn the weights (based on the provided examples)

29

Architectures

Fully connected, 1 hidden layer Fully connected, 2 hidden layers

• left:• 4 + 2 = 6 neurons (input layer is disregarded)• 3× 4 + 4× 2 weights (w) and 4 + 2 biases (b)⇒ 26 parameters

• right:• 4 + 4 + 1 = 9 neurons• 3× 4 + 4× 4 + 4× 1 weights (w) and 4 + 4 + 1 biases (b)⇒ 41 parameters

• a typical CNN has million of parameters (e.g. VGG ha 138M parameters)

30

Activation functions

• output of a neuron

f(∑

i

wixi + b)

• f(·) : R −→ R is the activation function• many possible choices, e.g.

• sigmoidf(x) = 1

1 + e−x

• hyperbolic tangent

f(x) = tanh(x)

• rectified linear unit (ReLU)

f(x) = max {0, x}

• ReLU is currently preferred because• is fast to compute• guarantees faster convergence

sigmoid hyperbolic tangent

ReLU ReLU vs hyperbolic tangent (Krizhevskyet al., 2012)

31

Training as an optimization problem

• N instances, each of dimension D, and K classes• the training set is

T ={

(xi, yi), xi ∈ RD, yi ∈ 1 . . . K, i = 1 . . .N

}

• let the output of the network be

f(x, θ) : RD −→ RK

• define the loss as

J(θ) = 1N∑

i

L(f(xi, θ), yi) + λR(θ)

• many possible choices for L and R, e.g.

L(f(xi, θ), yi) =∑

j=yi

max(0, fj(xi, θ)− fyi(xi, θ) + ∆)

R(θ) =∑

k

∑

l

W2kl

• minimize the loss by a properchoice of θ = (W, b)

• iterative minimization• the instances are presented

repeatedly to the network• each cycle of presentation is

called an “epoch”• the weights and biases are

adapted in such a way to modifythe actual output toward thedesired output

• the most popular training methodsare based on the so calledback-propagation

32

Weight update by gradient descent

• at each step of iteration:• compute an estimate g of the gradient ∇J(θ)• update the weight vector:

θ ← θ − ϵg

• ϵ is called learning rate and is one of the mostimportant learning parameters

33

Stochastic Gradient Descent

• the gradient of the loss function is

∇J(θ) = 1N∇

∑

i

L(f(xi, θ), yi)

• at each iteration, for computing ∇J(θ), the evaluationof the whole training data is required→ very slow forlarge N

• Stochastic Gradient Descent (SGD): at each iteration• sample a minibatch of m examples {x1, . . . , xm} and the

corresponding yi• estimate the gradient as

g =1m∇

∑

iL(f(xi, θ), yi)

• apply the updateθ ← θ − ϵg

• Size of minibatch:• m = 1: gradient direction is

unreliable (based on a singleexample)

• m = N: precise gradient direction,but slow computation

• typical choices for m are 32, 64, 128,to exploit the GPU

34

Improved weight update

• simple update:θ ← θ − ϵg

• momentum:{

v ← αv− ϵgθ ← θ + v

• global adaption of learning rate (“annealing”):• step decay: Reduce the learning rate by some factor

every few epochs.• exponential decay:

ϵ = ϵ0e−kt

• per-parameter learning rate adaption:• AdaGrad• RMSProp• Adam (the most recommended, to date)

35

Computing the gradient ∇J(θ)

• how can we actually compute the gradient ∇J(θ)?• the ANN can be thought of as the composition of many functions:

output = H(input) = [hn ◦ hn−1 ◦ · · · ◦ h0] (input)

• apply the chain rule:(f ◦ g)′(x) = f′(g(x))g′(x)

“the derivative of a composition of functions is the product of the derivatives”

36

Backpropagation

Example of backpropagation for a neuron with sigmoid activation function: σ =1

1 + e−(w0x0+w1x1+w2)

• forward step (from input to output, green): sets the “state” and output of the networkcorresponding to a particular input

• backward step (reverse order, red): by applying the chain rule, computes the derivative of theoutput w.r.t. the input for the given state

• s = 2× (−1) + (−3)× (−2) + (−3) = 1⇒ (1 + e−s)−1 = 0.731• s = (2−0.2)× (−1) + (−3−0.39)× (−2) + (−3+0.2) = 2.18⇒ (1 + e−s)−1 = 0.898← increased

37


Example of CNN: a sequence of convolutional layers, pool layers and a final fully connected layer

38

Convolution Layers i

slide over

all spatial locations

• the 28× 28× 1 “image” on the right is called activation map

• each element of the activation map is obtained by computing wTx + b where w and b are theweights and bias of the filter, and x is a 5× 5× 3 chunk of the image

39

Convolution Layers ii

slide 6 filters over

all spatial locations

• typically, a bank of filters is employed

• using e.g. 6 filters, leads to a 28× 28× 6 activation map

40

Convolution Layers iii

…CONV,

ReLU

CONV,

ReLU

CONV,

ReLU

• a ConvNet is a sequence of convolution layers, interspersed with activation functions

41

Convolution Layers iv

Benefits due to convolutional layers:• sparse connectivity• parameter sharing• equivariance to translation

f(T(x)) = T(f(x))

“activation maps of translated image aretranslated activation maps of the originalimage”

Sparse connectivity (top) vs full connectivity (bottom)

Parameter sharing (top) vs individual parameters (bottom)

42

Pooling layer

Pooling principle Max pooling

• pooling layer downsamples the volume spatially, independently in each depth slice of theinput volume

• the most common pooling operator is max

• also used: average pooling

43

AlexNet

Imagenet classification with deep convolutional neural networks, (Krizhevsky et al., 2012)

• ILSVRC 2012 winner (15.3% top five error)

Convolutional kernels learned from the first convolutional layer of AlexNet

44

Feature visualization i

• aim: understanding how a Convnetworks

• are the activation maps and layers“tuned” to specific patterns?

• which input pattern activates aspecific activation map of aspecific layer?

• “Deconvnet” (Zeiler and Fergus,2014) maps features to pixels(while a convolutional networkdoes the opposite)

A deconvnet layer (left) attached to a convnet layer (right), (Zeiler and Fergus, 2014)

45

Feature visualization ii

Layers 1-3 Layers 4-5

Random subsets of feature maps and corresponding image patches

• layer 2: low level features• layer 3: textures (mainly)

• layer 4: more class-specific• layer 5: entire objects

46

Feature visualization iii

inputimage

Low-levelfeatures

Mid-levelfeatures

High-levelfeatures Classifier label

47

Case studies

Case study: VGGNet

• systematic evaluation of many structures• winner:

• only 3× 3 convolution kernels• stride 1• pad 1• only 2× 2 max pool layers

VGGNet (Simonyan and Zisserman, 2014)

48

Case study: GoogLe Net

GoogLe Net (Szegedy et al., 2015)

• ILSVRC 2014 winner (6.7% top five error)• inception modules:

• “network within a network”• filters with different spatial support in the

same layer

• auxiliary intermediate classifiers (ignored atinference time) Inception module

49

Case study: ResNet

ResNet (He et al., 2016)

ILSVRC 2015 winner (3.6% top five error)

50

Training deep networks

• network depth is of crucial importance• results on ImageNet competition seem to

suggest that ”the deeper, the better” BUT• when the net is “too deep” a degradation

phenomen occurs• the degradation is NOT caused by overfitting

• deeper structures are inherently moredifficult to train

• “proof”:• take a shallow network• add some layers having an equal amount of

inputs and outputs (i.e. layers that couldimplement an identity mapping)

• train both the networks on the same data• the deeper network should produce no

higher training error than its shallowercounterpart, but experiments with currentsolvers show that this is not the case

51

Learning the residuals

• let H(x) be the whole transformation fromtop to bottom

• instead of learning H(x), write

H(x) = x + F(x)

and learn the residuals F(x)

• ResNet blocks learn variations from theidentity

Typical blocks. With a single layer the authors did not observe advantages.

52

Results

• thin curves: training error, bold curves: validation error.• for ResNet, deeper networks show better performance as desired• ResNet-34 has 3.6 billion FLOPS• VGG-19 has 19.6 billion FLOPS

53

Transfer learning

Transfer learning I

inputimage

Low-levelfeatures

Mid-levelfeatures

High-levelfeatures Classifier label

• features may be significant per se (especially features at lower levels)• idea: take a pre-trained network and retrain the classifier only

54

Transfer learning II

1. train on ImageNet 2. for small dataset:treat the CNN as afixed feature extractorand train the classifieronly

3. for medium-sizeddataset: use the old weightsas initialization and retraina bigger portion of the net

55

Example: transfer learning with VGG

• transfer learning based on VGG net• took the output of the last hidden

layer (4096 elements)• trained a linear Support Vector

Machine

56

CVPR course

• Course on Computer Vision and Pattern Recognition, starting September, 2018

SEE YOU IN SEPTEMBER!

57

References i

Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In ComputerVision and Pattern Recognition, 5. CVPR 5. IEEE Computer Society Conference on, pages886–893. IEEE, 2005.

Li Fei-Fei, Andrej Karpathy, Justin Johnson, and Serena Yeung. CS n: Convolutional NeuralNetworks for Visual Recognition. Stanford University, 2017. http://cs231n.stanford.edu/.

Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale,deformable part model. In Computer Vision and Pattern Recognition, 8. CVPR 8. IEEEConference on, pages 1–8. IEEE, 2008.

Martin Fischer and Robert Elschlager. The representation and matching of pictorial structure. IEEETrans. Comput, 1:67–92, 1973.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.http://www.deeplearningbook.org.

58

http://cs231n.stanford.edu/

http://www.deeplearningbook.org

References ii

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for imagerecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,pages 770–778, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deepconvolutional neural networks. In Advances in neural information processing systems, pages1097–1105, 2012.

David G Lowe. Object recognition from local scale-invariant features. In Computer Vision, 999. Theproceedings of the seventh IEEE International Conference on, volume 2, pages 1150–1157. Ieee,1999.

David G Lowe. Distinctive image features from scale-invariant keypoints. International journal ofcomputer vision, 60(2):91–110, 2004.

David Marr. Vision: A computational investigation into the human representation and processing ofvisual information. WH San Francisco: Freeman and Company, 1982.

59

References iii

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.The bulletin of mathematical biophysics, 5(4):115–133, 1943.

Volodymyr Mnih and Geoffrey E Hinton. Learning to detect roads in high-resolution aerial images.In European Conference on Computer Vision, pages 210–223. Springer, 2010.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedicalimage segmentation. In International Conference on Medical Image Computing andComputer-Assisted Intervention, pages 234–241. Springer, 2015.

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization inthe brain. Psychological review, 65(6):386, 1958.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.CoRR, abs/1409.1556, 2014.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, DumitruErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), June 2015.

60

References iv

Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features.In Computer Vision and Pattern Recognition, . CVPR . Proceedings of the IEEEComputer Society Conference on, volume 1, pages I–I. IEEE, 2001.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. InEuropean conference on computer vision, pages 818–833. Springer, 2014.

61

machi e leaØ i g a d c¡ ÍċöeØ...

Documents