


CAS in Teaching Basics of Statistical Learning

Aleksandr Mylläri, Tatiana Mylläri

Dept. of Computers & Technology, School of Arts and Sciences

St. George’s University, St. George’s, Grenada, West Indies

[email protected]

Abstract. Visual demonstrations provide a convenient way to illustrate the work of algorithms and help students to understand them. Modern Computer Algebra Systems (such as Mathematica or Maple) provide not only opportunities for analytic and numerical calculations, but also convenient tools for solving optimization problems, constructing images and creating animations. This makes CAS a good resource to use in the classroom, since the code is usually compact and clear, allowing students to make changes and experiment easily. We model the work of Rosenblatt’s perceptron and simple Support Vector Machines (SVMs) using Wolfram Mathematica. The constructed models are used to demonstrate basic concepts and ideas in introductory courses on SVMs and Statistical Learning.

1 Introduction

Modern systems of Computer Algebra, such as Mathematica or Maple, provide good facilities for simulations and visualization. Visual demonstrations provide a convenient way to show the work of algorithms and help in understanding them.

We use Support Vector Machines (SVMs) to make an introduction to Statistical Learning Theory. As an example, we consider the problem of binary classification. SVMs are attractive from the educational point of view since it is easy to introduce them stepwise, from the simple perceptron (in the linearly separable case) to more advanced classifiers. We start with the perceptron and generalize it to a maximum margin classifier; then we introduce the kernel trick, which allows generalization to non-linearly separable cases; and finally we consider soft margin classifiers (accepting misclassifications during the training stage). The constructed models are used in introductory courses on SVMs and Statistical Learning; we use visualization as an explanatory tool to illustrate the work of the algorithms in Mathematica.

Our introductory course on Machine Learning and Support Vector Machines (SVMs) is based on the classic books by N. Cristianini and J. Shawe-Taylor [1] and B. Schölkopf and A.J. Smola [2]. Here, we illustrate the work of the algorithms; detailed descriptions and explanations can be found in [1] and [2].


Figure 1: An example of linearly separable data (to use with the perceptron algorithm).

SVMs, among other Machine Learning algorithms, have been included in Mathematica since version 10 [3]; there were also implementations of SVMs for earlier versions of Mathematica (see, e.g., [4], [5]). Though the classifiers implemented in Mathematica are very efficient, we build our own simple classifiers and experiments, because we feel that our primary goal, students' understanding of how the algorithms work, is best served by practical simulations.

The binary classification problem is one of the basic problems of machine learning. We consider supervised learning: the case when the learning machine is given a training set of examples with associated labels (usually +1 and -1). We use X to denote the input space and Y to denote the output domain. Here X \subseteq R^{2} ; for binary classification Y=\{-1,1\}.

2 Training data

A training set is a collection of training examples (also called training data). It is usually denoted by

S=((x_{1}, y_{1}), \ldots, (x_{l}, y_{l}))\subseteq(X\times Y)^{l},

where l is the number of examples. We refer to the x_{i} as examples and the y_{i} as their labels.

Mathematica, as well as other Computer Algebra systems, easily allows one to generate data with desired properties, such as a linearly separable case with a clear margin (Fig. 1), non-linearly separable data (Fig. 2), noisy data with some cases mislabelled, etc. The same data-generation algorithms can be used to generate test data sets with specified distributions to check the classifier obtained.
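
As a minimal sketch (the sample sizes, distributions, and variable names below are our own illustrative choices, not prescribed in the text), such data sets can be generated in a few lines of Mathematica:

  (* linearly separable clouds with a clear margin, as in Fig. 1 *)
  n = 20;
  posPts = RandomReal[{0.6, 1.0}, {n, 2}];    (* class +1 *)
  negPts = RandomReal[{0.0, 0.4}, {n, 2}];    (* class -1 *)
  trainLin = Join[({#, 1} & /@ posPts), ({#, -1} & /@ negPts)];
  ListPlot[{posPts, negPts}, PlotStyle -> {Blue, Red}, AspectRatio -> 1]

  (* ring-shaped, non-linearly separable data, as in Fig. 2 *)
  inner = Table[With[{r = RandomReal[{0., 0.5}], t = RandomReal[{0., 2 Pi}]},
      r {Cos[t], Sin[t]}], {n}];
  outer = Table[With[{r = RandomReal[{1.0, 1.5}], t = RandomReal[{0., 2 Pi}]},
      r {Cos[t], Sin[t]}], {n}];
  trainRing = Join[({#, -1} & /@ inner), ({#, 1} & /@ outer)];
  ListPlot[{inner, outer}, PlotStyle -> {Red, Blue}, AspectRatio -> 1]

The same construction with noise added, or with a fraction of the labels flipped, gives the noisy data sets mentioned above.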

3 Classification

Binary classification is frequently performed by using a real-valued function f : X\subseteq R^{n} \rightarrow R in the following way: the input x=(x_{1}, \ldots, x_{n})' is assigned to the positive class if f(x)\geq 0 , and to the negative class otherwise.


Figure 2: An example of non‐linearly separable data.

The simplest case is when f(x) is a linear function of x\in X , so that it can be written as

f(x)=\langle w\cdot x\rangle+b=\sum_{i=1}^{n}w_{i}x_{i}+b,

where (w, b)\in R^{n}\times R are the parameters that control the function, and the decision rule is given by sgn(f(x)) , where we use the convention that sgn(0)=1 . The learning methodology implies that these parameters must be learned from the data.
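
In Mathematica this decision rule is a one-liner; the values of w and b below are placeholders chosen only to illustrate the convention sgn(0) = 1, not learned parameters:

  w = {1.0, 1.0}; b = -1.0;                (* illustrative parameters only *)
  f[x_] := w.x + b;
  classify[x_] := If[f[x] >= 0, 1, -1];    (* sgn with the convention sgn(0) = 1 *)
  classify[{0.8, 0.9}]                     (* returns 1 *)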

A geometric interpretation is that the input space X is split into two parts by the hyperplane defined by the equation \langle w\cdot x\rangle+b=0 (see Figure 3). In the simple case we use for visualization, we deal with points on a plane, and the hyperplane is a straight line which divides the plane into two halves corresponding to the inputs of the two distinct classes. In Figure 3, the positive region is above and the negative region below the line.

As was said in the Introduction, we follow a basic sequence of steps: we start with linearly separable data and a perceptron; continue with a maximum margin classifier; generalize it to the non-linear (separable) case; and then generalize to the non-separable case (noisy data). We do not intend to give detailed descriptions of the algorithms and methods here. The formulae and descriptions below are only provided to demonstrate that SVMs can be easily programmed in Mathematica.

3.1 Perceptron

The first iterative algorithm for learning linear classifications is the perceptron, proposed by Frank Rosenblatt in 1956. It is an ‘on-line’ and ‘mistake-driven’ procedure, which starts with an initial weight vector w_{0} (usually w_{0}=0 , the all-zero vector) and adapts it each time a training point is misclassified by the current weights. The perceptron algorithm updates the weight vector w and bias b directly. This procedure is guaranteed to converge provided there exists a hyperplane that correctly classifies the training data (i.e. the data are linearly separable). If no such hyperplane exists, the data are said to be nonseparable. An example of the resulting classifier for the data from Figure 1 is shown in Figure 3.
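
A minimal sketch of the perceptron update loop follows; the learning rate eta, the maximum number of passes, and the data set trainLin from the sketch in Section 2 are our own illustrative choices:

  perceptron[train_, eta_ : 1.0, maxPasses_ : 100] :=
    Module[{w = ConstantArray[0., Length[train[[1, 1]]]], b = 0., changed},
      Do[
        changed = False;
        Do[
          With[{x = pt[[1]], y = pt[[2]]},
            If[y (w.x + b) <= 0,            (* point misclassified: adapt w and b *)
              w += eta y x; b += eta y; changed = True]],
          {pt, train}];
        If[! changed, Break[]],             (* a full pass without mistakes: stop *)
        {maxPasses}];
      {w, b}]

  {wP, bP} = perceptron[trainLin];          (* learned weight vector and bias *)

For linearly separable data the loop terminates after finitely many updates; for nonseparable data it stops after maxPasses passes without converging.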


Figure 3: Hyperplane (line in this case) found by the perceptron algorithm.

Figure 4: Margin of the data from Fig. 1.

3.2 Maximum margin classifier

The number of iterations performed by the perceptron algorithm depends on a quantity called the margin. This quantity plays a central role in SVM theory, so we give a more formal definition [1]:

Definition. The (functional) margin of an example (x_{i}, y_{i}) with respect to a hyperplane (w, b) is the quantity

\gamma_{i}=y_{i}(\langle w\cdot x_{i}\rangle+b) .

Note that \gamma_{i}>0 implies correct classification of (x_{i}, y_{i}) . If we replace the functional margin by the geometric margin, we obtain the equivalent quantity for the normalized linear function ( \frac{1}{\Vert w\Vert}w, \frac{1}{\Vert w\Vert}b) , which measures the Euclidean distances of the points from the decision boundary in the input space. The margin of a training set S is the maximum geometric margin over all hyperplanes (straight lines in our case). A hyperplane realizing this maximum is called a maximal margin hyperplane. The size of its margin will be positive for a linearly separable training set. Figure 4 shows the margin of a training set, while Fig. 5 compares the perceptron with the maximum margin classifier.
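
For a fixed hyperplane the functional and geometric margins are straightforward to compute; wP, bP and trainLin below refer to the illustrative perceptron sketch above:

  functionalMargins = (#[[2]] (wP.#[[1]] + bP)) & /@ trainLin;
  geomMargin = Min[functionalMargins]/Norm[wP]   (* positive iff every point is classified correctly *)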

Having our (linear) classifier in the form f(x)=\langle w\cdot x\rangle+b , we are free to scale the parameters w and b .


Figure 5: Perceptron (Fig. 3) vs. maximum margin.

By requiring that this scaling be such that the point(s) closest to the hyperplane satisfy |\langle w, x_{i}\rangle+b|=1 , we obtain its canonical form (w, b) by solving the following optimization problem [2]:

minimize_{w \in R^{n}, b\in R}  T(w)=\frac{1}{2}\Vert w\Vert^{2},

subject to  y_{i}(\langle x_{i}, w\rangle+b) \geq 1  for all i=1, \ldots, m.

In practice, it is convenient to solve the dual optimization problem, which is derived by introducing a Lagrangian:

L(w, b, \alpha)=\frac{1}{2}\Vert w\Vert^{2}-\sum_{i=1}^{m}\alpha_{i}(y_{i}(\langle x_{i}, w\rangle+b)-1)   (1)

with Lagrange multipliers \alpha_{i}\geq 0 . The Lagrangian L must be maximized with respect to \alpha_{i} , and minimized with respect to w and b . Setting the derivatives of L with respect to b and w to zero gives

\sum_{i=1}^{m}\alpha_{i}y_{i}=0,

and

w=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}.

The patterns x_{i} for which \alpha_{i}>0 are called Support Vectors.

The dual optimization problem takes the form

maximize_{\alpha\in R^{m}}  W(\alpha)=\sum_{i=1}^{m}\alpha_{i}-\frac{1}{2}\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}\langle x_{i}, x_{j}\rangle,

subject to  \alpha_{i}\geq 0,  i=1, \ldots, m,

and \sum_{i=1}^{m}\alpha_{i}y_{i}=0.


Figure 6: Data from Fig. 2 after nonlinear transformation.

This optimization problem can be easily solved in Mathematica by using Maximize. The classifier takes the form

f(x)= sgn \left( \sum_{i=1}^{m}\alpha_{i}y_{i}\langle x, x_{i}\rangle+b\right).
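
A sketch of solving this dual in Mathematica follows; trainLin is the illustrative data set from Section 2, the tolerance 10^-6 is our own choice, and NMaximize can be substituted for Maximize for speed on machine-precision data:

  xs = trainLin[[All, 1]]; ys = trainLin[[All, 2]]; m = Length[xs];
  alphas = Array[a, m];                                      (* a[1], ..., a[m] *)
  objective = Total[alphas] -
      1/2 Sum[a[i] a[j] ys[[i]] ys[[j]] xs[[i]].xs[[j]], {i, m}, {j, m}];
  constraints = Append[Thread[alphas >= 0], Sum[a[i] ys[[i]], {i, m}] == 0];
  sol = Maximize[{objective, constraints}, alphas];
  alphaVals = alphas /. sol[[2]];
  wSVM = Sum[alphaVals[[i]] ys[[i]] xs[[i]], {i, m}];        (* w = sum_i alpha_i y_i x_i *)
  sv = First@Flatten@Position[alphaVals, _?(# > 10^-6 &)];   (* index of one support vector *)
  bSVM = ys[[sv]] - wSVM.xs[[sv]];                           (* from y_sv(<w, x_sv> + b) = 1 *)
  svmClassify[x_] := If[wSVM.x + bSVM >= 0, 1, -1]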

3.3 Non-linear case

Consider the case shown in Figure 2. The data are not linearly separable, but if we map the data nonlinearly into a higher-dimensional space, they may become linearly separable. In the case of Figure 2, the mapping (x_{1}, x_{2})\rightarrow(x_{1}^{2}, \sqrt{2}x_{1}x_{2}, x_{2}^{2}) will do the trick (see Figure 6).
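
A quick way to see this in Mathematica (inner and outer are the ring-shaped point lists from the sketch in Section 2) is to apply the map and plot the images in 3D, which gives a picture analogous to Fig. 6:

  phi[{x1_, x2_}] := {x1^2, Sqrt[2] x1 x2, x2^2};   (* the nonlinear map quoted above *)
  ListPointPlot3D[{phi /@ inner, phi /@ outer}, PlotStyle -> {Red, Blue}]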

Let us note that in the formulation of the dual optimization problem and of the final classifier, the training data are present only in the form of dot products. So, if we make a nonlinear mapping \Phi , in the target (feature) space we will be interested only in the dot product \langle\Phi(x), \Phi(x_{i})\rangle . Let us introduce a (positive definite, see [2] for details) kernel function k(x, x_{i}) :

\langle\Phi(x), \Phi(x_{i})\rangle=k(x, x_{i}).

The optimization problem to solve will take the form

maximize_{\alpha\in R^{m}}  W(\alpha)=\sum_{i=1}^{m}\alpha_{i}-\frac{1}{2}\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}k(x_{i}, x_{j}),

subject to  \alpha_{i}\geq 0,  i=1, \ldots, m,

and \sum_{i=1}^{m}\alpha_{i}y_{i}=0.

The final classifier is

f(x)= sgn \left( \sum_{i=1}^{m}y_{i}\alpha_{i}k(x, x_{i})+b\right).


Figure 7: Classifier with polynomial kernel.

Again, this optimization problem can be easily solved in Mathematica by using Maximize.
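
Compared with the linear sketch above, the only change is that the dot product xs[[i]].xs[[j]] is replaced by a kernel. The kernels below are sketches only: the paper does not specify its kernel parameters, so the degree 2 and width 0.5 are our own choices, and trainRing is the ring-shaped data set from Section 2:

  polyK[x_, z_] := (x.z + 1)^2;                        (* polynomial kernel of degree 2 *)
  sigma = 0.5;                                         (* kernel width, our choice *)
  gaussK[x_, z_] := Exp[-(x - z).(x - z)/(2 sigma^2)]; (* Gaussian (exponential-type) kernel *)

  xs = trainRing[[All, 1]]; ys = trainRing[[All, 2]]; m = Length[xs];
  alphas = Array[a, m];
  objectiveK = Total[alphas] -
      1/2 Sum[a[i] a[j] ys[[i]] ys[[j]] polyK[xs[[i]], xs[[j]]], {i, m}, {j, m}];
  constraintsK = Append[Thread[alphas >= 0], Sum[a[i] ys[[i]], {i, m}] == 0];
  solK = Maximize[{objectiveK, constraintsK}, alphas];
  alphaVals = alphas /. solK[[2]];
  sv = First@Flatten@Position[alphaVals, _?(# > 10^-6 &)];   (* one support vector *)
  bK = ys[[sv]] - Sum[alphaVals[[i]] ys[[i]] polyK[xs[[sv]], xs[[i]]], {i, m}];
  kernelClassify[x_] := If[Sum[alphaVals[[i]] ys[[i]] polyK[x, xs[[i]]], {i, m}] + bK >= 0, 1, -1]

Swapping polyK for gaussK in the objective, the threshold and the classifier gives a radial-kernel classifier similar in spirit to the one shown in Fig. 8.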

Examples of such classifiers are given in Figure 7 (for the data shown in Figure 2, using a polynomial kernel function) and Figure 8 (using an exponential kernel function).

In practice, a separating hyperplane may not exist, e.g. the data could be noisy, with outliers, or some training examples could be mislabelled. So, it would be good to have an algorithm that can deal with such cases. One approach is to introduce so-called slack variables, \xi_{i}\geq 0 , where i=1, \ldots, m , and use the relaxed separation constraints

y_{i}(\langle x_{i}, w\rangle+b)\geq 1-\xi_{i},  i=1, \ldots, m.

Obviously, by making \xi_{i} large enough, the constraints on (x_{i}, y_{i}) can always be met. To avoid this trivial solution with large values of \xi_{i} , a term penalizing large values of \xi_{i} can be added to the optimization problem. In the simplest case of the so-called C-SV classifier (for some C>0 ), the optimization problem takes the form:

minimize_{w \in H, \xi\in R^{m}}  T(w,\xi)=\frac{1}{2}\Vert w\Vert^{2}+\frac{C}{m}\sum_{i=1}^{m}\xi_{i},

subject to  \xi_{i}\geq 0,  i=1, \ldots, m,

and  y_{i}(\langle x_{i}, w\rangle+b)\geq 1-\xi_{i},  i=1, \ldots, m.

The solution, as in the linearly separable case, will have the form

w=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}.


Figure 8: Classifier with exponential kernel.

The coefficients \alpha_{i} can be found by solving the corresponding dual (quadratic) optimization problem:

maximize_{\alpha\in R^{m}}  W(\alpha)=\sum_{i=1}^{m}\alpha_{i}-\frac{1}{2}\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}k(x_{i}, x_{j}),

subject to  0 \leq\alpha_{i}\leq\frac{C}{m}  for all i=1, \ldots, m,

and \sum_{i=1}^{m}\alpha_{i}y_{i}=0.

As in the separable case, this optimization problem can be easily solved in Mathematica by using Maximize. The threshold b can be obtained by averaging

b=y_{j}- \sum_{i=1}^{m}y_{i}\alpha_{i}k(x_{j}, x_{i})

over all points x_{j} with \alpha_{j}>0 , in other words, over all support vectors. An example of the work of a soft-margin classifier is given in Figure 9.
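
A sketch of this soft-margin dual in Mathematica follows; it reuses xs, ys, m, alphas and the kernel polyK from the kernel sketch above, and the value of the regularization constant (named CC here because C is a protected built-in symbol) is our own illustrative choice:

  CC = 10.0;                                                  (* regularization constant C *)
  constraintsSoft = Join[Thread[0 <= alphas], Thread[alphas <= CC/m],
      {Sum[a[i] ys[[i]], {i, m}] == 0}];
  solSoft = Maximize[{objectiveK, constraintsSoft}, alphas];
  alphaVals = alphas /. solSoft[[2]];
  svIdx = Flatten[Position[alphaVals, _?(# > 10^-6 &)]];      (* indices of the support vectors *)
  bSoft = Mean[Table[ys[[j]] - Sum[alphaVals[[i]] ys[[i]] polyK[xs[[j]], xs[[i]]], {i, m}],
      {j, svIdx}]];                                           (* threshold averaged over the SVs *)
  softClassify[x_] := If[Sum[alphaVals[[i]] ys[[i]] polyK[x, xs[[i]]], {i, m}] + bSoft >= 0, 1, -1]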

4 Conclusions

Mathematica provides good facilities for modeling and visualization of SVMs. Built-in methods for solving optimization problems, the compactness of the Mathematica code, and good graphics make experimentation easy. Thus, students can generate training and test data easily. As an option, advanced students can experiment with writing their own optimization problem solvers.


Figure 9: Example of training a soft-margin classifier.

If the data set is of reasonable size, multiple simulations can be run in moderate time with different parameters, etc. Students can experiment with different kernel functions, observe overfitting/underfitting, etc. A rough estimate of the quality of learning can also be obtained easily by looking at the proportion of non-zero coefficients \alpha_{i} : if all (or too large a fraction) of the training examples are found to be support vectors, overlearning takes place. All of the above-mentioned cases of easily constructed learning examples using Mathematica show it to be a useful resource to use in the classroom for introductory courses on SVMs and Statistical Learning Theory.

References

[1] Shawe-Taylor J., Cristianini N. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press (2000)

[2] Schölkopf B., Smola A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press (2001)

[3] https://www.wolfram.com/mathematica/new-in-10/highly-automated-machine-learning/

[4] Paláncz B., Völgyesi L. Support Vector Classifier via Mathematica. Periodica Polytechnica Civ. Eng., Vol. 48, No. 1-2, pp. 15-37 (2004)

[5] Nilsson R., Björkegren J., Tegnér J. A Flexible Implementation for Support Vector Machines. The Mathematica Journal, Vol. 10, pp. 114-127 (2006)
