When a good fit can be bad

Mark A. Pitt* and In Jae Myung
Dept of Psychology, Ohio State University, 1885 Neil Avenue, Columbus, Ohio 43210-1222, USA.
*e-mail: [email protected]

How should we select among computational models of cognition? Although it is commonplace to measure how well each model fits the data, this is insufficient. Good fits can be misleading because they can result from properties of the model that have nothing to do with it being a close approximation to the cognitive process of interest (e.g. overfitting). Selection methods are introduced that factor in these properties when measuring fit. Their success in outperforming standard goodness-of-fit measures stems from a focus on measuring the generalizability of a model's data-fitting abilities, which should be the goal of model selection.
The explosion of interest in modeling cognitive processes over the past 20 years has fueled the cognitive sciences in many ways. Not only has it opened up new ways of thinking about research problems and possible solutions, but it has also enabled researchers to gain a better understanding of their theories by simulating a computational instantiation of them. Modeling is now sufficiently mainstream that one can get the impression that the models themselves are replacing the theories from which they evolved.
What has not kept pace with the advances and interest in modeling is the development of methods for evaluating and testing the models themselves. A model is not interchangeable with a theory, but only one of many possible quantitative representations of it. A thorough evaluation of a model requires methods that are sensitive to its quantitative form. Criteria used for evaluating theories [1], such as testing their performance in an experimental setting, do not speak to the quality of the choices that are made in building their quantitative counterparts (i.e. choice of parameters, how they are combined) or to the ramifications of those choices. The paucity of such model selection methods is surprising given the centrality of the problem itself. What could be more fundamental than deciding between two alternative explanations of a cognitive process?
How not to compare models
Mathematical models are frequently tested against one another by evaluating how well each fits the data generated in an experiment or simulation. Such a test makes sense given that one criterion of model performance is that it reproduce the data. A goodness-of-fit measure (GOF; see Glossary) is invariably used to measure their adequacy in achieving this goal. What is measured is how much a model's predictions deviate from the observed data [2,3]. The model that provides the best fit (i.e. smallest deviation) is favored. The logic of this choice rests on the assumption that the model that provides the best fit to all data must be a closer approximation to the cognitive process under investigation than its competitors [4].

Such a conclusion would be reasonable if measurements were made in a noise-free (i.e. errorless) system. One of the biggest challenges faced by cognitive scientists is that human and animal data are noisy. Error arises from several sources, such as the imprecision of our measurement tools, variation among participants, and variation in their performance over time.
The problem of random error, and the lengths researchers go to in order to combat it, are evident in the experimental and statistical methods used in the field (e.g. repeated measurement, inferential statistics). These serve to hold error in check so that variation due to the mental process of interest, which is what we are really interested in, will be visible in the data.
Noisy data make GOF by itself a poor method of model selection. As the simulation in Box 1 illustrates, a GOF measure such as the root mean squared error (RMSE) is insensitive to the different sources of variation in the data, whether random error or variation due to the cognitive process of interest. This can result in the selection of a model that overfits the data, which may not be the model that best approximates the cognitive process under study.
How to compare models
Because it is impossible to eliminate error from data, efforts have focused on improving model selection in other ways. The preferred solution has been to redefine the problem as one of assessing how well a model's fit to one data sample generalizes to future samples generated by that same process [5]. GENERALIZABILITY (see Glossary) uses the data and information about the model itself to make a best-guess estimate of how likely it is that the model could have generated the data sample in hand. In this approach, a good fit is a necessary but not sufficient condition for a model, because many models are capable of fitting a data set reasonably well. Because the set of such candidate models is potentially quite large, we can only ever infer the likelihood with which each model under consideration generated the data. In this regard, generalizability is a statistical inference problem. It is no different conceptually from estimating the replicability of an experimental result using inferential statistics, or from inferring the characteristics of a population from a sample.
A handful of generalizability measures have been proposed in the last 30 years [6]. By necessity, all include a measure of GOF to assess a model's fit to the data (see Box 2). Terms encoding information about the model itself are included to level the playing field among models, so that one model, by virtue of its design choices (i.e. mathematical instantiation), does not have an inherent advantage over its competitors in fitting the data best, and thus in being selected. By nullifying such model-specific properties, the model that is the best approximation to the cognitive process under study, and not simply the one that absorbs the most variation in the data, will be selected.
Measures of generalizability
Early measures of generalizability such as the Akaike information criterion (AIC) [7,8] and the Bayesian information criterion (BIC) [9] addressed the most salient difference among models: the number of free parameters. As is generally well known, a model with many free parameters can provide a better fit to a data sample than a model with few parameters, even if the latter generated the data. The second term in these measures includes a count of the number of parameters (k). AIC and BIC penalize a model more as the number of parameters increases. To be selected, the model with more parameters must overcome this penalty and provide a substantially better fit than a model with fewer parameters. That is, the superior fit obtained with the extra parameters must justify the necessity of those parameters in fully capturing the cognitive process.
An equally salient but much less tangible dimension along which models differ is their functional form, which refers to the way in which the parameters are combined in a model's equation. More sophisticated selection methods, such as Bayesian model selection (BMS) [10] and MINIMUM DESCRIPTION LENGTH (MDL) [11,12], are sensitive to a model's functional form as well as to the number of parameters. Functional form is taken into account in the third term of the MDL measure. In BMS, both are hidden in the integral. (Cross-validation, although not listed, is another selection method that is thought to be sensitive to both dimensions of model COMPLEXITY; it involves applying GOF in a non-standard way [13].)

The second and third terms in MDL together provide a measure of a model's complexity.
Glossary

Complexity: the property of a model that enables it to fit diverse patterns of data; it is the flexibility of a model. Although the number of parameters in a model and its functional form can be useful for gauging its complexity, a more accurate and intuitive measure is the number of distinct probability distributions that the model can generate by varying its parameters over their entire range. Details of this geometric complexity measure can be found in [a].

Functional form: the way in which the parameters (θ) and data (x) are combined in a model's equation: y = θx and y = θ + x have the same number of parameters but different functional forms (multiplicative versus additive).

Generalizability: the ability of a model to fit all data samples generated by the same cognitive process, not just the currently observed sample (i.e. the model's expected GOF with respect to new data samples). Generalizability is estimated by combining a model's GOF with a measure of its complexity.

Goodness of fit (GOF): the precision with which a model fits a particular sample of observed data. The predictions of the model are compared with the observed data. The discrepancy between the two is measured in a number of ways, such as calculating the root mean squared error between them.

Minimum Description Length (MDL): a versatile measure of generalizability. MDL was developed within algorithmic coding theory in computer science [b], where the goal of model selection is to choose the model that permits the greatest compression of data in its description. Regularities in the data are assumed to imply redundancy. The more the data can be compressed by the model by extracting this redundancy, the more that is learned about the cognitive process.

Overfitting: the case where, in addition to fitting the main trends in the data, a model also fits the microvariation from this main trend at each data point. Compare the middle and right graph inserts in Fig. 1.

Parameters: variables in a model's equation that represent mental constructs or processes; they are adjusted to improve a model's fit to data. For example, in the model y = θx, θ is a parameter.

Probability density function: a function that specifies the probability of observing each outcome of a random variable given the value of a model's parameter.

References
a Myung, I.J. et al. (2000) Counting probability distributions: differential geometry and model selection. Proc. Natl. Acad. Sci. U.S.A. 97, 11170–11175
b Li, M. and Vitanyi, P. (1997) An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag
Box 1. Why GOF alone is a poor measure of model selection

The ability of a model to fit data is a necessary condition that all models must satisfy to be taken seriously. GOF measures such as RMSE are inappropriate as model selection methods because all they do is assess fit. This myopic focus is problematic when variation in the data can be due to other factors (e.g. random sampling, individual differences) as well as the cognitive process under study.

The severity of the problem is shown in Table I, which contains the results of a model recovery simulation using RMSE. Four datasets were generated from a combination of the two models (M_A and M_B), defined as follows: M_A: y = (1+t)^{-a}; M_B: y = (b+ct)^{-a}, where a, b, c > 0. Datasets were generated from each in the frequencies shown in the four conditions (rows). In the first condition, all 100 samples were generated by M_A with a = 0.4 and only sampling error introduced as noise. In the second condition, variation due to individual differences was also added to the data by using a different parameter value (a = 0.6) half of the time. In the third condition, half of the data were generated by model M_A and half by M_B. Condition four is the reverse of condition one, with the data plus sampling error being generated by M_B. Models M_A and M_B were then fitted to the data in each condition using RMSE. The mean RMSE fits, along with the percentage of time each model provided the best fit, are shown in the two right-most columns of the table.

A good model selection method must be able to ignore irrelevant variation in the data (e.g. sampling error, individual differences that are not being modeled) and recover the model that generated the data. That is, the selection method must be capable of differentiating between the variation that the model was designed to capture and the variation due to noise. RMSE fails miserably at this task. M_B was chosen 100% of the time in all four conditions, and its mean fit is substantially better than that of M_A. M_A never provided a better fit than M_B, even when some or all of the data were generated by M_A (conditions 1–3). This is why a good fit can be bad.

Readers who have some familiarity with modeling might not be surprised by these results, given that M_B has two more parameters than M_A. A model with more parameters will always provide a better fit, all other things being equal [a]. The typical solution to this problem is to control for the number of parameters, but there are at least two reasons why this fix is unsatisfactory. The most obvious is that it limits model comparison to those situations in which the number of parameters is equated across models. The diversity of models in the cognitive sciences can make this a significant impediment to doing research. Less obvious although even more important is that a model's data-fitting abilities are also affected by other properties of the model, such as its functional form [b,c,d]. Unless they are taken into account by the selection method, simply equating the number of parameters will not place the models on an equal footing.

In summary, it may make perfect sense to use GOF to determine whether a given model can even pass the test of fitting a dataset reasonably well (i.e. capturing the main trends), but going beyond this and comparing such fits between models, although intuitive, is risky.

References
a Linhart, H. and Zucchini, W. (1986) Model Selection, Wiley
b Myung, I.J. et al. (2000) Special issue on model selection. J. Math. Psychol. 44 (1–2)
c Li, S.C. et al. (1996) Using parameter sensitivity and interdependence to predict model scope and falsifiability. J. Exp. Psychol. Gen. 125, 360–369
d Myung, I.J. and Pitt, M.A. (1997) Applying Occam's razor in modeling cognition: a Bayesian approach. Psychon. Bull. Rev. 4, 79–95
Box 2. Model selection criteria as measures of generalizability

Listed in Table II are two GOF measures (RMSE, PVAF) and four generalizability measures (AIC, BIC, BMS, MDL). Except for BMS, in which the likelihood function is integrated over the parameter space, the measures of generalizability use the maximized log-likelihood, that is, ln f(y|θ₀), as a GOF measure, whose negative represents lack of fit. This fit index is combined with a measure of model complexity to yield a generalizability measure. The four generalizability criteria differ from one another in their conceptualization of model complexity. In AIC, the number of parameters (k) is the only dimension of complexity that is considered, whereas BIC also considers sample size (n). BMS and MDL go one step further and also take into account the functional form of a model's equation. In MDL, this is reflected in the third term of the criterion equation, whereas in BMS it is hidden in the integral. These selection methods, except for PVAF, prescribe that the model that minimizes the given criterion should be chosen.

Reference
a Schervish, M.J. (1995) The Theory of Statistics, pp. 111–115, Springer-Verlag

Table II. Two GOF measures, four generalizability measures, and the dimensions of complexity to which each is sensitive

Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{SSE/N}$. Complexity considered: none.
Percent Variance Accounted For: $\mathrm{PVAF} = 100(1 - SSE/SST)$. Complexity considered: none.
Akaike Information Criterion: $\mathrm{AIC} = -2\ln f(y \mid \theta_0) + 2k$. Complexity considered: number of parameters.
Bayesian Information Criterion: $\mathrm{BIC} = -2\ln f(y \mid \theta_0) + k\ln n$. Complexity considered: number of parameters, sample size.
Bayesian Model Selection: $\mathrm{BMS} = -\ln \int f(y \mid \theta)\,\pi(\theta)\,d\theta$. Complexity considered: number of parameters, sample size, functional form.
Minimum Description Length: $\mathrm{MDL} = -\ln f(y \mid \theta_0) + (k/2)\ln(n/2\pi) + \ln \int \sqrt{\det I(\theta)}\,d\theta$. Complexity considered: number of parameters, sample size, functional form.

In the equations above, y denotes observed data, θ is the model's parameter, θ₀ is the parameter value that maximizes the likelihood function f(y|θ), k is the number of parameters, n is the sample size, N is the number of data points fitted, SSE is the minimized sum of squared errors between observations and predictions, SST is the total sum of squares, π(θ) is the parameter prior density, I(θ) is the Fisher information matrix in mathematical statistics [a], det denotes the determinant of a matrix, and ln denotes the natural logarithm.
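The AIC and BIC rows of Table II translate directly into code. The following is a minimal Python sketch; the log-likelihood values and parameter counts are illustrative numbers chosen for this example, not results from the article.

```python
import math

def aic(max_log_lik: float, k: int) -> float:
    # AIC = -2 ln f(y|theta_0) + 2k; smaller values are preferred
    return -2.0 * max_log_lik + 2.0 * k

def bic(max_log_lik: float, k: int, n: int) -> float:
    # BIC = -2 ln f(y|theta_0) + k ln(n); smaller values are preferred
    return -2.0 * max_log_lik + k * math.log(n)

# Illustrative only: model 2 fits slightly better but carries two extra parameters.
ll_1, k_1 = -52.3, 1
ll_2, k_2 = -50.9, 3
n = 50
print(aic(ll_1, k_1), aic(ll_2, k_2))        # 106.6 vs 107.8 -> AIC prefers model 1
print(bic(ll_1, k_1, n), bic(ll_2, k_2, n))  # ~108.5 vs ~113.5 -> BIC prefers model 1
```

In this hypothetical case the 1.4-unit gain in log-likelihood bought by the two extra parameters does not overcome either penalty, so the simpler model is selected, exactly the behavior described in the text.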
Table I. Results of a model recovery simulation in which a GOF measure (RMSE) was used to discriminate models when the source of the error was varied. The two right-most columns give the mean RMSE of each fitted model, with the percentage of datasets it fit best in parentheses.

Condition (sources of variation)              Data generated from                  Model fitted
                                              M_A (a=0.4)  M_A (a=0.6)  M_B       M_A          M_B
(1) Sampling error                            100          -            -         0.040 (0%)   0.029 (100%)
(2) Sampling error + individual differences   50           50           -         0.041 (0%)   0.029 (100%)
(3) Different models                          50           -            50        0.075 (0%)   0.029 (100%)
(4) Sampling error                            -            -            100       0.079 (0%)   0.029 (100%)
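The article does not spell out the implementation details of the Box 1 simulation, so the following Python sketch makes explicit assumptions: binomial sampling noise with n = 50 at the five t values given in the Table 1 footnote, parameters constrained to be positive, and Nelder-Mead with random restarts as the optimizer. It reproduces condition 1 of Table I in spirit: the three-parameter M_B should out-fit the one-parameter M_A on virtually every dataset, even though M_A generated the data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
t = np.array([0.1, 2.1, 4.1, 6.1, 8.1])   # design points from the Table 1 footnote

def m_a(p, t):
    # M_A: y = (1 + t)^(-a), one parameter
    return (1.0 + t) ** -p[0]

def m_b(p, t):
    # M_B: y = (b + c*t)^(-a), three parameters
    a, b, c = p
    return (b + c * t) ** -a

def best_rmse(model, n_params, y):
    """Smallest RMSE found over several random restarts (parameters kept positive)."""
    def loss(p):
        pred = model(np.abs(p) + 1e-9, t)
        return float(np.sqrt(np.mean((pred - y) ** 2)))
    return min(
        minimize(loss, rng.uniform(0.1, 2.0, n_params), method="Nelder-Mead").fun
        for _ in range(10)
    )

# Condition 1 of Table I: all data from M_A (a = 0.4) plus binomial sampling error.
wins_b = 0
for _ in range(100):
    y = rng.binomial(50, m_a([0.4], t)) / 50.0   # five observations, n = 50 per point
    if best_rmse(m_b, 3, y) < best_rmse(m_a, 1, y):
        wins_b += 1
print(f"M_B gave the better RMSE on {wins_b}/100 datasets")   # expect close to 100
```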
Conceptually, complexity refers to that characteristic of a model that makes it flexible and easily able to fit diverse patterns of data, usually by a small adjustment in one of its parameters. In Fig. 1, it is what enables the model (wavy line) in the lower right graph to provide a better fit to the data (dots) than that in the middle and left graphs. Both FUNCTIONAL FORM and the number of PARAMETERS contribute to model complexity.

How complexity is related to generalizability and GOF: the dilemma of Occam's razor

The notion of model complexity is a good vehicle with which to illustrate further the goal of generalizability and to distinguish it from GOF. As demonstrated in Box 1, fit is maximized with GOF. Because additional complexity will improve fit, the two are positively related; this is depicted by the top function in Fig. 1. Generalizability, on the other hand, is not related to complexity so straightforwardly. Rather, its function follows the same trajectory as that of GOF up to a certain point, after which fit declines as complexity increases.

Why do the two functions diverge? The data being fitted have a certain degree of complexity that reflects the operation of the cognitive process. This point corresponds to the peak of the generalizability function. Any additional complexity beyond that needed to capture the underlying process (to the right of the peak) will cause the model to overfit the data by also capturing the microvariation due to random error, and thus reduce generalizability. The reason the two functions overlap to the left of the peak is that the model itself must be sufficiently complex to fit the data well. The model will underfit the data if it lacks the necessary complexity (i.e. it is too simple), as illustrated in the lower left graph in Fig. 1. The dilemma faced in trying to maximize generalizability should be clear: a delicate balance must be struck between sufficient complexity on the one hand and good generalizability on the other. MDL achieves this balance by choosing the model whose complexity is most justified, considering the complexity of the data relative to the complexity of the model.
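To make the trade-off in Fig. 1 concrete, the sketch below uses polynomial regression as a stand-in for models of increasing complexity; the underlying curve, noise level, and set of degrees are illustrative assumptions, not the models discussed in the article. Fit to the sample in hand keeps improving as terms are added, whereas fit to a fresh sample from the same process typically peaks at an intermediate degree and then deteriorates.

```python
import numpy as np

rng = np.random.default_rng(1)

def process(x):
    # Stand-in "cognitive process": a smooth curve observed with random error
    return np.sin(x)

x = np.linspace(0.0, 3.0, 20)
y_fit = process(x) + rng.normal(0.0, 0.15, x.size)   # sample used for fitting
y_new = process(x) + rng.normal(0.0, 0.15, x.size)   # new sample, same process

for degree in (1, 3, 6, 10):
    coeffs = np.polyfit(x, y_fit, degree)            # more terms = more complexity
    pred = np.polyval(coeffs, x)
    gof = np.sqrt(np.mean((pred - y_fit) ** 2))      # goodness of fit
    gen = np.sqrt(np.mean((pred - y_new) ** 2))      # proxy for generalizability
    print(f"degree {degree:2d}: RMSE on fitted sample {gof:.3f}, on new sample {gen:.3f}")
```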
Selection methods at work: the proof is in the pudding
Generalizability measures like BMS and MDL have thus far been developed for testing only those models that can be described in terms of a PROBABILITY DENSITY FUNCTION. Examples of such statistical models are models of categorization and models of information integration [14,15].

The following model-recovery tests demonstrate the relative performance of the three classes of selection methods shown in Table 1. The three models described (see Table footnote) were compared. In each simulation, 1000 datasets were generated from each model, and each selection method was then tested on its ability to determine which of the three models did in fact generate each of the 3000 datasets. A good selection method should be able to discern correctly the model that generated the data.
[Fig. 1 graph: model fit (y-axis, from poor to good) plotted against model complexity (x-axis), with one curve for goodness of fit rising monotonically and one for generalizability that peaks and then declines in the overfitting region.]
Fig. 1. Relationship between goodness of fit and generalizability as a function of model complexity. The y-axis represents any fit index, where a larger value indicates a better fit (e.g. percent variance accounted for). The three smaller graphs along the x-axis show how fit improves as complexity increases. In the left graph, the model (represented by the line) is not complex enough to match the complexity of the data (dots). The two are well matched in complexity in the middle graph, which is why this point falls at the peak of the generalizability function. In the right graph, the model is more complex than the data, fitting random error: it has better goodness of fit, but is overfitting the data.
Table 1. Model recovery performance (percentage of correct recoveries) for three models using three selection methods

Selection method   Model fitted   Model the data were generated from
                                  M1      M2      M3
PVAF               M1             0       0       0
                   M2             38      97      30
                   M3             62      3       70
AIC                M1             79      0       0
                   M2             9       97      30
                   M3             12      3       70
MDL                M1             86      0       0
                   M2             1       92      8
                   M3             13      8       92

Models M1, M2 and M3 were defined as follows: M1: y = (1+t)^{-a}; M2: y = (b+t)^{-a}; M3: y = (1+bt)^{-a}. In the model equations, a and b are parameters that were adjusted to fit each model to the data, which were generated using the same five points, t = 0.1, 2.1, 4.1, 6.1, 8.1. Each sample of five observations was drawn from a binomial probability distribution of size n = 50. One thousand samples were generated from each model and served as the data to fit. Each selection method was then used to determine which model generated each of the samples. The percentage of times each model was chosen is shown.
The ideal outcome is one in which each model generalizes best only to data generated by itself. In the 3×3 matrices in Table 1, this corresponds to perfect recovery along the diagonal running from the upper left to the lower right. Errors (off-diagonal cells) reveal a bias in the selection method toward either the more or the less complex model.

The top matrix shows the results using PVAF, with the percentage of correct recoveries in each cell. The problem with using a GOF measure shows up in the first column of the matrix, where the data were generated by the one-parameter model M1. M1 never recovered its own data, with one of the two-parameter models always fitting the data best. Comparison of these data with the middle matrix shows that using AIC rectifies this problem. Because of its sensitivity to the number of parameters in a model, it does a reasonably good job of distinguishing between data generated by M1 or M2. By contrast, note how model recovery performance remains constant across these two matrices in columns 2 and 3. This is not surprising because AIC, like PVAF, is insensitive to functional form, which is the dimension along which M2 and M3 differ. Only when MDL is used is an improvement in model recovery performance observed in these columns (bottom matrix).

Why did MDL not perform perfectly, and why did it perform slightly worse than AIC and PVAF when the data came from M2 (middle column)? Recall that, like a statistical test, model selection is an inference problem. The quality of the inference depends on the information that the selection method uses. Even though MDL makes use of all of the information available (data and the model), this does not guarantee success, but it greatly improves the chances of the inference being correct [16]. MDL will outperform selection methods such as AIC and PVAF most of the time, but neither it nor its Bayesian counterpart, BMS, is infallible.

The inferential nature of model selection makes it imperative to interpret model recovery results in the context of the data that were used in the test itself. Poor performance might not indicate a problem with the selection method, but rather a constraint on its resolving power in discriminating which model could have generated the data. Conversely, biases in the selection method itself can masquerade as good model-recovery performance. In a recent comparison of BMS and RMSD, we demonstrated both of these outcomes [17].
Conclusion
Methods like AIC and MDL can give the impression that model selection can be automated and requires minimal thought. However, choosing between competing models is no easier or less subjective than choosing between competing theories. Selection methods should be viewed as tools that researchers can use to gain additional information about the models under consideration. These tools, however, are ignorant of other properties of the models, such as their plausibility or the quality of the data, so it is inappropriate to decide between models solely on the basis of what is learned from a test of generalizability.
References
1 Jacobs, A.M. and Grainger, J. (1994) Models of visual word recognition: sampling the state of the art. J. Exp. Psychol. Hum. Percept. Perform. 20, 1311–1334
2 Smith, J.D. and Minda, J.P. (2000) Thirty categorization results in search of a model. J. Exp. Psychol. Learn. Mem. Cogn. 26, 3–27
3 Wichmann, F.A. and Hill, N.J. (2001) The psychometric function: I. Fitting, sampling and goodness of fit. Percept. Psychophys. 63, 1293–1313
4 Roberts, S. and Pashler, H. (2000) How persuasive is a good fit? A comment on theory testing. Psychol. Rev. 107, 358–367
5 Bishop, C.M. (1995) Neural Networks for Pattern Recognition (Chapters 1 and 9), Oxford University Press
6 Myung, I.J. et al. (2000) Special issue on model selection. J. Math. Psychol. 44, 1–2
7 Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Petrov, B.N. and Csaki, F., eds), pp. 267–281, Akademiai Kiado, Budapest
8 Burnham, K.P. and Anderson, D.R. (1998) Model Selection and Inference: A Practical Information-Theoretic Approach (Chapter 2), Springer-Verlag
9 Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat. 6, 461–464
10 Kass, R.E. and Raftery, A.E. (1995) Bayes factors. J. Am. Stat. Assoc. 90, 773–795
11 Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Trans. Inform. Theor. 42, 40–47
12 Hansen, M.H. and Yu, B. (2001) Model selection and the principle of minimum description length. J. Am. Stat. Assoc. 96, 746–774
13 Myung, I.J. and Pitt, M.A. (in press) Model evaluation, testing and selection. In Handbook of Cognition (Lambert, K. and Goldstone, R., eds), Sage
14 Nosofsky, R.M. (1986) Attention, similarity, and the identification-categorization relationship. J. Exp. Psychol. Gen. 115, 39–57
15 Oden, G.C. and Massaro, D.W. (1978) Integration of featural information in speech perception. Psychol. Rev. 85, 172–191
16 Pitt, M.A. et al. (2002) Toward a method of selecting among computational models of cognition. Psychol. Rev. 109, 472–491
17 Pitt, M.A. et al. Flexibility versus generalizability in model selection. Psychon. Bull. Rev. (in press)
Acknowledgements
Both authors were supported by NIH Grant MH57472.