When a good fit can be bad

Mark A. Pitt* and In Jae Myung
Dept of Psychology, Ohio State University, 1885 Neil Avenue, Columbus, Ohio 43210-1222, USA.
*e-mail: [email protected]

How should we select among computational models of cognition? Although it is commonplace to measure how well each model fits the data, this is insufficient. Good fits can be misleading because they can result from properties of the model that have nothing to do with it being a close approximation to the cognitive process of interest (e.g. overfitting). Selection methods are introduced that factor in these properties when measuring fit. Their success in outperforming standard goodness-of-fit measures stems from a focus on measuring the generalizability of a model's data-fitting abilities, which should be the goal of model selection.
The explosion of interest in modeling cognitive processes over the past 20 years has fueled the cognitive sciences in many ways. Not only has it opened up new ways of thinking about research problems and possible solutions, but it has also enabled researchers to gain a better understanding of their theories by simulating a computational instantiation of them. Modeling is now sufficiently mainstream that one can get the impression that the models themselves are replacing the theories from which they evolved.
What has not kept pace with the advances and interest in modeling is the development of methods for evaluating and testing the models themselves. A model is not interchangeable with a theory, but only one of many possible quantitative representations of it. A thorough evaluation of a model requires methods that are sensitive to its quantitative form. Criteria used for evaluating theories [1], such as testing their performance in an experimental setting, do not speak to the quality of the choices that are made in building their quantitative counterparts (i.e. choice of parameters, how they are combined) or to the ramifications of those choices. The paucity of such model selection methods is surprising given the centrality of the problem itself. What could be more fundamental than deciding between two alternative explanations of a cognitive process?
How not to compare models
Mathematical models are frequently tested against one another by evaluating how well each fits the data generated in an experiment or simulation. Such a test makes sense given that one criterion of model performance is that it reproduce the data. A goodness-of-fit measure (GOF; see Glossary) is invariably used to measure their adequacy in achieving this goal. What is measured is how much a model's predictions deviate from the observed data [2,3]. The model that provides the best fit (i.e. smallest deviation) is favored. The logic of this choice rests on the assumption that the model that provides the best fit to all data must be a closer approximation to the cognitive process under investigation than its competitors [4].

Such a conclusion would be reasonable if measurements were made in a noise-free (i.e. errorless) system. One of the biggest challenges faced by cognitive scientists is that human and animal data are noisy. Error arises from several sources, such as the imprecision of our measurement tools, variation among participants, and variation in their performance over time.
The problem of random error, and the lengths researchers go to in order to combat it, are evident in the experimental and statistical methods used in the field (e.g. repeated measurement, inferential statistics). These serve to hold error in check so that variation due to the mental process of interest, which is what we are really interested in, will be visible in the data.
Noisy data make GOF by itself a poor method of model selection. As the simulation in Box 1 illustrates, a GOF measure such as the root mean squared error (RMSE) is insensitive to the different sources of variation in the data, whether random error or variation due to the cognitive process of interest. This can result in the selection of a model that overfits the data, which may not be the model that best approximates the cognitive process under study.
How to compare models
Because it is impossible to eliminate error from data, efforts have focused on improving model selection in other ways. The preferred solution has been to redefine the problem as one of assessing how well a model's fit to one data sample generalizes to future samples generated by that same process [5]. GENERALIZABILITY (see Glossary) uses the data and information about the model itself to make a best-guess estimate of how likely it is that the model could have generated the data sample in hand. In this approach, a good fit is a necessary but not sufficient condition for a model, because many models are capable of fitting a data set reasonably well. Because the set of such candidate models is potentially quite large, we can only ever infer the likelihood with which each model under consideration generated the data. In this regard, generalizability is a statistical inference problem. It is no different conceptually from estimating the replicability of an experimental result using inferential statistics, or from inferring the characteristics of a population from a sample.
A handful of generalizability measures have been proposed in the last 30 years [6]. By necessity, all include a measure of GOF to assess a model's fit to the data (see Box 2). Terms encoding information about the model itself are included to level the playing field among models, so that one model, by virtue of its design choices (i.e. mathematical instantiation), does not have an inherent advantage over its competitors in fitting the data best, and thus in being selected. By nullifying such model-specific properties, the model that is the best approximation to the cognitive process under study, and not simply the one that absorbs the most variation in the data, will be selected.
Measures of generalizability
Early measures of generalizability such as the Akaike information criterion (AIC) [7,8] and the Bayesian information criterion (BIC) [9] addressed the most salient difference among models: the number of free parameters. As is generally well known, a model with many free parameters can provide a better fit to a data sample than a model with few parameters, even if the latter generated the data. The second term in these measures includes a count of the number of parameters (k). AIC and BIC penalize a model more as the number of parameters increases. To be selected, the model with more parameters must overcome this penalty and provide a substantially better fit than a model with fewer parameters. That is, the superior fit obtained with the extra parameters must justify the necessity of those parameters in fully capturing the cognitive process.
An equally salient but much less tangible dimension along which models differ is their functional form, which refers to the way in which the parameters are combined in a model's equation. More sophisticated selection methods, such as Bayesian model selection (BMS) [10] and MINIMUM DESCRIPTION LENGTH (MDL) [11,12], are sensitive to a model's functional form as well as to the number of parameters. Functional form is taken into account in the third term of the MDL measure. In BMS, both are hidden in the integral. (Cross-validation, although not listed, is another selection method that is thought to be sensitive to both dimensions of model COMPLEXITY; it involves applying GOF in a non-standard way [13].)

The second and third terms in MDL together provide a measure of a model's complexity.
Glossary

Complexity: the property of a model that enables it to fit diverse patterns of data; it is the flexibility of a model. Although the number of parameters in a model and its functional form can be useful for gauging its complexity, a more accurate and intuitive measure is the number of distinct probability distributions that the model can generate by varying its parameters over their entire range. Details of this geometric complexity measure can be found in [a].

Functional form: the way in which the parameters (θ) and data (x) are combined in a model's equation: y = θx and y = θ + x have the same number of parameters but different functional forms (multiplicative versus additive).

Generalizability: the ability of a model to fit all data samples generated by the same cognitive process, not just the currently observed sample (i.e. the model's expected GOF with respect to new data samples). Generalizability is estimated by combining a model's GOF with a measure of its complexity.

Goodness of fit (GOF): the precision with which a model fits a particular sample of observed data. The predictions of the model are compared with the observed data. The discrepancy between the two is measured in a number of ways, such as calculating the root mean squared error between them.

Minimum Description Length (MDL): a versatile measure of generalizability. MDL was developed within algorithmic coding theory in computer science [b], where the goal of model selection is to choose the model that permits the greatest compression of data in its description. Regularities in the data are assumed to imply redundancy. The more the data can be compressed by the model by extracting this redundancy, the more that is learned about the cognitive process.

Overfitting: the case where, in addition to fitting the main trends in the data, a model also fits the microvariation from this main trend at each data point. Compare the middle and right graph inserts in Fig. 1.

Parameters: variables in a model's equation that represent mental constructs or processes; they are adjusted to improve a model's fit to data. For example, in the model y = θx, θ is a parameter.

Probability density function: a function that specifies the probability of observing each outcome of a random variable given the value of a model's parameter.

References
a Myung, I.J. et al. (2000) Counting probability distributions: differential geometry and model selection. Proc. Natl. Acad. Sci. U.S.A. 97, 11170–11175
b Li, M. and Vitanyi, P. (1997) An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag
Box 1. Why GOF alone is a poor measure of model selection

The ability of a model to fit data is a necessary condition that all models must satisfy to be taken seriously. GOF measures such as RMSE are inappropriate as model selection methods because all they do is assess fit. This myopic focus is problematic when variation in the data can be due to other factors (e.g. random sampling, individual differences) as well as the cognitive process under study.

The severity of the problem is shown in Table I, which contains the results of a model recovery simulation using RMSE. Four datasets were generated from a combination of the two models (M_A and M_B), defined as follows: M_A: y = (1+t)^{-a}; M_B: y = (b+ct)^{-a}, where a, b, c > 0. Datasets were generated from each in the frequencies shown in the four conditions (rows). In the first condition, all 100 samples were generated by M_A with a = 0.4 and only sampling error introduced as noise. In the second condition, variation due to individual differences was also added to the data by using a different parameter value (a = 0.6) half of the time. In the third condition, half of the data were generated by model M_A and half by M_B. Condition four is the reverse of condition one, with the data plus sampling error being generated by M_B. Models M_A and M_B were then fitted to the data in each condition using RMSE. The mean RMSE fits, along with the percentage of time each model provided the best fit, are shown in the two right-most columns of the table.

A good model selection method must be able to ignore irrelevant variation in the data (e.g. sampling error, individual differences that are not being modeled) and recover the model that generated the data. That is, the selection method must be capable of differentiating between the variation that the model was designed to capture and the variation due to noise. RMSE fails miserably at this task. M_B was chosen 100% of the time in all four conditions, and its mean fit is substantially better than that of M_A. M_A never provided a better fit than M_B, even when some or all of the data were generated by M_A (conditions 1–3). This is why a good fit can be bad.

Readers who have some familiarity with modeling might not be surprised by these results, given that M_B has two more parameters than M_A. A model with more parameters will always provide a better fit, all other things being equal [a]. The typical solution to this problem is to control for the number of parameters, but there are at least two reasons why this fix is unsatisfactory. The most obvious is that it limits model comparison to those situations in which the number of parameters is equated across models. The diversity of models in the cognitive sciences can make this a significant impediment to doing research. Less obvious although even more important is that a model's data-fitting abilities are also affected by other properties of the model, such as its functional form [b,c,d]. Unless they are taken into account by the selection method, simply equating the number of parameters will not place the models on an equal footing.

In summary, it may make perfect sense to use GOF to determine whether a given model can even pass the test of fitting a dataset reasonably well (i.e. capturing the main trends), but going beyond this and comparing such fits between models, although intuitive, is risky.

References
a Linhart, H. and Zucchini, W. (1986) Model Selection, Wiley
b Myung, I.J. et al. (2000) Special issue on model selection. J. Math. Psychol. 44 (1–2)
c Li, S.C. et al. (1996) Using parameter sensitivity and interdependence to predict model scope and falsifiability. J. Exp. Psychol. Gen. 125, 360–369
d Myung, I.J. and Pitt, M.A. (1997) Applying Occam's razor in modeling cognition: a Bayesian approach. Psychon. Bull. Rev. 4, 79–95
Box 2. Model selection criteria as measures of generalizability

Listed in Table II are two GOF measures (RMSE, PVAF) and four generalizability measures (AIC, BIC, BMS, MDL). Except for BMS, in which the likelihood function is integrated over the parameter space, the measures of generalizability use the maximized log-likelihood, that is, ln f(y|θ₀), as a GOF measure, whose negative represents lack of fit. This fit index is combined with a measure of model complexity to yield a generalizability measure. The four generalizability criteria differ from one another in their conceptualization of model complexity. In AIC, the number of parameters (k) is the only dimension of complexity that is considered, whereas BIC also considers sample size (n). BMS and MDL go one step further and also take into account the functional form of a model's equation. In MDL, this is reflected in the third term of the criterion equation, whereas in BMS it is hidden in the integral. These selection methods, except for PVAF, prescribe that the model that minimizes the given criterion should be chosen.

Reference
a Schervish, M.J. (1995) The Theory of Statistics, pp. 111–115, Springer-Verlag

Table II. Two GOF measures, four generalizability measures, and the dimensions of complexity to which each is sensitive

Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{SSE/N}$. Complexity considered: none.
Percent Variance Accounted For: $\mathrm{PVAF} = 100(1 - SSE/SST)$. Complexity considered: none.
Akaike Information Criterion: $\mathrm{AIC} = -2\ln f(y \mid \theta_0) + 2k$. Complexity considered: number of parameters.
Bayesian Information Criterion: $\mathrm{BIC} = -2\ln f(y \mid \theta_0) + k\ln n$. Complexity considered: number of parameters, sample size.
Bayesian Model Selection: $\mathrm{BMS} = -\ln \int f(y \mid \theta)\,\pi(\theta)\,d\theta$. Complexity considered: number of parameters, sample size, functional form.
Minimum Description Length: $\mathrm{MDL} = -\ln f(y \mid \theta_0) + (k/2)\ln(n/2\pi) + \ln \int \sqrt{\det I(\theta)}\,d\theta$. Complexity considered: number of parameters, sample size, functional form.

In the equations above, y denotes observed data, θ is the model's parameter, θ₀ is the parameter value that maximizes the likelihood function f(y|θ), k is the number of parameters, n is the sample size, N is the number of data points fitted, SSE is the minimized sum of squared errors between observations and predictions, SST is the total sum of squares, π(θ) is the parameter prior density, I(θ) is the Fisher information matrix in mathematical statistics [a], det denotes the determinant of a matrix, and ln denotes the natural logarithm.
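The AIC and BIC rows of Table II translate directly into code. The following is a minimal Python sketch; the log-likelihood values and parameter counts are illustrative numbers chosen for this example, not results from the article.

```python
import math

def aic(max_log_lik: float, k: int) -> float:
    # AIC = -2 ln f(y|theta_0) + 2k; smaller values are preferred
    return -2.0 * max_log_lik + 2.0 * k

def bic(max_log_lik: float, k: int, n: int) -> float:
    # BIC = -2 ln f(y|theta_0) + k ln(n); smaller values are preferred
    return -2.0 * max_log_lik + k * math.log(n)

# Illustrative only: model 2 fits slightly better but carries two extra parameters.
ll_1, k_1 = -52.3, 1
ll_2, k_2 = -50.9, 3
n = 50
print(aic(ll_1, k_1), aic(ll_2, k_2))        # 106.6 vs 107.8 -> AIC prefers model 1
print(bic(ll_1, k_1, n), bic(ll_2, k_2, n))  # ~108.5 vs ~113.5 -> BIC prefers model 1
```

In this hypothetical case the 1.4-unit gain in log-likelihood bought by the two extra parameters does not overcome either penalty, so the simpler model is selected, exactly the behavior described in the text.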
Table I. Results of a model recovery simulation in which a GOF measure (RMSE) was used to discriminate models when the source of the error was varied. The two right-most columns give the mean RMSE of each fitted model, with the percentage of datasets it fit best in parentheses.

Condition (sources of variation)              Data generated from                  Model fitted
                                              M_A (a=0.4)  M_A (a=0.6)  M_B       M_A          M_B
(1) Sampling error                            100          -            -         0.040 (0%)   0.029 (100%)
(2) Sampling error + individual differences   50           50           -         0.041 (0%)   0.029 (100%)
(3) Different models                          50           -            50        0.075 (0%)   0.029 (100%)
(4) Sampling error                            -            -            100       0.079 (0%)   0.029 (100%)
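The article does not spell out the implementation details of the Box 1 simulation, so the following Python sketch makes explicit assumptions: binomial sampling noise with n = 50 at the five t values given in the Table 1 footnote, parameters constrained to be positive, and Nelder-Mead with random restarts as the optimizer. It reproduces condition 1 of Table I in spirit: the three-parameter M_B should out-fit the one-parameter M_A on virtually every dataset, even though M_A generated the data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
t = np.array([0.1, 2.1, 4.1, 6.1, 8.1])   # design points from the Table 1 footnote

def m_a(p, t):
    # M_A: y = (1 + t)^(-a), one parameter
    return (1.0 + t) ** -p[0]

def m_b(p, t):
    # M_B: y = (b + c*t)^(-a), three parameters
    a, b, c = p
    return (b + c * t) ** -a

def best_rmse(model, n_params, y):
    """Smallest RMSE found over several random restarts (parameters kept positive)."""
    def loss(p):
        pred = model(np.abs(p) + 1e-9, t)
        return float(np.sqrt(np.mean((pred - y) ** 2)))
    return min(
        minimize(loss, rng.uniform(0.1, 2.0, n_params), method="Nelder-Mead").fun
        for _ in range(10)
    )

# Condition 1 of Table I: all data from M_A (a = 0.4) plus binomial sampling error.
wins_b = 0
for _ in range(100):
    y = rng.binomial(50, m_a([0.4], t)) / 50.0   # five observations, n = 50 per point
    if best_rmse(m_b, 3, y) < best_rmse(m_a, 1, y):
        wins_b += 1
print(f"M_B gave the better RMSE on {wins_b}/100 datasets")   # expect close to 100
```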
Conceptually, complexity refers to that characteristic of a model that makes it flexible and easily able to fit diverse patterns of data, usually by a small adjustment in one of its parameters. In Fig. 1, it is what enables the model (wavy line) in the lower right graph to provide a better fit to the data (dots) than that in the middle and left graphs. Both FUNCTIONAL FORM and the number of PARAMETERS contribute to model complexity.

How complexity is related to generalizability and GOF: the dilemma of Occam's razor

The notion of model complexity is a good vehicle with which to illustrate further the goal of generalizability and to distinguish it from GOF. As demonstrated in Box 1, fit is maximized with GOF. Because additional complexity will improve fit, the two are positively related; this is depicted by the top function in Fig. 1. Generalizability, on the other hand, is not related to complexity so straightforwardly. Rather, its function follows the same trajectory as that of GOF up to a certain point, after which fit declines as complexity increases.

Why do the two functions diverge? The data being fitted have a certain degree of complexity that reflects the operation of the cognitive process. This point corresponds to the peak of the generalizability function. Any additional complexity beyond that needed to capture the underlying process (to the right of the peak) will cause the model to overfit the data by also capturing the microvariation due to random error, and thus reduce generalizability. The reason the two functions overlap to the left of the peak is that the model itself must be sufficiently complex to fit the data well. The model will underfit the data if it lacks the necessary complexity (i.e. it is too simple), as illustrated in the lower left graph in Fig. 1. The dilemma faced in trying to maximize generalizability should be clear: a delicate balance must be struck between sufficient complexity on the one hand and good generalizability on the other. MDL achieves this balance by choosing the model whose complexity is most justified, considering the complexity of the data relative to the complexity of the model.
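To make the trade-off in Fig. 1 concrete, the sketch below uses polynomial regression as a stand-in for models of increasing complexity; the underlying curve, noise level, and set of degrees are illustrative assumptions, not the models discussed in the article. Fit to the sample in hand keeps improving as terms are added, whereas fit to a fresh sample from the same process typically peaks at an intermediate degree and then deteriorates.

```python
import numpy as np

rng = np.random.default_rng(1)

def process(x):
    # Stand-in "cognitive process": a smooth curve observed with random error
    return np.sin(x)

x = np.linspace(0.0, 3.0, 20)
y_fit = process(x) + rng.normal(0.0, 0.15, x.size)   # sample used for fitting
y_new = process(x) + rng.normal(0.0, 0.15, x.size)   # new sample, same process

for degree in (1, 3, 6, 10):
    coeffs = np.polyfit(x, y_fit, degree)            # more terms = more complexity
    pred = np.polyval(coeffs, x)
    gof = np.sqrt(np.mean((pred - y_fit) ** 2))      # goodness of fit
    gen = np.sqrt(np.mean((pred - y_new) ** 2))      # proxy for generalizability
    print(f"degree {degree:2d}: RMSE on fitted sample {gof:.3f}, on new sample {gen:.3f}")
```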
Selection methods at work: the proof is in the pudding
Generalizability measures like BMS and MDL have thus far been developed for testing only those models that can be described in terms of a PROBABILITY DENSITY FUNCTION. Examples of such statistical models are models of categorization and models of information integration [14,15].

The following model-recovery tests demonstrate the relative performance of the three classes of selection methods shown in Table 1. The three models described (see Table footnote) were compared. In each simulation, 1000 datasets were generated from each model, and each selection method was then tested on its ability to determine which of the three models did in fact generate each of the 3000 datasets. A good selection method should be able to discern correctly the model that generated the data.
[Fig. 1 graph: model fit (y-axis, from poor to good) plotted against model complexity (x-axis), with one curve for goodness of fit rising monotonically and one for generalizability that peaks and then declines in the overfitting region.]
Fig. 1. Relationship between goodness of fit and generalizability as a function of model complexity. The y-axis represents any fit index, where a larger value indicates a better fit (e.g. percent variance accounted for). The three smaller graphs along the x-axis show how fit improves as complexity increases. In the left graph, the model (represented by the line) is not complex enough to match the complexity of the data (dots). The two are well matched in complexity in the middle graph, which is why this point falls at the peak of the generalizability function. In the right graph, the model is more complex than the data, fitting random error: it has better goodness of fit, but is overfitting the data.
Table 1. Model recovery performance (percentage of correct recoveries) for three models using three selection methods

Selection method   Model fitted   Model the data were generated from
                                  M1      M2      M3
PVAF               M1             0       0       0
                   M2             38      97      30
                   M3             62      3       70
AIC                M1             79      0       0
                   M2             9       97      30
                   M3             12      3       70
MDL                M1             86      0       0
                   M2             1       92      8
                   M3             13      8       92

Models M1, M2 and M3 were defined as follows: M1: y = (1+t)^{-a}; M2: y = (b+t)^{-a}; M3: y = (1+bt)^{-a}. In the model equations, a and b are parameters that were adjusted to fit each model to the data, which were generated using the same five points, t = 0.1, 2.1, 4.1, 6.1, 8.1. Each sample of five observations was drawn from a binomial probability distribution of size n = 50. One thousand samples were generated from each model and served as the data to fit. Each selection method was then used to determine which model generated each of the samples. The percentage of times each model was chosen is shown.
The ideal outcome is one in which each model generalizes best only to data generated by itself. In the 3×3 matrices in Table 1, this corresponds to perfect recovery along the diagonal running from the upper left to the lower right. Errors (off-diagonal cells) reveal a bias in the selection method toward either the more or the less complex model.

The top matrix shows the results using PVAF, with the percentage of correct recoveries in each cell. The problem with using a GOF measure shows up in the first column of the matrix, where the data were generated by the one-parameter model M1. M1 never recovered its own data, with one of the two-parameter models always fitting the data best. Comparison of these data with the middle matrix shows that using AIC rectifies this problem. Because of its sensitivity to the number of parameters in a model, it does a reasonably good job of distinguishing between data generated by M1 or M2. By contrast, note how model recovery performance remains constant across these two matrices in columns 2 and 3. This is not surprising because AIC, like PVAF, is insensitive to functional form, which is the dimension along which M2 and M3 differ. Only when MDL is used is an improvement in model recovery performance observed in these columns (bottom matrix).

Why did MDL not perform perfectly, and why did it perform slightly worse than AIC and PVAF when the data came from M2 (middle column)? Recall that, like a statistical test, model selection is an inference problem. The quality of the inference depends on the information that the selection method uses. Even though MDL makes use of all of the information available (data and the model), this does not guarantee success, but it greatly improves the chances of the inference being correct [16]. MDL will outperform selection methods such as AIC and PVAF most of the time, but neither it nor its Bayesian counterpart, BMS, is infallible.

The inferential nature of model selection makes it imperative to interpret model recovery results in the context of the data that were used in the test itself. Poor performance might not indicate a problem with the selection method, but rather a constraint on its resolving power in discriminating which model could have generated the data. Conversely, biases in the selection method itself can masquerade as good model-recovery performance. In a recent comparison of BMS and RMSD, we demonstrated both of these outcomes [17].
Conclusion
Methods like AIC and MDL can give the impression that model selection can be automated and requires minimal thought. However, choosing between competing models is no easier or less subjective than choosing between competing theories. Selection methods should be viewed as tools that researchers can use to gain additional information about the models under consideration. These tools, however, are ignorant of other properties of the models, such as their plausibility or the quality of the data, so it is inappropriate to decide between models solely on the basis of what is learned from a test of generalizability.
References
1 Jacobs, A.M. and Grainger, J. (1994) Models of visual word recognition: sampling the state of the art. J. Exp. Psychol. Hum. Percept. Perform. 20, 1311–1334
2 Smith, J.D. and Minda, J.P. (2000) Thirty categorization results in search of a model. J. Exp. Psychol. Learn. Mem. Cogn. 26, 3–27
3 Wichmann, F.A. and Hill, N.J. (2001) The psychometric function: I. Fitting, sampling and goodness of fit. Percept. Psychophys. 63, 1293–1313
4 Roberts, S. and Pashler, H. (2000) How persuasive is a good fit? A comment on theory testing. Psychol. Rev. 107, 358–367
5 Bishop, C.M. (1995) Neural Networks for Pattern Recognition (Chapters 1 and 9), Oxford University Press
6 Myung, I.J. et al. (2000) Special issue on model selection. J. Math. Psychol. 44, 1–2
7 Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Petrov, B.N. and Csaki, F., eds), pp. 267–281, Akademiai Kiado, Budapest
8 Burnham, K.P. and Anderson, D.R. (1998) Model Selection and Inference: A Practical Information-Theoretic Approach (Chapter 2), Springer-Verlag
9 Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat. 6, 461–464
10 Kass, R.E. and Raftery, A.E. (1995) Bayes factors. J. Am. Stat. Assoc. 90, 773–795
11 Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Trans. Inform. Theor. 42, 40–47
12 Hansen, M.H. and Yu, B. (2001) Model selection and the principle of minimum description length. J. Am. Stat. Assoc. 96, 746–774
13 Myung, I.J. and Pitt, M.A. (in press) Model evaluation, testing and selection. In Handbook of Cognition (Lambert, K. and Goldstone, R., eds), Sage
14 Nosofsky, R.M. (1986) Attention, similarity, and the identification-categorization relationship. J. Exp. Psychol. Gen. 115, 39–57
15 Oden, G.C. and Massaro, D.W. (1978) Integration of featural information in speech perception. Psychol. Rev. 85, 172–191
16 Pitt, M.A. et al. (2002) Toward a method of selecting among computational models of cognition. Psychol. Rev. 109, 472–491
17 Pitt, M.A. et al. Flexibility versus generalizability in model selection. Psychon. Bull. Rev. (in press)
Acknowledgements
Both authors were supported by NIH Grant MH57472.