adecuacion del modelo en reg

Upload: jairo-rueda

Post on 14-Apr-2018


  • 7/30/2019 Adecuacion Del Modelo en Reg


    Chapter 4 Model Adequacy Checking

    Ray-Bing Chen

    Institute of Statistics, National University of Kaohsiung


    4.1 Introduction

    The major assumptions:

    1. The relationship between y and the x's is linear.

    2. The error term $\varepsilon$ has zero mean.

    3. The error term $\varepsilon$ has constant variance $\sigma^2$.

    4. The errors are uncorrelated.

    5. The errors are normally distributed.

    Assumptions 4 and 5 together imply that the errors are independent. Assumption 5 is required for hypothesis testing and confidence intervals.


    Gross violations of the assumptions may yield an unstable model: a different sample could lead to a totally different model with opposite conclusions.

    The standard summary statistics (the t and F statistics and $R^2$) cannot detect departures from the underlying assumptions.

    The diagnostic methods in this chapter are based on the study of the model residuals.


    4.2 Residual Analysis

    4.2.1 Definition of Residuals

    Residual:

    $e_i = y_i - \hat{y}_i, \quad i = 1, \ldots, n$

    The deviation between the data and the fit.

    A measure of the variability in the response variable not explained by the regression model.

    The realized or observed values of the model errors.

    Plot the residuals to check the model assumptions.


    Properties:


    4.2.2 Methods for Scaling Residuals

    Scaling the residuals is helpful in detecting outliers or extreme values.

    Standardized Residuals:

    $d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, \ldots, n$


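    Studentized Residuals:

    The residual vector:

    $e = (I - H)y$,

    where $H = X(X'X)^{-1}X'$ is the hat matrix.

    Since H is symmetric and idempotent, (I - H) is also symmetric and idempotent. Because $(I - H)X = 0$, we have $e = (I - H)\varepsilon$. Then

    $\mathrm{Var}(e) = (I - H)\,\mathrm{Var}(\varepsilon)\,(I - H)' = \sigma^2 (I - H)$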


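    Hence

    $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$

    $\mathrm{Cov}(e_i, e_j) = -\sigma^2 h_{ij}$

    Studentized residuals:

    $r_i = \frac{e_i}{\sqrt{MS_{Res} (1 - h_{ii})}}, \quad i = 1, \ldots, n$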


    $|r_i| \ge |d_i|$, since $0 < 1 - h_{ii} \le 1$.

    Constant variance: $\mathrm{Var}(r_i) = 1$.

    $r_i$ and $d_i$ often differ little and convey equivalent information.

    PRESS Residuals: The prediction error (PRESS residual) is

    $e_{(i)} = y_i - \hat{y}_{(i)}$

    where $\hat{y}_{(i)}$ is the fitted value of the ith response based on all observations except the ith one.

    From Appendix C.7,

    $e_{(i)} = \frac{e_i}{1 - h_{ii}}$


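    The variance of the ith PRESS residual is

    $\mathrm{Var}[e_{(i)}] = \mathrm{Var}\left[\frac{e_i}{1 - h_{ii}}\right] = \frac{\sigma^2 (1 - h_{ii})}{(1 - h_{ii})^2} = \frac{\sigma^2}{1 - h_{ii}}$

    A standardized PRESS residual:

    $\frac{e_{(i)}}{\sqrt{\mathrm{Var}[e_{(i)}]}} = \frac{e_i / (1 - h_{ii})}{\sqrt{\sigma^2 / (1 - h_{ii})}} = \frac{e_i}{\sqrt{\sigma^2 (1 - h_{ii})}}$

    A studentized PRESS residual (with $\sigma^2$ estimated by $MS_{Res}$):

    $\frac{e_{(i)}}{\sqrt{\widehat{\mathrm{Var}}[e_{(i)}]}} = \frac{e_i}{\sqrt{MS_{Res} (1 - h_{ii})}}$

    This is the same quantity as the studentized residual $r_i$.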


    R-Student:

    Estimate the variance from a data set with the ith observation removed. From Appendix C.8,

    $S_{(i)}^2 = \frac{(n - p) MS_{Res} - e_i^2 / (1 - h_{ii})}{n - p - 1}$

    R-student:

    $t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, \ldots, n$

    If the ith observation is influential, then $S_{(i)}^2$ can differ significantly from $MS_{Res}$, and thus the R-student statistic will be more sensitive to this point.
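    The scaled residuals of Section 4.2.2 can be computed directly from their formulas. The following is a minimal sketch, assuming the simple linear regression case only (p = 2), where the hat diagonals have the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The function name `scaled_residuals` and the data are made up for illustration; they are not the delivery time data.

```python
import math

def scaled_residuals(x, y):
    """Scaled residuals for the simple linear regression y = b0 + b1*x + error."""
    n, p = len(x), 2                      # p = number of parameters (b0 and b1)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar

    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]      # ordinary residuals e_i
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]       # hat diagonals h_ii
    ms_res = sum(ei ** 2 for ei in e) / (n - p)

    d = [ei / math.sqrt(ms_res) for ei in e]                          # standardized
    r = [ei / math.sqrt(ms_res * (1 - hi)) for ei, hi in zip(e, h)]   # studentized
    press = [ei / (1 - hi) for ei, hi in zip(e, h)]                   # PRESS residuals
    # Variance estimate with observation i removed, then R-student
    s2 = [((n - p) * ms_res - ei ** 2 / (1 - hi)) / (n - p - 1)
          for ei, hi in zip(e, h)]
    t = [ei / math.sqrt(s2i * (1 - hi)) for ei, s2i, hi in zip(e, s2, h)]
    return {"b0": b0, "b1": b1, "e": e, "h": h,
            "d": d, "r": r, "press": press, "t": t}

# Made-up illustrative data; the last response is deliberately unusual.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
out = scaled_residuals(x, y)
```

    Comparing `out["r"]` with `out["t"]` point by point shows how much the leave-one-out variance estimate changes each scaled residual.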


    Example 4.1 The Delivery Time Data

    See Table 4.1


    4.2.3 Residual Plots

    Graphical analysis is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions.

    Normal Probability Plot:

    If the errors come from a distribution with thicker or heavier tails than the normal, the LS fit may be sensitive to a small subset of the data.

    Heavy-tailed error distributions often generate outliers that pull the LS fit too much in their direction.


    Normal probability plot: a simple way to check the normality assumption.

    Ranked residuals: $e_{[1]} < e_{[2]} < \cdots < e_{[n]}$

    Plot $e_{[i]}$ against $P_i = (i - 1/2)/n$.

    Sometimes $e_{[i]}$ is plotted against $\Phi^{-1}[(i - 1/2)/n]$ instead.

    If the $e_{[i]}$ are normal, the plot is nearly a straight line for large samples (n > 32).

    Small sample (n
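    The plotting positions for a normal probability plot are simple to compute. The sketch below ranks the residuals and pairs each with $P_i = (i - 1/2)/n$ and with the normal score $\Phi^{-1}(P_i)$, using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name and the residual values are made up for illustration.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (ranked residual, P_i, normal score) triples for a normal probability plot."""
    n = len(residuals)
    ranked = sorted(residuals)                  # e_[1] <= ... <= e_[n]
    standard_normal = NormalDist()              # mean 0, standard deviation 1
    points = []
    for i, e in enumerate(ranked, start=1):
        p_i = (i - 0.5) / n                     # cumulative probability P_i
        points.append((e, p_i, standard_normal.inv_cdf(p_i)))
    return points

# Made-up residuals, for illustration only.
points = normal_plot_points([0.3, -1.2, 0.1, 2.0, -0.4])
```

    Plotting the first element of each triple against the third gives the version of the plot that uses $\Phi^{-1}$; points close to a straight line are consistent with normality.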


    Fitting the parameters tends to destroy the evidence of nonnormality in the residuals, so we cannot always rely on the normal probability plot to detect departures from normality.

    A common defect is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.

    Example 4.2 The Delivery Time Data

    Fig 4.2 (a) The original LS residuals

    Fig 4.2 (b) The R-student residuals

    There may be one or more outliers in the data.


    Plot of Residuals against the Fitted Values:

    Plot $e_i$ against $\hat{y}_i$, not against $y_i$: $e_i$ and $\hat{y}_i$ are uncorrelated, whereas $e_i$ and $y_i$ are correlated.

    From Fig 4.3:

    1. Fig 4.3a: Satisfactory.

    2. Fig 4.3b: The variance is an increasing function of y.

    3. Fig 4.3c: Often occurs when y is a proportion between 0 and 1.

    4. Fig 4.3d: Indicates nonlinearity.

    For 2 and 3, apply a suitable transformation to either the regressor or the response variable, or use the method of weighted LS.

    For 4, in addition to the above two methods, other regressors may be needed in the model.
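    The reason for plotting residuals against $\hat{y}_i$ rather than $y_i$ can be checked numerically. The sketch below fits a simple linear regression to made-up data and computes the sample correlation of the residuals with the fitted values and with the observed responses. The helper `pearson` and the data are illustrative assumptions only.

```python
def pearson(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.1, 5.2, 5.8, 8.4, 8.9, 11.2, 11.6]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]               # fitted values
e = [yi - fi for yi, fi in zip(y, yhat)]        # residuals

corr_e_yhat = pearson(e, yhat)   # zero up to rounding error
corr_e_y = pearson(e, y)         # positive whenever the fit is not perfect
```

    A plot of $e_i$ against $y_i$ would therefore show an upward trend even when the model is adequate, which is why the fitted values are used on the horizontal axis.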


    Example 4.3 The Delivery Time Data

    Fig 4.4 (a): $e_i$ v.s. $\hat{y}_i$

    Fig 4.4 (b): $t_i$ v.s. $\hat{y}_i$

    Neither plot exhibits any strong unusual pattern.

    Plot of Residuals against the Regressor:

    These plots often exhibit patterns such as those in Fig 4.3.

    In the simple linear regression case, it is not necessary to plot residuals v.s. both the fitted values and the regressor.


    Example 4.4 The Delivery Time Data

    Fig 4.5(a): Plot of R-student v.s. cases

    Fig 4.5(b): Plot of R-student v.s. distance

    Plot of Residuals in Time Sequence:

    The time sequence plot of residuals may indicate that the errors at one time period are correlated with those at other time periods.

    Autocorrelation: The correlation between

    model errors at different time periods.

    Fig 4.6(a) positive autocorrelation

    Fig 4.6(b) negative autocorrelation


    4.2.4 Partial Regression and Partial Residual Plots

    The plots in Section 4.2.3 may not completely show the correct or complete marginal effect of a regressor given the other regressors in the model.

    A partial regression plot (added variable plot or adjusted variable plot) is a variation of the plot of residuals v.s. the predictor.

    This plot can be used to provide information about the marginal usefulness of a variable that is not currently in the model.

    This plot considers the marginal role of $x_j$ given the other regressors that are already in the model.


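    Consider $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. We are concerned with the relationship between y and $x_1$.

    First regress y on $x_2$, and obtain the fitted values and residuals:

    $\hat{y}_i(x_2) = \hat{\theta}_0 + \hat{\theta}_1 x_{i2}$

    $e_i(y|x_2) = y_i - \hat{y}_i(x_2), \quad i = 1, 2, \ldots, n$

    Then regress $x_1$ on $x_2$, and calculate the residuals:

    $\hat{x}_{i1}(x_2) = \hat{\alpha}_0 + \hat{\alpha}_1 x_{i2}$

    $e_i(x_1|x_2) = x_{i1} - \hat{x}_{i1}(x_2), \quad i = 1, 2, \ldots, n$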


    Plot the y residuals $e_i(y|x_2)$ against the $x_1$ residuals $e_i(x_1|x_2)$.

    If the regressor $x_1$ enters the model linearly, then the partial regression plot should show a linear relationship.

    If the partial regression plot shows a curvilinear band, then higher-order terms in $x_1$ or a transformation may be helpful.

    When $x_1$ is a candidate variable being considered for inclusion in the model, a horizontal band indicates that there is no additional useful information in $x_1$ for predicting y.
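    The two-step construction of the partial regression plot can be sketched in a few lines. The example below regresses y on $x_2$ and $x_1$ on $x_2$, fits a line through the two sets of residuals, and compares its slope with the coefficient of $x_1$ in the full two-regressor model. The helper names and the data are made up for illustration; the full-model coefficient is obtained from the centered normal equations.

```python
def simple_fit(x, y):
    """Least-squares intercept and slope for regressing y on one regressor x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    b0, b1 = simple_fit(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def beta1_full_model(x1, x2, y):
    """Coefficient of x1 in the LS fit of y on (1, x1, x2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Made-up illustrative data (not the delivery time data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
y = [3.1, 7.2, 6.8, 11.1, 10.9, 15.2, 14.8, 17.1]

e_y = residuals(x2, y)       # e(y | x2)
e_x1 = residuals(x2, x1)     # e(x1 | x2)
intercept, slope = simple_fit(e_x1, e_y)   # line through the partial regression plot
beta1 = beta1_full_model(x1, x2, y)        # agrees with slope up to rounding
```

    The fitted line passes through the origin because both sets of residuals have mean zero.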


    Example 4.5 The Delivery Time Data

    Fig 4.7 (a) for $x_1$

    Fig 4.7 (b) for $x_2$

    The linear relationship for both cases and distance is clearly evident in both of these plots.

    Obs. 9 falls somewhat off the straight line that apparently describes the rest of the data well.


    Some Comments on Partial Regression Plots:

    Partial regression plots need to be used with caution, as they only suggest a possible relationship between the regressor and the response.

    These plots may not give information about the proper form of the relationship if several variables already in the model are incorrectly specified.

    It will usually be necessary to investigate several alternate forms for the relationship between the regressor and y, or several transformations.

    Residual plots for these subsequent models should be examined to identify the best relationship or transformation.


    Partial regression plots will not, in general, detect

    interaction effects among the regressors.

    The presence of strong multicollinearity can cause partial regression plots to give incorrect information about the relationship between the response and the regressor variables.

    It is fairly easy to give a general development of

    the partial regression plotting concept that shows

    clearly why the slope of the plot should be the

    regression coefficient for the variable of interest.


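    Partial regression plot: $e[y|X_{(j)}]$ v.s. $e[x_j|X_{(j)}]$

    $y = X\beta + \varepsilon = X_{(j)}\beta_{(j)} + \beta_j x_j + \varepsilon$

    where $X_{(j)}$ is X with the regressor $x_j$ removed, $\beta_{(j)}$ holds the corresponding coefficients, and $H_{(j)}$ is the hat matrix based on $X_{(j)}$.

    $(I - H_{(j)}) y = (I - H_{(j)}) X_{(j)}\beta_{(j)} + \beta_j (I - H_{(j)}) x_j + (I - H_{(j)}) \varepsilon$

    Since $(I - H_{(j)}) X_{(j)} = 0$,

    $(I - H_{(j)}) y = \beta_j (I - H_{(j)}) x_j + (I - H_{(j)}) \varepsilon$

    That is, $e[y|X_{(j)}] = \beta_j\, e[x_j|X_{(j)}] + \varepsilon^*$, where $\varepsilon^* = (I - H_{(j)})\varepsilon$.

    This suggests that a partial regression plot should have slope $\beta_j$.

    Partial Residual plot:

    The partial residual for regressor $x_j$:

    $e_i^*(y|x_j) = e_i + \hat{\beta}_j x_{ij}, \quad i = 1, \ldots, n$

    where the $e_i$ are the residuals of the full model.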


    4.2.5 Other Residual Plotting and Analysis Methods

    It may be very useful to construct a scatterplot of regressors $x_i$ v.s. $x_j$. This plot may be useful in studying the relationship between regressor variables and the disposition of the data in the x-space.

    Fig 4.9 is a scatterplot of $x_1$ v.s. $x_2$ for the delivery time data from Example 3.1.

    The problem situation often suggests other types of residual plots.

    Fig 4.10: plot of the residuals by site.


    4.3 The PRESS Statistic

    The PRESS residuals:

    $e_{(i)} = y_i - \hat{y}_{(i)}$

    The PRESS statistic:

    $PRESS = \sum_{i=1}^{n} [y_i - \hat{y}_{(i)}]^2 = \sum_{i=1}^{n} \left(\frac{e_i}{1 - h_{ii}}\right)^2$

    PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data.

    A small value of PRESS is desirable.


    Example 4.6 Delivery Time Data


    An $R^2$-like statistic for prediction (based on PRESS):

    $R^2_{Prediction} = 1 - \frac{PRESS}{SS_T}$

    In Example 3.1,

    $R^2_{Prediction} = 1 - \frac{PRESS}{SS_T} = 0.9209$

    We expect this model to explain about 92.09% of the variability in predicting new observations.

    Use PRESS to compare models: a model with a small PRESS is preferable to one with a large PRESS.
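    The PRESS statistic does not require n separate fits, because each PRESS residual equals $e_i / (1 - h_{ii})$. The sketch below computes PRESS and $R^2_{Prediction}$ this way for a simple linear regression and, as a check, recomputes PRESS by actually leaving each point out in turn. The function names and the data are made up for illustration and do not reproduce the 0.9209 value of the delivery time example.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def press_statistic(x, y):
    """PRESS and prediction R^2 from a single fit, using e_i / (1 - h_ii)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    press = 0.0
    for xi, yi in zip(x, y):
        e_i = yi - (b0 + b1 * xi)
        h_ii = 1 / n + (xi - xbar) ** 2 / sxx     # hat diagonal, simple regression
        press += (e_i / (1 - h_ii)) ** 2
    ss_t = sum((yi - ybar) ** 2 for yi in y)      # total sum of squares
    return press, 1 - press / ss_t

def press_brute_force(x, y):
    """PRESS obtained by refitting n times with one observation left out."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.1, 5.2, 5.8, 8.4, 8.9, 11.2, 11.6]
press, r2_prediction = press_statistic(x, y)
press_check = press_brute_force(x, y)
```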


    4.4 Detection and Treatment of Outliers

    An outlier is an extreme observation, one whose residual is considerably larger in absolute value than the others, say 3 or 4 standard deviations from the mean.

    Outliers are data points that are not typical of the rest of the data.

    Identifying outliers: residual plots against the fitted values, the normal probability plot, and examination of scaled residuals (studentized and R-student residuals).


    Bad values (the result of an unusual but explainable event) should be discarded.

    An outlier may also be an unusual but perfectly plausible observation.

    Outliers may control many key model properties and may also point out inadequacies in the model.

    Various statistical tests have been proposed for detecting and rejecting outliers.

    Identifying the outliers based on the maximum normed residual:

    $\max_i \frac{|e_i|}{\sqrt{\sum_{j=1}^{n} e_j^2}}$


    Outliers: keep or drop?

    The effect of outliers on the regression may be checked by dropping these points and refitting the regression equation.

    The t and F statistics, $R^2$, and the residual mean square may be very sensitive to the outliers.

    A situation in which a relatively small percentage of the data has a significant impact on the model may not be acceptable to the user of the regression equation.

    Generally we are happier about assuming that a regression equation is valid if it is not overly sensitive to a few observations.


    Example 4.7 The Rocket Propellant data

    Fig 4-11 and Fig 4-12

    From Fig 4-11, observations 5 and 6 appear to be outliers.


    Deleting observations 5 and 6 has almost no effect on the estimates of the regression coefficients.

    There is a dramatic reduction in $MS_{Res}$.

    There is a one-third reduction in the standard error of $\hat{\beta}_1$.

    Observations 5 and 6 are not overly influential.

    There is no particular reason for the unusually low propellant shear strengths obtained for observations 5 and 6.

    These two points should not be discarded.


    4.5 Lack of Fit of the Regression Model

    4.5.1 A Formal Test for Lack of Fit

    Assume normality, independence, and constant variance.

    Only the simple linear relationship is in doubt.

    See Fig 4.13

    Requirement: replicate observations on y for at least one level of x.

    True replication: run $n_i$ separate experiments at $x = x_i$.


    These replicated observations are used to obtain a model-independent estimate of $\sigma^2$.

    There are $n_i$ observations on the response at the ith level of the regressor $x_i$, $i = 1, 2, \ldots, m$.

    Let $y_{ij}$ be the jth observation on the response at $x_i$, $i = 1, 2, \ldots, m$ and $j = 1, \ldots, n_i$.

    Hence there are $n = n_1 + \cdots + n_m$ total observations.

    Partition $SS_{Res}$ into two components:

    $SS_{Res} = SS_{PE} + SS_{LOF}$

    that is,

    $\sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \hat{y}_i)^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{m} n_i (\bar{y}_i - \hat{y}_i)^2$


    The pure error sum of squares (a model-independent measure of pure error):

    $SS_{PE} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

    The degrees of freedom for pure error:

    $\sum_{i=1}^{m} (n_i - 1) = n - m$

    The sum of squares due to lack of fit, with m - 2 degrees of freedom:

    $SS_{LOF} = \sum_{i=1}^{m} n_i (\bar{y}_i - \hat{y}_i)^2$

    If the fitted values $\hat{y}_i$ are close to the corresponding average responses $\bar{y}_i$, then there is a strong indication that the regression function is linear.


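    The test statistic for lack of fit:

    $F_0 = \frac{SS_{LOF} / (m - 2)}{SS_{PE} / (n - m)} = \frac{MS_{LOF}}{MS_{PE}} \sim F_{m-2,\, n-m}$

    where the F distribution holds when the true regression function is linear.

    If $F_0 > F_{\alpha,\, m-2,\, n-m}$, we conclude that the regression function is not linear. The tentative model must then be abandoned and attempts made to find a more appropriate equation.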
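    The lack-of-fit computation can be sketched as follows for a simple linear regression with exact replicates. The code groups the responses by level of x, computes $SS_{PE}$, $SS_{LOF}$, and $F_0$, and leaves the comparison with the F table to the reader. The function name and the data are made up for illustration (they are not the data of Fig 4.13), and replicates are assumed to share exactly the same x value.

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Lack-of-fit sums of squares and F statistic for the regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar

    groups = defaultdict(list)            # responses collected by level of x
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    m = len(groups)                       # number of distinct levels of x

    ss_pe = 0.0
    ss_lof = 0.0
    for xi, ys in groups.items():
        ybar_i = sum(ys) / len(ys)                              # average response at x_i
        ss_pe += sum((yij - ybar_i) ** 2 for yij in ys)         # pure error
        ss_lof += len(ys) * (ybar_i - (b0 + b1 * xi)) ** 2      # lack of fit
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    f0 = (ss_lof / (m - 2)) / (ss_pe / (n - m))
    return {"ss_res": ss_res, "ss_pe": ss_pe, "ss_lof": ss_lof,
            "df_lof": m - 2, "df_pe": n - m, "f0": f0}

# Made-up data with two replicates at each of five levels; the trend is curved.
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y = [2.1, 1.9, 3.8, 4.2, 6.5, 6.1, 9.9, 10.3, 14.8, 15.2]
result = lack_of_fit(x, y)
```

    The value `result["f0"]` would then be compared with the upper-tail critical value of the F distribution with `df_lof` and `df_pe` degrees of freedom.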


    Even if the F-ratio for lack of fit is not significant and the hypothesis $H_0: \beta_1 = 0$ of the test for significance of regression is rejected, this still does not guarantee that the model will be satisfactory as a prediction equation.

    The model is unlikely to be a satisfactory predictor unless the variation of the predicted values is large relative to the random error.

    The work of Box and Wetz (1973) suggests that the observed F ratio must be at least four or five times the critical value from the F table if the regression model is to be useful as a predictor.


    A simple measure of potential prediction performance: compare the range of the fitted values, $\hat{y}_{max} - \hat{y}_{min}$, to their average standard error.

    The average variance of the fitted values:

    $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i) = \frac{p \sigma^2}{n}$

    Satisfactory predictor: the range of the fitted values ($\hat{y}_{max} - \hat{y}_{min}$) is large relative to their average estimated standard error $\sqrt{p \hat{\sigma}^2 / n}$, where $\hat{\sigma}^2$ is a model-independent estimate of the error variance.


    Example 4.8 Testing for Lack of Fit (Data in Fig 4.13)


    4.5.2 Estimation of Pure Error from Near-Neighbors

    In the above subsection, $SS_{Res} = SS_{PE} + SS_{LOF}$.

    $SS_{PE}$ is computed using responses at repeat observations at the same level of x. This yields a model-independent estimate of $\sigma^2$.

    Repeat observations on y at the same x mean that some of the rows of X are the same.

    Daniel and Wood (1980) and Joglekar et al. (1989) have investigated methods for obtaining a model-independent estimate of error when there are no exact repeat points.


    Near-neighbors: sets of observations that have been taken with nearly identical levels of $x_1, \ldots, x_k$. The observations $y_i$ from such near-neighbors can then be considered as repeat points and used to obtain an estimate of pure error.

    The weighted sum of squared distance (WSSD):

    $D_{ii'}^2 = \sum_{j=1}^{k} \left[\frac{\hat{\beta}_j (x_{ij} - x_{i'j})}{\sqrt{MS_{Res}}}\right]^2$

    Near-neighbors: a small value of $D_{ii'}^2$, that is, the points are relatively close together in x-space. If $D_{ii'}^2$ is large, the points are widely separated in x-space.


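    The estimate is based on the range of the residuals at the points i and i':

    $E_i = |e_i - e_{i'}|$

    For samples of size 2,

    $\hat{\sigma} = \frac{E}{1.128} = (1.128)^{-1} E = 0.886 E$

    Algorithm:

    1. Arrange the data points $x_{i1}, \ldots, x_{ik}$ in order of increasing fitted value $\hat{y}_i$.

    2. Compute the values $D_{ii'}^2$ for all pairs of points that are adjacent in this ordering or separated by one, two, or three intermediate points. In total there are 4n - 10 values.

    3. Arrange the 4n - 10 values of $D_{ii'}^2$ in ascending order.

    4. Let $E_u$ be the range of the residuals at the pair of points with the uth smallest value of $D_{ii'}^2$.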


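    Then the estimate of the standard deviation of pure error, based on the m smallest values of $D_{ii'}^2$, is

    $\hat{\sigma} = \frac{0.886}{m} \sum_{u=1}^{m} E_u$

    Example 4.9 The Delivery Time Data

    Table 4.3

    Use the 15 smallest values of $D_{ii'}^2$.

    In this case, $\hat{\sigma} = 1.969$.

    If there is no appreciable lack of fit, we would expect $\hat{\sigma} \approx \sqrt{MS_{Res}}$.

    Here $\sqrt{MS_{Res}} = 3.259$.

    Since $\hat{\sigma}$ is much smaller than $\sqrt{MS_{Res}}$, there is some lack of fit.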
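    The near-neighbor algorithm can be sketched as below. The function takes quantities from a model that has already been fitted (regressor rows, fitted values, residuals, slope estimates, and $MS_{Res}$) and returns $\hat{\sigma}$ together with the number of $D_{ii'}^2$ values formed. The pairing scheme (points adjacent in the ordering by $\hat{y}_i$, or one, two, or three points apart) is the one that yields the 4n - 10 count. The function name, argument names, and the numbers are made up for illustration; they are not taken from Table 4.3 and do not come from an actual fit.

```python
def near_neighbor_sigma(X, yhat, e, beta, ms_res, m):
    """Near-neighbor estimate of the pure-error standard deviation.

    X      : list of rows, each holding the k regressor values of one point
    yhat   : fitted values; e : residuals
    beta   : slope estimates for the k regressors (no intercept)
    ms_res : residual mean square of the fitted model
    m      : number of smallest D^2 values to use
    """
    n = len(yhat)
    order = sorted(range(n), key=lambda i: yhat[i])   # indices by increasing fitted value
    pairs = []
    for gap in (1, 2, 3, 4):          # adjacent, then 1, 2, 3 intermediate points
        for pos in range(n - gap):
            i, ip = order[pos], order[pos + gap]
            d2 = sum((b * (X[i][j] - X[ip][j])) ** 2
                     for j, b in enumerate(beta)) / ms_res    # weighted squared distance
            pairs.append((d2, abs(e[i] - e[ip])))             # (D^2, range E)
    pairs.sort(key=lambda pair: pair[0])                      # ascending D^2
    sigma_hat = 0.886 / m * sum(E for _, E in pairs[:m])
    return sigma_hat, len(pairs)

# Made-up inputs for illustration only (one regressor, six points).
X = [[1.0], [1.01], [5.0], [9.0], [13.0], [17.0]]
beta = [2.0]
ms_res = 1.0
yhat = [2.0, 2.02, 10.0, 18.0, 26.0, 34.0]
e = [0.5, -0.5, 0.1, 0.2, -0.1, 0.0]
sigma_hat, n_pairs = near_neighbor_sigma(X, yhat, e, beta, ms_res, m=3)
```

    The resulting `sigma_hat` would be compared with $\sqrt{MS_{Res}}$ in the same way as in Example 4.9.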