Differences Between Statistical Software Packages
(SAS, SPSS, and MINITAB) As Applied to a Binary Response Variable

Ibrahim Hassan Ibrahim
Assoc. Prof. of Statistics, Dept. of Statistics & Mathematics, Faculty of Commerce, Tanta University

"I think that, in general, software houses need to provide clearer, more detailed, and especially more specific descriptions of what their calculations are. It is true that software developers are entitled to feel that they should not have to write textbooks. But it is also true that computing usage is getting easier, cheaper, faster, and more widespread, with statistical novitiates making more and more use of complicated procedures. Anything we can all do to guard against ridiculous use of these procedures has got to be worthwhile." (Searle, S. R., 1994)

1. INTRODUCTION AND REVIEW OF LITERATURE

Several writers have recently reviewed statistical software for microcomputers and offered very useful comments to both users and vendors. Some of these reviews are comprehensive and general (Searle, S. R., 1989), while others analyze specific program features and identify problem areas. For example, Gerard E. Dallal (1992) published a very concise paper in The American Statistician titled "The computer analysis of factorial experiments with nested factors". Dallal used two different computing packages, SAS and SPSS, to analyze unbalanced data from fixed models with nested factors. He found differences between the SAS and SPSS results, as well as some errors in the calculation of sums of squares in the SPSS output. Following Dallal's paper, several commentaries were sent to the editors of The American Statistician attempting to explain the discrepancies between the SAS and SPSS results. The controversy over Dallal's paper was ended by Searle, S. R. (1994), who presented a theoretical clarification of what could be the basic cause of the differences and erroneous results.
Searle ended his paper not with a conclusion but with a plea to all software houses, asking them to provide clearer, more detailed, and more specific descriptions of their calculations.

Okunade, A., and others (1993) compared the output of summary statistics of regression analysis in commonly used statistical and econometric packages such as SAS, SPSS, SHAZAM, TSP, and BMDP. Oster, R. A. (1998) reviewed five statistical software packages (EPI INFO, EPICURE, EPILOG PLUS, STATA, and TRUE EPISTAT) according to criteria that are of most interest to epidemiologists, biostatisticians, and others involved in clinical research. McCullough, B. D. (1998) proposed testing the accuracy of statistical software packages using Wilkinson's Statistics Quiz in three areas: linear and nonlinear estimation, random number generation, and statistical distributions. McCullough, B. D. (1999) then applied his methodology to the statistical packages SAS, SPSS, and S-Plus. He concluded that the reliability of statistical software cannot be taken for granted, because he found weak points in all of the random number generators, in the S-Plus correlation procedures, and in the one-way ANOVA and nonlinear least squares routines of SAS and SPSS. Zhou, X., and others (1999) reviewed five software packages that can fit a generalized linear mixed model for data with more than a two-level structure and multiple independent variables. These five packages are MLn, MLwiN, SAS Proc Mixed, HLM, and VARCL. The comparison between these packages was based on features such as data input and management, statistical model capabilities, output, user friendliness, and documentation. Bergmann, R., and others (2000) compared 11 statistical packages on a real dataset: SigmaStat 2.03, SYSTAT 9, JMP 3.2.5, S-Plus 2000, STATISTICA 5.5, UNISTAT 4.53b, SPSS 8, Arcus Quickstat 1.2, Stata 6, SAS 6.12, and StatXact 4.
They found that different packages could give very different outcomes for the Wilcoxon-Mann-Whitney test.

The purpose of this paper is to compare three statistical software packages when applied to a binary dependent variable. These packages are SAS (Statistical Analysis System), SPSS (Statistical Package for the Social Sciences, or Superior Performing Statistical Software as the SPSS company now claims), and MINITAB. The three packages were chosen because they are well known and among the most frequently used by statisticians and by others for commercial applications or scientific research. A real dataset from the field of medical treatments is used to test whether there is a significant difference between two alternative drugs, a test drug and a reference drug, in plasma levels of ciprofloxacin at different times. The binary response variable is "Drug", which is zero for the test drug and one for the reference drug, and the times 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 6.0, and 8.0 are the predictor variables.

2. STATISTICAL TREATMENT OF A BINARY RESPONSE VARIABLE

In many areas of social science research, one encounters dependent variables that assume one of two possible values, such as the presence or absence of a particular disease; a patient may respond or not respond to a treatment during a period of time. Binary response analysis models the relationship between a binary response variable and one or more explanatory variables. For a binary response variable Y, it assumes:

g(p) = β′x … (1)

where p is Prob(Y = y1) for y1 as one of the two ordered levels of Y, β is the parameter vector, x is the vector of explanatory variables, and g is a function through which p is assumed to be linearly related to the explanatory variables. The binary response model shares a common feature with a more general class of linear models: a function g = g(μ) of the mean of the dependent variable is assumed to be linearly related to the explanatory variables.
The function g(·), often referred to as the link function, provides the link between the random (stochastic) component and the systematic (deterministic) component of the response variable. To assess the relationship between one or more predictor variables and a categorical response variable, the following techniques are often employed:

(i) Logistic regression
(ii) Probit regression
(iii) Complementary log-log

2.1 Logistic regression

Logistic regression examines the relationship between one or more predictor variables and a binary response. The logistic equation can be used to examine how the probability of an event changes as the predictor variables change. Both logistic regression and least squares regression investigate the relationship between a response variable and one or more predictors. A practical difference between them is that logistic regression techniques are used with categorical response variables, whereas linear regression techniques are used with continuous response variables. Both logistic and least squares regression methods estimate the parameters of the model so that the fit of the model is optimized. Least squares regression minimizes the sum of squared errors to obtain parameter estimates, whereas logistic regression obtains maximum likelihood estimates of the parameters using an iteratively reweighted least squares algorithm (McCullagh, P., and Nelder, J. A., 1992). For a binary response variable Y, the logistic regression model has the form:

Logit(p) = loge[ p/(1−p) ] = β′x … (2)

or equivalently,

p = exp(β′x) / [ 1 + exp(β′x) ] … (3)

Logistic regression models the logit transformation of the ith observation's event probability, pi, as a linear function of the explanatory variables in the vector xi. The logistic regression model uses the logit as the link function.

2.2 Probit regression

Probit regression can be employed as an alternative to logistic regression in binary response models.
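Before turning to the probit formulas, the logit transformation of equations (2) and (3) in section 2.1 can be illustrated with a minimal Python sketch (not taken from the paper; the function names are ours):

```python
import math

def logit(p):
    """Logit link of equation (2): the log-odds of probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit of equation (3): maps a linear predictor back to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# The two functions are inverses of each other: the round trip recovers p.
p = 0.8
eta = logit(p)                    # log(0.8 / 0.2) = log(4) ≈ 1.386
print(round(inv_logit(eta), 6))   # → 0.8
```

Because `inv_logit` always returns a value strictly between 0 and 1, any linear predictor β′x yields a valid probability.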
For a binary response variable Y, the probit regression model has the form:

Φ−1(p) = β′x … (4)

or equivalently,

p = Φ(β′x) … (5)

where Φ−1 is the inverse of the cumulative standard normal distribution function, often referred to as the probit or normit, and Φ is the cumulative standard normal distribution function. The probit regression model can also be viewed as a special case of the generalized linear model whose link function is the probit.

2.3 Complementary log-log

The complementary log-log transformation is the inverse of the cumulative distribution function, F−1(p). Like the logit and probit models, the complementary log-log transformation ensures that predicted probabilities lie in the interval [0,1]. If the probability of success is expressed as a function of unknown parameters, i.e.,

pi = 1 − exp{ −exp( Σk βk xik ) } … (6)

then the model is linear in the inverse of the cumulative distribution function, which is the log of the negative log of the complement of pi, or log{−log(1−pi)}, where

log{−log(1−pi)} = Σk βk xik … (7)

In general, there are three link functions that can be used to fit a broad class of binary response models: (i) the logit, the inverse of the cumulative logistic distribution function; (ii) the normit (also called probit), the inverse of the cumulative standard normal distribution function; and (iii) the gompit (also called complementary log-log), the inverse of the Gompertz distribution function. The link functions and their corresponding distributions are summarized in Table-1:

TABLE-1
The Link Functions

Name                            Link Function                    Distribution   Mean                    Variance
Logit                           g(pi) = loge{ pi/(1−pi) }        Logistic       0                       π²/3
Normit (probit)                 g(pi) = Φ−1(pi)                  Normal         0                       1
Gompit (complementary log-log)  g(pi) = loge{ −loge(1−pi) }      Gompertz       −γ (Euler's constant)   π²/6

We can choose a link function that results in a good fit to our data. Goodness-of-fit statistics can be used to compare fits obtained with different link functions.
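The three inverse link functions of Table-1 can be compared side by side in a short Python sketch (our own illustration, using only the standard library; the normal CDF is computed from the error function):

```python
import math

def inv_logit(eta):
    """Inverse logit: cumulative logistic distribution function."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def inv_normit(eta):
    """Inverse normit (probit): standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_gompit(eta):
    """Inverse gompit (complementary log-log), as in equation (6)."""
    return 1.0 - math.exp(-math.exp(eta))

# The logit and normit are symmetric about eta = 0 (both give p = 0.5 there),
# while the gompit is asymmetric: at eta = 0 it gives p = 1 - e^(-1) ≈ 0.632.
for eta in (-2.0, 0.0, 2.0):
    print(eta, round(inv_logit(eta), 3), round(inv_normit(eta), 3), round(inv_gompit(eta), 3))
```

The asymmetry of the gompit is the practical reason for trying it when the probability of an event approaches 0 and 1 at different rates; goodness-of-fit statistics then decide between the three links.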
An advantage of the logit link function is that it provides an estimate of the odds ratios: exponentiating a fitted coefficient gives the multiplicative change in the odds of the event for a one-unit increase in the corresponding predictor.
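This property of the logit link can be verified numerically; in the sketch below (our own illustration, with a hypothetical coefficient value), increasing a predictor by one unit multiplies the odds by exactly exp(β):

```python
import math

beta = 0.7        # hypothetical fitted slope for one predictor
intercept = -1.0  # hypothetical fitted intercept

def odds(eta):
    """Odds p/(1-p) implied by a linear predictor eta under the logit link."""
    p = math.exp(eta) / (1.0 + math.exp(eta))
    return p / (1.0 - p)

odds_ratio = math.exp(beta)  # the odds ratio reported for this predictor

# Compare the odds at x and at x + 1: their ratio equals exp(beta).
x = 2.0
ratio = odds(intercept + beta * (x + 1.0)) / odds(intercept + beta * x)
print(round(ratio, 6) == round(odds_ratio, 6))  # → True
```

No comparable closed-form interpretation exists for the probit or gompit coefficients, which is why the logit link is often preferred when odds ratios are the quantity of interest.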