When evaluating the fit of a statistical model, researchers often rely on the coefficient of determination, R-squared, to quantify the proportion of variance in the outcome explained by the predictors. Yet in many modeling contexts, particularly non-linear models, models for categorical outcomes, and models estimated by maximum likelihood, the standard R-squared formula loses its variance-explained interpretation or has no direct equivalent. Pseudo R-squared statistics fill this gap by offering a familiar-looking goodness-of-fit measure where the traditional R-squared does not apply.
Understanding the Limitations of Standard R-Squared
R-squared is defined as the ratio of the explained sum of squares to the total sum of squares. In an ordinary least squares (OLS) regression with an intercept, this is the same as one minus the ratio of the residual sum of squares to the total sum of squares, because the total variation splits exactly into an explained part and a residual part. In models such as logistic regression, probit regression, or Poisson regression, the dependent variable is binary, categorical, or a count, and the model is estimated by maximum likelihood rather than by minimizing squared residuals. Because the variation in the outcome is not partitioned in the same way as in OLS, the traditional R-squared has no direct counterpart with the same interpretation, leaving a gap in the assessment of model performance that researchers are eager to fill.
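As a reference point for what the pseudo measures try to imitate, the following is a minimal sketch of the OLS calculation. The toy data and the helper name `r_squared` are illustrative assumptions, not part of the original discussion.

```python
import numpy as np

def r_squared(y, y_hat):
    """Classical R-squared: 1 - (residual sum of squares / total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative toy data (assumed for this example only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line by least squares; polyfit returns the slope first.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r2 = r_squared(y, y_hat)

# For OLS with an intercept, explained / total gives the same number.
ss_explained = np.sum((y_hat - y.mean()) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r2_from_explained = ss_explained / ss_total

print(round(r2, 4), round(r2_from_explained, 4))
```

The two printed values agree only because the line was fitted by least squares with an intercept. That equality is exactly what is lost once a model is estimated by maximum likelihood on a non-continuous outcome.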
Defining Pseudo R-Squared
Pseudo R-squared is a family of statistics designed to mimic the interpretability of R-squared for models that do not satisfy the assumptions of linear regression. While there is no single "official" definition, most pseudo R-squared measures compare the likelihood of the fitted model against the likelihood of a null model that contains no predictors (typically an intercept only). The null model establishes a baseline fit, and the statistic summarizes how much the fitted model improves on that baseline. The resulting values usually fall between 0 and 1, although some variants cannot reach 1 and adjusted versions can fall below 0. It is crucial to remember that these are not true R-squared values but analogs that serve a similar communicative purpose, and they should not be read as a proportion of variance explained.
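The null-versus-fitted comparison can be sketched for a binary outcome as follows. This example uses the likelihood ratio form usually attributed to McFadden, one minus the ratio of the fitted to the null log-likelihood. The predicted probabilities are made-up numbers standing in for the output of some fitted model, and the function names are hypothetical.

```python
import numpy as np

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_r2(y, p_fitted):
    """1 - ll_fitted / ll_null, with an intercept-only null model."""
    y = np.asarray(y, dtype=float)
    ll_fitted = bernoulli_log_likelihood(y, p_fitted)
    # The intercept-only model predicts the sample mean for every case.
    # Assumes y contains both 0s and 1s, so ll_null is not (near) zero.
    p_null = np.full_like(y, y.mean())
    ll_null = bernoulli_log_likelihood(y, p_null)
    return 1.0 - ll_fitted / ll_null

# Observed outcomes and assumed (illustrative) fitted probabilities.
y_obs = np.array([0, 0, 1, 1, 0, 1, 1, 0])
p_hat = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.6, 0.4])

pseudo_r2 = mcfadden_r2(y_obs, p_hat)
print(round(pseudo_r2, 3))
```

A model that predicts no better than the sample mean scores 0, and the value approaches 1 only as the fitted probabilities approach the observed outcomes.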
Common Calculation Methods
Several formulas are widely used to calculate pseudo R-squared, each with its own theoretical foundation and interpretation. The most prevalent include the likelihood ratio-based pseudo R-squared (often called McFadden's R-squared), which is one minus the ratio of the fitted model's log-likelihood to the null model's log-likelihood; the Cox and Snell R-squared, which is built on the ratio of the null to the fitted likelihood raised to the power 2/n and which, for discrete outcomes, has an upper bound below 1; and the Nagelkerke R-squared, which divides the Cox and Snell value by that upper bound so that the maximum is 1. Other variants, such as the count R-squared, ignore likelihoods altogether and report the proportion of cases classified correctly.
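The likelihood-based variants differ only in how they transform the same two log-likelihoods, which the following sketch tries to make concrete. The function names, the sample size, the log-likelihood values, and the 0.5 classification cutoff are all assumptions chosen for illustration.

```python
import math

def cox_snell_r2(ll_null, ll_fitted, n):
    """1 - (L_null / L_fitted)^(2/n), written in terms of log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_fitted) / n)

def nagelkerke_r2(ll_null, ll_fitted, n):
    """Cox and Snell divided by its upper bound, 1 - L_null^(2/n)."""
    upper_bound = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_fitted, n) / upper_bound

def count_r2(y, p, cutoff=0.5):
    """Share of cases whose predicted class matches the observed class."""
    hits = sum(1 for yi, pi in zip(y, p) if (pi >= cutoff) == (yi == 1))
    return hits / len(y)

# Illustrative inputs: 8 binary cases, half of them 1s, so the
# intercept-only model assigns probability 0.5 to every case.
n = 8
ll_null = n * math.log(0.5)
ll_fitted = -2.392  # assumed log-likelihood of some fitted model

cs = cox_snell_r2(ll_null, ll_fitted, n)
nk = nagelkerke_r2(ll_null, ll_fitted, n)

# Even a perfect fit (log-likelihood of 0) leaves Cox and Snell below 1.
cs_perfect = cox_snell_r2(ll_null, 0.0, n)

y_obs = [0, 0, 1, 1, 0, 1, 1, 0]
p_hat = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.6, 0.4]
accuracy = count_r2(y_obs, p_hat)

print(round(cs, 3), round(nk, 3), round(cs_perfect, 3), accuracy)
```

Because the variants can give noticeably different values for the same fitted model, it helps to state which one is being reported.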