If you include a lot of predictor variables in a regression and want to evaluate the significance of many of them, you should consider using a global \(F\)-test. Here is a simple example that illustrates why this test is important.
We start by picking the number of observations (\(N\)) and the number of covariates in our model, \(p\). We then generate \(p\) independent covariates. R's lm function adds the column of 1s for the intercept to the design matrix automatically, so the fitted model has \(p+1\) coefficients.
N <- 1000
p <- 100
x <- matrix(rnorm(N*p), nrow=N)
x <- data.frame(x)
colNames <- paste0("x", 1:p)
colnames(x) <- colNames
Now we will generate our \(y\)s completely independently of all of our covariates. None of our \(x\) variables are associated with our outcome!
y <- rnorm(N)
But if we fit a linear model that assumes that there ARE relationships between our outcome and all of our \(x\) variables, do we see any individually significant \(\beta\)s? If so, how many are significant, and are these indicative of real associations? Let’s start by constructing a linear model formula that includes each of our \(x\) variables. We have suppressed the printing of this formula in this write-up, but if you run the two commands below, you will see the formula that is created.
fmla <- formula(paste0("y ~ ", paste(colNames, collapse="+")))
fmla
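As a side note, base R also offers reformulate() for building a formula from a character vector of term labels. The sketch below uses a hypothetical three-variable example (demo_names and the two demo formulas are illustrative names, not part of the analysis above) to show that both approaches give the same formula.

```r
# Hypothetical small example: three covariate names instead of 100.
demo_names <- paste0("x", 1:3)

# Approach used in this write-up: paste the terms together, then convert.
fmla_paste <- formula(paste0("y ~ ", paste(demo_names, collapse = "+")))

# Alternative: let reformulate() assemble the right-hand side.
fmla_reform <- reformulate(demo_names, response = "y")

# Both print as the same formula.
fmla_paste
fmla_reform
```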
Now we fit the model and evaluate how many of the individual \(\beta\) coefficients are significant at the \(\alpha=0.05\) level.
mlr1 <- lm(fmla, data=x)
coefs <- summary(mlr1)$coef
sum(coefs[,"Pr(>|t|)"]<.05)
## [1] 5
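To put this count in context, here is a rough back-of-the-envelope calculation. It assumes the 100 covariate tests behave like independent tests of true null hypotheses, which is only approximately the case here; the variable names are illustrative.

```r
alpha   <- 0.05   # significance level used for each t-test
n_tests <- 100    # number of covariates being tested

# Expected number of false positives when every null hypothesis is true.
expected_false_pos <- alpha * n_tests

# Probability that at least one test is significant purely by chance,
# under the simplifying assumption that the tests are independent.
p_at_least_one <- 1 - (1 - alpha)^n_tests

expected_false_pos
p_at_least_one
```

Under these assumptions we expect about 5 "significant" coefficients even though none of the covariates is related to the outcome, and seeing at least one is nearly certain.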
Alternatively, we could use a Global \(F\)-test to test whether any of the \(x\) variables add significant explanatory power to our model. We do this by fitting a “null” model that just includes an intercept. What conclusion do we draw from this test?
mlr0 <- lm(y ~ 1, data=x)
anova(mlr0, mlr1)
## Analysis of Variance Table
##
## Model 1: y ~ 1
## Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
## x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
## x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 +
## x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 +
## x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 +
## x52 + x53 + x54 + x55 + x56 + x57 + x58 + x59 + x60 + x61 +
## x62 + x63 + x64 + x65 + x66 + x67 + x68 + x69 + x70 + x71 +
## x72 + x73 + x74 + x75 + x76 + x77 + x78 + x79 + x80 + x81 +
## x82 + x83 + x84 + x85 + x86 + x87 + x88 + x89 + x90 + x91 +
## x92 + x93 + x94 + x95 + x96 + x97 + x98 + x99 + x100
##   Res.Df    RSS  Df Sum of Sq      F Pr(>F)
## 1    999 984.43
## 2    899 891.16 100     93.27 0.9409 0.6425
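The \(F\) statistic in this table can be reproduced by hand from the two residual sums of squares, using \(F = \frac{(RSS_0 - RSS_1)/(df_0 - df_1)}{RSS_1/df_1}\). The sketch below plugs in the rounded values printed above, so its results should match the table only up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table above (rounded as printed).
rss_null <- 984.43; df_null <- 999   # intercept-only model
rss_full <- 891.16; df_full <- 899   # model with all 100 covariates

# Numerator degrees of freedom: number of coefficients being tested.
df_diff <- df_null - df_full

# Drop in RSS per tested coefficient, relative to the residual variance
# estimate from the full model.
f_stat <- ((rss_null - rss_full) / df_diff) / (rss_full / df_full)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

The same overall test is also reported at the bottom of summary(mlr1), and its components are stored in summary(mlr1)$fstatistic.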
Do the results about the significance of the model coefficients from the Global \(F\)-test and the individual \(\beta\) \(t\)-tests agree? Why or why not?
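One way to explore this question empirically is to repeat the whole experiment many times and record how often each approach signals an association. The sketch below is a minimal simulation under assumptions of our own: it uses smaller, hypothetical settings (n_obs = 200, n_cov = 50, n_sim = 200) so that it runs quickly, and the helper one_run is illustrative rather than part of the analysis above.

```r
set.seed(1)    # for reproducibility
n_sim <- 200   # number of simulated data sets (assumed)
n_obs <- 200   # observations per data set (assumed, smaller than N above)
n_cov <- 50    # covariates per data set (assumed, smaller than p above)

one_run <- function() {
  # Covariates and outcome are generated independently, as in the example.
  sim <- data.frame(matrix(rnorm(n_obs * n_cov), nrow = n_obs))
  sim$y <- rnorm(n_obs)
  fit <- summary(lm(y ~ ., data = sim))

  # Individual t-test p-values, dropping the intercept row.
  t_pvals <- fit$coefficients[-1, "Pr(>|t|)"]

  # Global F-test p-value from the statistic stored in the summary.
  f <- fit$fstatistic
  f_pval <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

  c(any_t = any(t_pvals < 0.05), global_f = unname(f_pval < 0.05))
}

res <- replicate(n_sim, one_run())

# Share of simulated data sets in which each approach "finds" something.
rates <- rowMeans(res)
rates
```

In runs like this, at least one individual \(t\)-test is typically significant in the large majority of data sets, while the global \(F\)-test rejects at a rate close to the nominal \(\alpha\).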