Global F tests and Family-Wise Error Rates

Code for Biostatistics Methods 2, UMass-Amherst, Spring 2016

by Nicholas Reich

If you include many predictor variables in a regression and want to evaluate the significance of many of them, you should consider using a global F test. Here is a simple example that illustrates why this test is important.

We start by picking the number of observations (\(N\)) and the number of covariates, \(p\). We then generate \(p\) independent standard normal covariates. We do not add a column of 1s ourselves: lm() includes an intercept by default, so the fitted model has \(p+1\) coefficients.

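N <- 1000                          # number of observations
p <- 100                           # number of covariates
x <- matrix(rnorm(N*p), nrow=N)    # N x p matrix of independent N(0,1) draws
x <- data.frame(x)
colNames <- paste0("x", 1:p)       # x1, x2, ..., x100
colnames(x) <- colNames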

Now we will generate our \(y\)s completely independently of all of our covariates. None of our \(x\) variables are associated with our outcome!

y <- rnorm(N)

But if we fit a linear model that assumes that there ARE relationships between our outcome and all of our \(x\) variables, do we see any individually significant \(\beta\)s? If so, how many are significant, and are these indicative of real associations? Let’s start by constructing a linear model formula that includes each of our \(x\) variables. We have suppressed the printing of this formula in this write-up, but if you run the two commands below, you will see the formula that is created.

fmla <- formula(paste0("y ~ ", paste(colNames, collapse="+")))
fmla
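
As a side note, the same formula can be built without pasting the "+" signs by hand. The sketch below uses reformulate() from the stats package and compares the variables in the two formulas; the names fmla_paste and fmla_alt are introduced here only for illustration and are not part of the original code.

```r
# Same covariate names as above, repeated so this block runs on its own
colNames <- paste0("x", 1:100)

# The paste-based construction used in this write-up
fmla_paste <- formula(paste0("y ~ ", paste(colNames, collapse = "+")))

# Alternative: reformulate() joins the term labels for us
fmla_alt <- reformulate(colNames, response = "y")

# Both formulas should involve the same 101 variables: y plus x1, ..., x100
length(all.vars(fmla_alt))
identical(all.vars(fmla_paste), all.vars(fmla_alt))
```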

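Now we fit the model and count how many of the individual \(\beta\) coefficients are significant at the \(\alpha=0.05\) level. Note that the coefficient table, and therefore this count, includes the intercept. Because no random seed is set, the count will differ from run to run.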

mlr1 <- lm(fmla, data=x)
coefs <- summary(mlr1)$coef
sum(coefs[,"Pr(>|t|)"]<.05)
## [1] 5
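
To put this count in context, here is a rough back-of-the-envelope calculation of how many false positives to expect and how likely it is to see at least one (the family-wise error rate). It assumes that every null hypothesis is true and that the 100 covariate \(t\)-tests are independent. Independence holds only approximately for \(t\)-tests from a single regression, so treat the result as a guide rather than an exact value. The variable names are chosen here for illustration.

```r
alpha   <- 0.05
n_tests <- 100   # one t-test per covariate; the intercept is ignored here

# Expected number of false positives when every null hypothesis is true
expected_fp <- n_tests * alpha

# Family-wise error rate: P(at least one false positive),
# ASSUMING the tests are independent
fwer <- 1 - (1 - alpha)^n_tests

expected_fp   # 5
fwer          # about 0.994
```

Under these assumptions, about five "significant" coefficients are expected even though no covariate is related to the outcome.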

Alternatively, we could use a Global \(F\)-test to test whether the \(x\) variables, taken together, add significant explanatory power to our model. We do this by fitting a “null” model that includes only an intercept and comparing it to the full model with anova(). What conclusion do we draw from this test?

mlr0 <- lm(y ~ 1, data=x)
anova(mlr0, mlr1)
## Analysis of Variance Table
## 
## Model 1: y ~ 1
## Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
##     x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + 
##     x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + 
##     x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + 
##     x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 + 
##     x52 + x53 + x54 + x55 + x56 + x57 + x58 + x59 + x60 + x61 + 
##     x62 + x63 + x64 + x65 + x66 + x67 + x68 + x69 + x70 + x71 + 
##     x72 + x73 + x74 + x75 + x76 + x77 + x78 + x79 + x80 + x81 + 
##     x82 + x83 + x84 + x85 + x86 + x87 + x88 + x89 + x90 + x91 + 
##     x92 + x93 + x94 + x95 + x96 + x97 + x98 + x99 + x100
##   Res.Df    RSS  Df Sum of Sq      F Pr(>F)
## 1    999 984.43                            
## 2    899 891.16 100     93.27 0.9409 0.6425
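
To see where the \(F\) statistic in this table comes from, the sketch below recomputes it by hand from the printed (rounded) values, using \(F = \frac{(RSS_0 - RSS_1)/(df_0 - df_1)}{RSS_1/df_1}\), where the subscript 0 denotes the null model and 1 the full model. The variable names are illustrative, and because the inputs are rounded, the results should match the table only approximately.

```r
# Values copied from the ANOVA table above
rss_null <- 984.43   # residual sum of squares, intercept-only model
rss_full <- 891.16   # residual sum of squares, full model
df_extra <- 100      # number of coefficients added by the full model
df_resid <- 899      # residual degrees of freedom of the full model

# Explained variation per added coefficient, relative to residual variance
f_stat <- ((rss_null - rss_full) / df_extra) / (rss_full / df_resid)

# Upper-tail probability under the F(100, 899) distribution
p_val <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)

f_stat
p_val
```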

Do the results about the significance of the model coefficients from the Global \(F\)-test and the individual \(\beta\) \(t\)-tests agree? Why or why not?
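
One way to explore this question empirically is to repeat the experiment many times and record how often each approach produces a false alarm. The sketch below does this with a smaller data set (200 observations, 20 covariates) so that it runs quickly. The sizes, the seed, the number of replications, and all names are assumptions made for illustration and are not part of the original analysis.

```r
set.seed(1)      # arbitrary seed, used only to make this sketch reproducible
n_obs  <- 200    # observations per simulated data set (smaller than N above)
n_cov  <- 20     # covariates per simulated data set (smaller than p above)
n_reps <- 500    # number of simulated data sets
alpha  <- 0.05

one_rep <- function() {
  # Covariates and outcome are generated independently, so every null is true
  sim <- data.frame(matrix(rnorm(n_obs * n_cov), nrow = n_obs))
  sim$y <- rnorm(n_obs)
  s <- summary(lm(y ~ ., data = sim))

  # p-values of the individual t-tests, dropping the intercept row
  t_pvals <- s$coefficients[-1, "Pr(>|t|)"]

  # p-value of the global F test reported by summary()
  f <- s$fstatistic
  f_pval <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

  c(any_t = any(t_pvals < alpha), global_f = unname(f_pval < alpha))
}

res <- replicate(n_reps, one_rep())

# Proportion of data sets with at least one significant t-test,
# and proportion in which the global F test rejects
rates <- rowMeans(res)
rates
```

If the reasoning above is right, the first proportion should be far larger than \(\alpha\), while the second should stay close to \(\alpha\).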