Homework

Homework assignments will be posted here, generally organized by due date. Unless otherwise specified, the parts of each assignment that must be handed in should be submitted via your personal Google Drive folder, which only you and the instructor can access.

Collaboration on homework is expected and encouraged, although you must write up your own assignment. No copying or cutting and pasting.

Due Wednesday 4/8/2015, 5pm

(30 pts) Lab 5.

Due Thursday 3/26/2015, 11am

(20 pts) Lab 4.

Due Thursday 3/5/2015, 11am

(30 pts) Conduct a simulation study that evaluates the inferential properties of least squares estimates, using a simple or multiple linear regression data generation model. Your final write-up should define a linear regression data generation model (i.e. write down the formula for the model, and define all parameters). You should explicitly choose one or two parameters that you will systematically vary and the quantitative metric that you will use to evaluate the estimates. (See examples below.) Your write-up should state the hypothesis that you had before running the simulation, present results from the simulation study, and evaluate whether your hypothesis was correct or not. You do not need to run a formal hypothesis test; just evaluate quantitatively and/or qualitatively how the performance varied across the parameterizations that you simulated. You should simulate from at least 10 different parameter sets (and probably at least 25 if you have two parameters). Your final write-up should not exceed 3 pages and should include 1 or 2 tables and/or figures showing the results of your simulation study that clearly capture the key trends you observed. You should be prepared to discuss your study design in class on Tuesday. Here are some examples of possible topics:
- Examine the impact of the number of covariates on the MSE (or 95% confidence interval coverage) of a regression coefficient for one predictor variable.
- In an SLR setting, evaluate the degree to which the MSE or confidence interval coverage is impacted by non-constant variance of the residuals.
- Show how the power to detect a non-zero regression coefficient changes as a function of the sample size and/or the residual variance.
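- Examine the average bias or MSE in estimating a regression coefficient if the residuals are drawn from a symmetric distribution centered at zero, but with increasingly heavy tails (e.g. a Student’s t distribution with decreasing degrees of freedom, down to the Cauchy distribution at 1 degree of freedom, where the mean and variance are undefined).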
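
As one possible starting point, the structure of such a study can be sketched in R as below. This is only an illustration under assumed settings, not a required design: the true coefficients, the residual standard deviation, the sample sizes, the number of replicates, and the helper name covers_slope() are all hypothetical choices, and the sketch varies only four parameter sets where the assignment asks for at least 10.

```r
# Sketch: 95% confidence interval coverage for the slope in an SLR model,
# y = beta0 + beta1 * x + e, with e ~ Normal(0, sigma^2), as n varies.
# All numeric settings below are illustrative assumptions.
set.seed(1)                      # fixed only so this sketch is reproducible
beta0 <- 1
beta1 <- 2
sigma <- 3
n_values <- c(10, 25, 50, 100)   # the parameter being varied
n_sims <- 500                    # simulated datasets per sample size

# Simulate one dataset of size n and report whether the 95% CI
# for the slope contains the true value beta1.
covers_slope <- function(n) {
  x <- runif(n, 0, 10)
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  ci <- confint(lm(y ~ x), "x", level = 0.95)
  ci[1, 1] <= beta1 && beta1 <= ci[1, 2]
}

# Estimated coverage = proportion of simulated intervals that cover beta1.
coverage <- sapply(n_values, function(n) mean(replicate(n_sims, covers_slope(n))))
results <- data.frame(n = n_values, coverage = coverage)
print(results)
```

Because the model assumptions hold exactly in this sketch, the estimated coverage should stay close to 0.95 at every sample size; a study becomes more informative once the data generation model departs from those assumptions.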

Due Thursday 2/26/2015, 11:30am

(30 pts) Lab 3.

Due Tuesday 2/17/2015, 5pm

(20 points) Revise and resubmit the report on the dataset that you handed in on 2/3/2015. The new report should include some of the information that you had written previously, should incorporate the improvements suggested for the earlier version, and should add 1 or 2 multiple linear regression models. If appropriate, fit a polynomial term to capture non-linear relationships or use dummy variables to model categorical predictors. Interpret some of the MLR model coefficients in the context of your particular dataset. The report should be less than 6 pages, including all figures, and should be submitted in both PDF and Rmd formats.
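
For reference, a polynomial term and a categorical predictor can be written in an lm() formula roughly as follows. This is a minimal sketch on simulated stand-in data; the variable names (age, group, outcome) and all coefficient values are hypothetical and should be replaced by the variables in your own dataset.

```r
# Simulated stand-in data; names and true coefficients are hypothetical.
set.seed(2)
n <- 200
dat <- data.frame(
  age = runif(n, 20, 70),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
dat$outcome <- 5 + 0.8 * dat$age - 0.01 * dat$age^2 +
  ifelse(dat$group == "b", 2, 0) + ifelse(dat$group == "c", -1, 0) +
  rnorm(n)

# I(age^2) adds a quadratic term. Because group is a factor, lm() expands it
# into dummy variables, using the first level ("a") as the reference category.
fit <- lm(outcome ~ age + I(age^2) + group, data = dat)
print(summary(fit)$coefficients)
```

In this parameterization, the groupb and groupc coefficients are the estimated differences in mean outcome relative to group "a", holding age fixed.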

Due Tuesday 2/10/2015, 5pm

(30 points) Complete Lab 2. Hand in a PDF and Rmd file via Google Drive. The final PDF file should be no more than 6 pages, including graphs.

Due Tuesday 2/3/2015, 5pm (PDF and Rmd files to be handed in via Google Drive)

(20 points) Create a short reproducible document (using knitr) that describes the basic structure of a dataset and summarizes some key features of the data using a few key tables and figures. Choose a dataset from these datasets, the ones in the class Google Drive, or some other dataset that interests you. Be sure to pick a dataset that has a continuous variable that you can use as an outcome variable in a linear regression model. Your write-up should address the following points:
- What is the background/context for this data?
- Data management: How many observations are there? Is the data tidy? What is the unit of observation?
- Data validation: Is there any missing data? If so, are there patterns to the missingness? Are there any obvious outliers in the data?
- Choose 4 to 10 key variables from your dataset (including the outcome variable). Include a codebook-style table that lists, for each chosen variable, its name, definition, type (e.g. categorical, continuous, binary), and number of missing observations. Choose at least two of these variables and provide figures that show their univariate distributions. Describe the plotted distributions in words.
- Run simple linear regressions with two different predictor variables. Interpret the results. Plot a scatterplot of each regression and include the fitted line on the graph. Rescale your predictor variables if necessary to obtain a meaningful interpretation of beta0.
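
One common way to rescale a predictor is to center it at its mean, so that beta0 becomes the expected outcome at the average predictor value rather than at zero. The sketch below illustrates this on simulated data; the variable names and numeric values are hypothetical.

```r
# Simulated data with a predictor whose values are far from zero,
# so the raw intercept is an extrapolation. Values are hypothetical.
set.seed(3)
x <- rnorm(100, mean = 50, sd = 10)
y <- 10 + 0.5 * x + rnorm(100)

# Center the predictor at its sample mean.
x_centered <- x - mean(x)

fit_raw <- lm(y ~ x)
fit_centered <- lm(y ~ x_centered)

# The slope is unchanged; only the intercept and its interpretation differ.
print(coef(fit_raw))
print(coef(fit_centered))

# Scatterplot with the fitted line from the centered model.
plot(x_centered, y, xlab = "x (centered at its mean)", ylab = "y")
abline(fit_centered)
```

After centering, the fitted intercept equals the sample mean of y, which is usually easier to interpret than the predicted outcome at x = 0.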
(10 points) Using R, create an example of Simpson’s paradox using simulated data, where you have a continuous outcome variable, one continuous x variable, and one categorical x variable. It is not necessary to fit regression models to show the paradox, but you should use several graphics to illustrate the slopes, as was shown in the slides for class 1. You should simulate your data from probability distributions using R functions such as rnorm(), runif(), rpois(), etc. Every time your .Rmd file is re-knit, you should end up with different data, but the story should be the same.
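
To make the idea concrete, here is one hedged sketch of how such data could be generated; it is not the construction used in the class slides. The group labels, shifts, slopes, and noise level are all assumed values, chosen so that the within-group slopes are negative while the pooled slope is positive. No seed is set, so the data change on every run while the pattern stays the same.

```r
# Three groups; each group is shifted to the right and upward.
# All numeric settings are illustrative assumptions.
n_per_group <- 100
group <- factor(rep(c("g1", "g2", "g3"), each = n_per_group))
group_shift <- c(0, 5, 10)[as.integer(group)]

# Within a group, y decreases in x (slope -1); across groups, the upward
# shift in y (2 * group_shift) dominates, so the pooled trend is positive.
u <- runif(length(group), 0, 4)
x <- group_shift + u
y <- 2 * group_shift - 1 * u + rnorm(length(group))

# Pooled slope (ignoring group) and the slope within each group.
pooled_fit <- lm(y ~ x)
within_slopes <- sapply(levels(group), function(g) {
  idx <- group == g
  coef(lm(y[idx] ~ x[idx]))[[2]]
})
print(coef(pooled_fit)[["x"]])
print(within_slopes)

# One plot showing both stories: the solid line is the pooled fit,
# the dashed lines are the within-group fits.
plot(x, y, col = as.integer(group), pch = 19)
abline(pooled_fit, lwd = 2)
for (g in levels(group)) {
  idx <- group == g
  abline(lm(y[idx] ~ x[idx]), col = which(levels(group) == g), lty = 2)
}
```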

Due Tuesday 1/27/2015

Read ISL Chapters 1 and 3.1.
Read Faraway Chapters 1 and 2.
Read through the syllabus.
(5 points for completing the test) Take the CAOS test (access code provided on Piazza).
(10 points) Hand in Problem 8 (part c is optional) from ISL Chapter 3 as a PDF file created using RMarkdown. You may use ggplot2 functions instead of base R graphics.
(5 points) Create a Google Drive folder named “[LastName]-[FirstName]-690NR” (e.g. “Reich-Nick-690NR”) and share it with me (nick at umass dot edu). You will use this folder to hand in homework assignments.
(Review) Brush up on creating data analysis reports using RMarkdown.
(Review) If you don’t know what “tidy data” is, read about it.