Final project
- Overview
- Purpose
- Learning Objectives
- Getting Started: The Power of Starting Simple
- Research Question Development
- Technical Requirements
- Project Deliverables
- Writing Guidelines: Focus on YOUR Specific Findings
- Project Grading
- Timeline
- Dataset and Resources
- Collaboration and Academic Integrity
- Questions?
- References
Overview
The opioid epidemic represents one of the most devastating public health crises in modern American history. In 2023 alone, over 105,000 Americans died from drug overdoses, with opioids involved in approximately 80,000 (76%) of these deaths [1]. This crisis has evolved dramatically over the past two decades: from prescription painkillers in the early 2000s, to heroin in the 2010s, to synthetic opioids like fentanyl dominating the current landscape [2].
For your final project, you will use data from the CDC WONDER database to tell a focused, data-driven story about one aspect of the opioid epidemic. Your goal is not to solve the entire crisis or provide comprehensive policy recommendations, but rather to use your data visualization and storytelling skills to reveal specific patterns, trends, or disparities that can inform our understanding of this complex issue.
This is an individual project, though students working on similar topics are encouraged to discuss ideas and share resources. The project emphasizes iterative development—you’ll start with a simple dataset and visualization, then progressively deepen your analysis through three checkpoint assignments leading to a final report and in-class examination.
Most of the grade for the final project (80%) will be earned by your performance on a written exam on the last day of class. This exam will involve short answer and short essay questions and will be designed to help you showcase the depth of your understanding of the work on your project. The remaining 20% will be earned by the content in the final report that will be due the day before the final exam.
Purpose
This project serves multiple important purposes in your development as a public health data scientist:
- Integration: You’ll apply all the skills learned this semester (data manipulation, visualization, joining datasets, creating narratives) to a real-world public health problem.
- Autonomy: Unlike guided assignments, you’ll formulate your own research question, find relevant data, and make analytical decisions independently.
- Iteration: Through the checkpoint structure, you’ll experience how data analysis really works: starting simple, exploring, refining questions, and progressively building toward insights.
- Communication: You’ll practice translating complex data patterns into clear, accessible stories for non-technical audiences (policymakers, practitioners, journalists).
- Deep understanding: The significant weight on the in-class exam (80% of project grade) incentivizes you to truly understand your analysis, not just produce output. You’ll need to explain your choices, interpret your findings, and discuss limitations.
Learning Objectives
By completing this project, you will:
- Apply tidyverse data manipulation skills (dplyr, tidyr) to real public health surveillance data
- Create compelling visualizations (ggplot2) that tell clear stories rather than simply displaying data
- Integrate multiple datasets from different sources to answer complex questions
- Develop and refine research questions through iterative exploratory analysis
- Communicate technical findings to non-technical audiences with clarity and precision
- Practice reproducible research workflows using R Markdown
Getting Started: The Power of Starting Simple
Important: Do not try to answer big questions immediately. Start small.
Step 0: Watch a video about the CDC WONDER dataset
Prof Nick created this 18-minute video, which walks you through a query of the CDC WONDER dataset with step-by-step instructions. Watching it is required. Watching it before you start will save you time and extra work later on.
Step 1: Download One Simple Dataset
Go to CDC WONDER (Multiple Cause of Death database) and download a simple dataset, following the instructions in the video above. One of the variables that you select to “Group results by” in the query should be year, so you start to see some results over time. For example:
- Opioid deaths over time in your home state using the “Current Final Multiple Cause of Death Data” (1999-2020)
- Opioid deaths by age group in the United States using the “Provisional Multiple Cause of Death Data” (2018 - Last Week)
- …
Keep it simple: 1-2 dimensions (like year and location, or age and year).
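Once you have a download, loading it into R might look like the sketch below. It assumes the export is a tab-delimited text file with a Notes column, a "Total" row, and free-text footer lines after a "---" marker. Check your own file, since the layout depends on your query. The toy file contents and all numbers are made up for illustration.

```r
library(readr)
library(dplyr)

# Toy stand-in for a CDC WONDER export. All values are made up;
# replace wonder_path with the relative path to your own download.
toy_export <- c(
  "\"Notes\"\t\"Year\"\t\"Year Code\"\tDeaths\tPopulation\tCrude Rate",
  "\t\"2018\"\t\"2018\"\t120\t1000000\t12.0",
  "\t\"2019\"\t\"2019\"\t150\t1010000\t14.9",
  "\t\"2020\"\t\"2020\"\t210\t1020000\t20.6",
  "\"Total\"\t\t\t480\t3030000\t15.8",
  "\"---\"",
  "\"Dataset: Multiple Cause of Death\""
)
wonder_path <- tempfile(fileext = ".txt")
writeLines(toy_export, wonder_path)

# Read every column as text first; the footer lines have fewer columns
# than the data rows, so readr reports parsing warnings that we silence.
raw <- suppressWarnings(
  read_tsv(wonder_path, col_types = cols(.default = col_character()))
)

# Keep only real data rows: Notes is empty and Year is present.
deaths_by_year <- raw %>%
  filter(is.na(Notes), !is.na(Year)) %>%
  transmute(
    year = as.integer(Year),
    deaths = as.numeric(Deaths),
    population = as.numeric(Population)
  )

deaths_by_year
```

Reading everything as text and converting afterwards is one cautious choice here, because it keeps the "Total" row and footer text from silently changing column types.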
Step 2: Make One Plot
Create a basic visualization of this data. Here are a few simple examples to get you started, but you can try anything that you’d like here that you think makes sense.
- Line chart showing deaths over time
- Bar chart comparing age groups over time
- Some other plot that makes sense for your data
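As a starting point, a first line chart can be as short as the sketch below. The data frame and its column names (`year`, `deaths`) are placeholders with made-up values; substitute the table you loaded from CDC WONDER.

```r
library(ggplot2)

# Made-up counts standing in for a downloaded table.
deaths_by_year <- data.frame(
  year = 2015:2020,
  deaths = c(100, 130, 160, 170, 180, 240)
)

# One line, one point per year, and labels that say what is plotted.
p <- ggplot(deaths_by_year, aes(x = year, y = deaths)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Opioid-involved deaths over time (toy data)",
    x = "Year",
    y = "Deaths"
  )

print(p)
```

Nothing more elaborate is needed at this stage; the aim is to have something to look at closely.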
Step 3: Look closely
Spend time with this plot. Ask yourself questions like:
- What patterns do I see?
- What surprises me?
- What questions does this raise?
- What would I want to compare this to?
- What might explain these patterns?
Step 4: Generate Questions
Write down 3-5 questions that your initial plot raises. These become the seeds of research ideas for your project. Examples:
- “Why did deaths spike in 2015?”
- “Why is location X so different from location Y?”
- “Why do we see different patterns in younger vs. older age groups?”
- “How does my state compare to neighboring states?”
This simple start becomes Checkpoint 1. From there, you’ll expand and deepen your analysis.
Research Question Development
What Makes a Good Research Question?
Good questions are:
- Specific: Focus on particular places, times, demographics, or types of opioids
- Answerable: Can be addressed with available data
- Comparative: Look at differences across time, space, or groups
- Focused: Explore one clear pattern or relationship, not “everything about opioids”
Examples:
❌ Poor Questions (too broad, vague):
- “How bad is the opioid epidemic?”
- “What are the causes of opioid deaths?”
- “How can we solve the opioid crisis?”
✅ Good Questions (specific, answerable, focused):
- “How has the shift from prescription opioids to fentanyl affected mortality rates in rural vs. urban counties in New England between 2015-2023?”
- “Do states with prescription drug monitoring programs show different patterns in prescription opioid deaths compared to synthetic opioid deaths?”
- “How do age-specific opioid mortality patterns differ between Appalachian states and West Coast states?”
- “Has the urban-rural gap in opioid mortality rates widened or narrowed over the past decade?”
Suggested Angles to Explore (optional, not required):
Your research question could focus on:
- Geographic patterns:
  - Urban vs. rural differences
  - Regional trends (Appalachia, New England, West Coast, etc.)
  - State-to-state comparisons
  - County-level hotspots
- Demographic disparities:
  - Age group patterns
  - Gender differences
  - Racial/ethnic disparities
- Evolution over time:
  - Shifts from prescription → heroin → fentanyl
  - Impact of policy changes
  - Waves and phases of the epidemic
- Contextual factors (requires external data):
  - Economic factors (unemployment, poverty, income)
  - Healthcare access (provider density, Medicaid expansion)
  - Policy environment (prescription monitoring, naloxone access)
Technical Requirements
Required Tools and Packages
You must use R and R Markdown for all project work. At minimum, you should use:
- tidyverse: dplyr (data manipulation), ggplot2 (visualization), tidyr (reshaping)
- DT or similar: For presenting data tables (optional but recommended)
Data Requirements
- Primary dataset: CDC WONDER opioid mortality data
  - Multiple Cause of Death database: https://wonder.cdc.gov/mcd.html
  - Focus on opioid-related ICD-10 codes (T40.0-T40.4, T40.6)
- External data integration:
  - U.S. Census data (population, demographics)
  - Bureau of Labor Statistics (unemployment, income)
  - Healthcare access data
  - State policy data (prescription monitoring programs, naloxone access laws)
  - You must meaningfully integrate at least 1 dataset in addition to the CDC WONDER dataset
- Population adjustment: When comparing across geography, you almost certainly should use population-adjusted rates (deaths per 100,000 population), not raw counts. If the whole point of your analysis is to compare differences on an absolute scale between large and small states, then using rates might not be appropriate, but to make a meaningful comparison between locations with different population sizes you must use a population-adjusted rate.
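To make the rate calculation concrete, here is a minimal sketch that joins a deaths table to a separate population table and computes deaths per 100,000. The state names, column names, and all numbers are hypothetical. The formula is rate = deaths / population × 100,000.

```r
library(dplyr)

# Hypothetical death counts (e.g., from CDC WONDER).
deaths <- tibble(
  state = c("State A", "State B"),
  year = 2020L,
  deaths = c(500, 50)
)

# Hypothetical population estimates from an external source.
population <- tibble(
  state = c("State A", "State B"),
  year = 2020L,
  population = c(10000000, 500000)
)

# Join on the shared keys, then convert counts to rates per 100,000.
rates <- deaths %>%
  left_join(population, by = c("state", "year")) %>%
  mutate(rate_per_100k = deaths / population * 100000)

rates
```

In this toy example State A has ten times as many deaths as State B but half the rate (5 vs. 10 per 100,000), which is the kind of reversal raw counts can hide.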
Visualization Requirements
Your final project should include 4-5 high-quality visualizations that:
- Use appropriate visual encodings:
- Position (x/y axes) or other visual cues for quantitative comparisons
- Color used thoughtfully (not arbitrarily)
- Appropriate scales and transformations when needed
- Include at least one:
- Faceted visualization or small multiples
- Time series or temporal comparison
- Geographic comparison (maps are optional, not required)
- Are simple but impactful:
- Clear, interpretable
- Minimal clutter
- Each figure has a clear point to make in your story
- All figures dynamically generated (no manual insertion)
- Optional enhancements:
- Interactive elements (plotly, DT tables)
- Animations (gganimate) if they enhance the story
- Ridge plots, density visualizations, or other advanced techniques
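A faceted time series can cover two of the required elements at once (small multiples and a temporal comparison). The sketch below uses placeholder age groups and made-up rates; it is one possible layout, not a required one.

```r
library(ggplot2)

# Made-up rates for three hypothetical age groups over five years.
rates <- expand.grid(
  year = 2016:2020,
  age_group = c("25-34", "35-44", "45-54"),
  stringsAsFactors = FALSE
)
rates$rate_per_100k <- c(
  10, 12, 15, 19, 24,   # 25-34
  12, 13, 15, 18, 22,   # 35-44
  11, 11, 12, 14, 17    # 45-54
)

# One panel per age group, sharing the same axes for easy comparison.
p <- ggplot(rates, aes(x = year, y = rate_per_100k)) +
  geom_line() +
  facet_wrap(~ age_group) +
  labs(
    title = "Opioid mortality rate by age group (toy data)",
    x = "Year",
    y = "Deaths per 100,000"
  )

print(p)
```

Shared axes (the `facet_wrap()` default) make levels comparable across panels; free scales would instead emphasize the shape of each trend.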
Code and Reproducibility
- All code must be included in R Markdown (.Rmd) file
- Use code folding (code_folding: hide) in final HTML output for readability
- Use relative file paths (not absolute paths like /Users/yourname/...)
- Include all data files with submission (or clear instructions for downloading)
- Code should run without errors when knitted
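A setup chunk near the top of your .Rmd file is one common place to handle some of these points. The sketch below assumes the knitr package and uses a hypothetical file name. The `code_folding: hide` option itself belongs in the document's YAML header, not in R code.

```r
# Contents of a setup chunk near the top of the .Rmd file.
knitr::opts_chunk$set(
  echo = TRUE,      # code must be echoed for code folding to have anything to fold
  warning = FALSE,  # keep warnings out of the knitted HTML
  message = FALSE   # keep package start-up messages out of the knitted HTML
)

# Relative path: resolved from the folder that contains the .Rmd file
# when knitting. The folder and file names here are hypothetical.
data_path <- file.path("data", "wonder_by_year.txt")
data_path
```

Hiding warnings only cleans up the output. It is still worth reading them while you work, since they often point to real data problems.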
Project Deliverables
Each checkpoint will have its own assignment on Canvas, but they are described here briefly.
Checkpoint 1: Research Question and Initial Exploration
Due: Monday, November 17, 2025 at 10:00 PM
Requirements:
- Download and load one simple CDC WONDER dataset.
- Create one exploratory visualization.
- Generate 3-5 preliminary research questions based on what you observe.
- Select and clearly state 1 primary research question to pursue.
- Describe (1 paragraph) what additional dimensions of the CDC Wonder dataset and what external data sources you plan to use and why.
Deliverable: R Markdown HTML file with code folding, approximately 300-500 words
Purpose: This checkpoint ensures you start early and begin the iterative process. Don’t overthink it—just explore. You will receive feedback on this assignment which will leave you time to adjust plans or change course if you want to.
Checkpoint 2: Expanded Analysis and Draft Visualizations
Due: Monday, November 24, 2025 at 10:00 PM
Requirements:
- Download a more complex version of the CDC WONDER data, with more dimensions than in Checkpoint 1 (unless your Checkpoint 1 analysis already used many variables).
- Integrate at least one external dataset with your CDC WONDER data.
- Create 2-3 draft visualizations that begin to address your research question.
- Write a brief analysis that provides one paragraph per visualization plus one paragraph summarizing the findings overall so far.
- Revise your research question if needed based on what you’re learning, and discuss any planned updates to datasets.
- Outline your plan (1 paragraph) for additional analysis before the final report.
Deliverable: R Markdown HTML file with code folding, approximately 500-800 words
Purpose: You’re now deepening the analysis and starting to build your narrative arc.
Checkpoint 3: Near-Complete Draft
Due: Monday, December 1, 2025 at 10:00 PM
Requirements:
- Near-complete analysis with all major visualizations (3-5 figures)
- Full draft narrative including:
  - Introduction (context and research question)
  - Methods (data sources, approach)
  - Findings (what your visualizations show)
  - Discussion (limitations and implications)
- All technical elements in place
Deliverable: R Markdown HTML file with code folding, approximately 1200-1500 words, 3-4 figures
Purpose: This is essentially a complete draft. The final week is for polishing and refinement, not major new work.
Final Report
Due: Monday, December 8, 2025 at 10:00 PM
Requirements:
- Polished, complete analysis
- 1500-2000 words
- 4-5 high-quality visualizations
- Professional write-up suitable for non-technical audience
- All code included with code folding enabled
Deliverables (submit all):
- R Markdown source file (.Rmd)
- Knitted HTML output
- All data files used (can be zipped together if multiple files)
Grading: This report is worth 20 points (20% of project grade)—see rubric below.
In-Class Exam
Date: Tuesday, December 9, 2025 (during final class session)
The exam will consist of questions about your specific project (60 points) and some general course content (20 points). Project-specific questions will focus on interpretation of your visualizations, explanation of technical decisions you made, discussion of alternative analyses you considered, identification of limitations in your approach, suggestions for future extensions, and “what if” scenarios about different data or methods. General course content questions will cover data manipulation concepts (tidyverse), visualization principles (grammar of graphics), reproducible research practices, and data storytelling concepts. The heavy weight on this exam (80% of project grade) incentivizes you to deeply understand your own work. You should be able to explain every choice, interpret every figure, and discuss the limitations of your analysis. This ensures the project is a genuine learning experience, not just output generation.
You may bring print-outs of two of your figures (the image only, no text) as a reference for answering questions during the exam.
Writing Guidelines: Focus on YOUR Specific Findings
What to AVOID:
Generic platitudes and broad conclusions like:
“In conclusion, the opioid epidemic is a complex, multifaceted issue that requires a comprehensive, coordinated response from the government, healthcare providers, and communities. The evidence presented in this analysis provides valuable insights into some of the factors driving the opioid crisis, and these insights can help us shape future public health policies that are aimed at reducing the devastating impact of opioids on American society.”
This tells the reader nothing specific about what you discovered in your analysis.
What to INCLUDE:
Specific findings from YOUR data with concrete details:
✅ Good example:
“Between 2015 and 2020, opioid mortality rates in rural New England counties increased 340% (from 12.3 to 54.2 deaths per 100,000), while urban counties in the same region saw a 180% increase (from 18.7 to 52.4 per 100,000). This rural acceleration coincided with the emergence of illicitly manufactured fentanyl, which by 2020 was involved in 87% of rural opioid deaths compared to 56% in 2015. This pattern suggests that rural areas may have been particularly vulnerable to the shift from prescription opioids to synthetic drugs, potentially due to limited access to harm reduction services and medication-assisted treatment.”
Key elements of good writing:
- Specific numbers from your analysis
- Clear comparisons that you made
- Patterns you discovered in your data
- Tentative explanations connected to your findings (use cautious language: “suggests,” “may indicate,” “is consistent with”)
- Limitations of your specific analysis (what you can’t conclude, what data you lack)
- Concrete next steps that would extend your specific work
Structure Your Writing:
Your report should follow a clear four-part structure. Start with an Introduction (~300 words) that provides brief context about the opioid epidemic (2-3 sentences), states your specific research question, explains why this question matters, and previews your key findings. This opening should draw readers in and make them care about your analysis.
The Methods section (~200-300 words) should describe your data sources with specifics including date ranges, geographic scope, and variables used. Document any data cleaning or processing steps you took, explain what external data you integrated and how, and provide a brief description of your analytical approach. This section gives readers confidence that your analysis is rigorous and reproducible.
Your Findings section (~600-800 words) is the heart of your report and should be organized around your 4-5 visualizations. Each figure should be introduced, displayed, and then interpreted with specific numbers and patterns from your data. Build a logical narrative progression that tells a coherent story rather than just presenting disconnected results.
Finally, the Discussion (~300-400 words) should summarize your specific key findings, acknowledge the limitations of your analysis (data gaps, methodological constraints, scope), discuss the implications of your specific findings, and suggest concrete next steps for future analysis. Avoid generic policy recommendations—stay focused on what your particular analysis reveals and what questions it raises.
Audience
Write for an educated non-expert: imagine your reader is a public health practitioner, policymaker, or journalist who is intelligent and interested in your topic but doesn’t have technical statistical training. This means you should define technical terms when you use them, explain why you made certain analytical choices (not just what you did), focus on interpretation and meaning rather than just methodological details, and use clear, jargon-free language throughout. Your goal is to make your findings accessible and compelling to someone who cares about public health but may not understand statistical notation or advanced analytical techniques.
Project Grading
Final Report: 20 points (20% of project grade)
Visualization Quality and Appropriateness (10 points)
In general, the report should contain 4-5 high-quality figures that address your research question and form a story. The following list contains features that I expect from good data visualizations.
- Meaningful contribution to the coherent whole story that is being told
- Appropriate visual encodings for data types
- Effective use of faceting, color, scales
- Clear labels, legends, and captions
- Population-adjusted rates used where appropriate
Writing Quality and Style (5 points)
Your report should tell a clear, compelling story about your findings. Strong writing demonstrates thoughtful organization and communicates your insights effectively to a non-technical audience.
- Clear introduction with specific research question
- Logical progression of ideas
- Specific findings emphasized over generic statements
- Professional writing appropriate for educated non-expert audience
- Proper structure (introduction, methods, findings, discussion)
Technical Execution (5 points)
Your submission should demonstrate professional formatting and attention to technical details. A well-executed report is clean, easy to read, and free of distracting technical issues.
- HTML file submitted with proper formatting
- Code folding enabled
- No warning or error messages displayed in output
In-Class Exam: 80 points (80% of project grade)
Project-specific questions will focus on interpretation of your visualizations, explanation of technical decisions you made, discussion of alternative analyses you considered, identification of limitations in your approach, suggestions for future extensions, and “what if” scenarios about different data or methods. The heavy weight on this exam (80% of project grade) incentivizes you to deeply understand your own work. You should be able to explain every choice, interpret every figure, and discuss the limitations of your analysis. This ensures the project is a genuine learning experience, not just output generation.
Timeline
| Date | Deliverable | Grading |
|---|---|---|
| Monday, Nov 17, 10pm | Checkpoint 1: Research Question & Initial Exploration | 20 homework points |
| Monday, Nov 24, 10pm | Checkpoint 2: Expanded Analysis & Draft Visualizations | 20 homework points |
| Monday, Dec 1, 10pm | Checkpoint 3: Near-Complete Draft | 20 homework points |
| Monday, Dec 8, 10pm | Final Report | 20 project points |
| Tuesday, Dec 9 (in class) | In-Class Exam | 80 project points |
Note: Checkpoint assignments are designed to provide you with opportunities for feedback and guidance on your project. They will be graded as other homework assignments are, for completeness, and will be evaluated with short in-class assessments. The checkpoint assignments and assessments will count towards your overall assignment and assessment grade total, but will not count towards the 100-point total for your project grade.
Dataset and Resources
CDC WONDER
Access: https://wonder.cdc.gov/mcd.html
Tips for CDC WONDER:
- Select “Multiple Cause of Death, 1999-2020” or the “Provisional Multiple Cause of Death, 2018 - Last Week”.
- Group Results By: Choose your dimensions (Year, State, County, Age Group, etc.).
- ICD-10 Codes for Opioids: Under “MCD - Drug/Alcohol Induced Causes”. You could choose all or some of these, depending on your research question.
  - T40.0: Opium
  - T40.1: Heroin
  - T40.2: Other opioids (includes natural and semi-synthetic)
  - T40.3: Methadone
  - T40.4: Synthetic opioids other than methadone (includes fentanyl)
  - T40.6: Other and unspecified narcotics
- Population data: CDC WONDER includes population denominators for rate calculations for most downloaded datasets.
- Suppression: Cells with <10 deaths are suppressed for privacy. This may affect county-level analyses. Pay close attention to whether you have a lot of cells suppressed. Analyses can make adjustments if there is a lot of missing data, but it can make things more complicated.
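One way to check how much suppression affects your data is sketched below. It assumes suppressed cells arrive as the text "Suppressed" in the deaths column, which is why that column is read as text first. Confirm how your own export marks them. County names and counts are made up.

```r
library(dplyr)

# Made-up county-level extract; the deaths column is text because
# suppressed cells are assumed to contain the word "Suppressed".
county_raw <- tibble(
  county = c("County A", "County B", "County C", "County D"),
  deaths = c("25", "Suppressed", "14", "Suppressed")
)

# Flag suppressed cells, then convert the rest to numbers (suppressed -> NA).
county <- county_raw %>%
  mutate(
    suppressed = deaths == "Suppressed",
    deaths = as.numeric(na_if(deaths, "Suppressed"))
  )

# Share of cells that are suppressed: a quick check on how much is missing.
share_suppressed <- mean(county$suppressed)
share_suppressed
```

Treating suppressed cells as NA rather than 0 keeps them from being read as true zeros. If the suppressed share is large, a coarser grouping (for example, states instead of counties) may be a simpler route.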
External Data Sources
U.S. Census Bureau
- Population estimates: https://www.census.gov/programs-surveys/popest.html (note the tidycensus R package has a convenient way to access census data through R)
- American Community Survey (demographics, income, education): https://www.census.gov/programs-surveys/acs
Bureau of Labor Statistics
- Unemployment data: https://www.bls.gov/lau/
- Wage and income data: https://www.bls.gov/oes/
Health Resources & Services Administration
- Health Professional Shortage Areas: https://data.hrsa.gov/
- Provider data: e.g., data on providers from KFF at https://www.kff.org/state-category/providers-service-use/
State and Federal Policy Data
- Prescription Drug Monitoring Programs (PDMPs): varies by state (e.g., the one from MA DPH)
- Naloxone access laws via PDAPS (Prescription Drug Abuse Policy System): https://pdaps.org/
- CDC Overdose Prevention data: https://www.cdc.gov/overdose-prevention/data-research/facts-stats/index.html
R Packages That May Be Useful
- tidyverse: Core data manipulation and visualization
- tidycensus: Easy access to Census data from R
- sf: For spatial data and mapping (optional)
- plotly: Interactive visualizations
- DT: Interactive tables
- gganimate: Animated visualizations (optional)
Collaboration and Academic Integrity
This is an individual project. Each student must formulate their own research question, conduct their own analysis, write their own report, and take the exam based on their individual work.
However, students working on similar topics are encouraged to collaborate in productive ways. You may discuss ideas and approaches with classmates, share resources and data sources, provide feedback on each other’s work, and troubleshoot technical issues together. This type of collaboration can enhance your learning and help you work through challenges more effectively.
What is NOT allowed: You may not share code directly (though discussing approaches is fine), write portions of each other’s reports, or have someone else conduct your analysis. The core analytical and writing work must be your own.
Generative AI Use for the Project
Generative AI tools (e.g., ChatGPT, Claude) can be valuable resources for this project when used appropriately, but they must not replace your own thinking and learning. For this project, you are expected to follow the GenAI use policies outlined in the course syllabus. Specifically:
Acceptable uses of AI for the project include:
- Brainstorming initial ideas for research questions (after developing your own ideas first)
- Getting help when stuck on a specific coding problem
- Editing and refining your writing for clarity and grammar
- Asking for explanations of technical concepts or jargon
- Generating study materials to prepare for the exam
Unacceptable uses of AI for the project include:
- Having AI write code for you without understanding what it does
- Submitting AI-generated data analysis as your own work
- Using AI to write substantial portions of your report
- Having AI summarize articles or data instead of reading/analyzing them yourself
- Making only minor adjustments to AI-generated content to pass it off as your own
Remember: The in-class exam (80% of your project grade) will test your deep understanding of your own work. If you rely too heavily on AI tools to complete your project without truly understanding the material, you will struggle on the exam. The exam will ask you to explain your visualizations, justify your technical decisions, discuss alternatives, and identify limitations—all from memory and without AI assistance. Your success depends on genuine engagement with your project work.
If in doubt, ask the instructor.
Questions?
This is a substantial project, but the checkpoint structure is designed to keep you on track and prevent last-minute stress. Start early, start simple, and iterate.
If you have questions about:
- Data access: Visit office hours or email with specific issues
- Research questions: Discuss in office hours—feedback is encouraged
- Technical problems: Office hours, Slack, or email
- Scope/expectations: Better to ask early than to stress later
Remember: The goal is not to solve the opioid epidemic, but to tell one clear, specific, data-driven story about one aspect of it. Focus and depth are more valuable than breadth.
Good luck, and I look forward to seeing what patterns and insights you discover!
References
1. CDC, National Center for Health Statistics. (2024). Drug Overdose Deaths in the United States, 2003–2023 (NCHS Data Brief No. 522). Retrieved from https://www.cdc.gov/nchs/products/databriefs/db522.htm
2. Ahmad, F. B., Cisewski, J. A., Rossen, L. M., & Sutton, P. (2024). Provisional drug overdose death counts. National Center for Health Statistics. Retrieved from https://www.cdc.gov/nchs/nvss/vsrr/drug-overdose-data.htm