Last year I introduced a new Data for Action: Data Science in Public Health course at the UW. The pilot course had 34 students, coming from a variety of backgrounds, not only public health. As with the development of all new courses, it took time to develop learning objectives and course content, including readings, lectures, exercises and assignments.
The biggest challenge: I couldn’t cover all the topics that I believe are important in this rapidly developing field of data science!

Students needed to learn the core concepts of computer programming and coding for reproducible data analyses.
Students needed to have practical real-world examples of data analyses that span the different public health subdisciplines of epidemiology, laboratory sciences, health services, and environmental health.
Students also needed to have examples that could be applied to both quantitative and qualitative methods.
Ultimately, I developed a series of Reproducible Analytical Pipelines (RAPs) as exercises for the students. Each RAP was based on a common set of steps:
- Ingesting data
- Data quality checks and cleaning
- Data summarization
- Data visualization
- Modeling
- Interpreting results
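The steps above can be sketched as a single small script. This is a minimal illustration, not one of the course's actual RAPs: the dataset, column names, and plausibility ranges below are all hypothetical, and the visualization and modeling steps are left as placeholders.

```python
import statistics

# Hypothetical dataset: systolic blood pressure (sbp) readings by
# smoking status. All values and names are illustrative.

# 1. Ingest: a real pipeline would read a file, database, or API here.
raw = [
    {"id": 1, "sbp": 128, "smoker": "yes"},
    {"id": 2, "sbp": 115, "smoker": "no"},
    {"id": 3, "sbp": None, "smoker": "no"},   # missing value
    {"id": 4, "sbp": 142, "smoker": "yes"},
    {"id": 5, "sbp": 980, "smoker": "no"},    # implausible entry
]

# 2. Quality checks and cleaning: drop missing or out-of-range readings
#    (70-250 mmHg is an assumed plausibility range for this sketch).
clean = [r for r in raw if r["sbp"] is not None and 70 <= r["sbp"] <= 250]

# 3. Summarization: group means by smoking status.
def group_mean(rows, group):
    vals = [r["sbp"] for r in rows if r["smoker"] == group]
    return statistics.mean(vals)

summary = {g: group_mean(clean, g) for g in ("yes", "no")}

# 4. Visualization (placeholder): a real pipeline would plot here.
# 5. Modeling (placeholder): e.g., regress sbp on smoking status.
# 6. Interpretation: report the summary for discussion.
print(summary)
```

Even at this toy scale, the cleaning step forces the kind of discussion the course aimed for: the two dropped records represent choices (exclude vs. impute vs. investigate) that students had to justify.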
Through these steps, it was possible to have discussions with the students about how data were collected, and what issues might be inherent in the data. We also discussed strategies for performing quality checks on the data, and how to handle missing or problematic data.
We also talked about data types, and strategies for analyzing continuous vs. categorical data, as well as other data types such as text and images.
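One way to make the data-type distinction concrete is a summary function that branches on the type of a column. This is an illustrative sketch with made-up records, not course material; real text data would of course need more than frequency counts.

```python
from collections import Counter
import statistics

# Hypothetical records mixing data types; all names and values are illustrative.
records = [
    {"age": 34, "county": "King", "note": "follow-up scheduled"},
    {"age": 51, "county": "Pierce", "note": "no symptoms reported"},
    {"age": 47, "county": "King", "note": "lab results pending"},
]

def summarize(rows, column):
    """Choose a summary appropriate to the column's data type."""
    values = [r[column] for r in rows]
    if all(isinstance(v, (int, float)) for v in values):
        # Continuous: report center and spread.
        return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}
    # Categorical (or raw text): frequency counts. Free text would usually
    # get further processing, e.g. tokenization, before counting.
    return Counter(values)

print(summarize(records, "age"))
print(summarize(records, "county"))
```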
Students worked through small-group exercises to think through analyses of hypothetical datasets, giving them opportunities, for example, to brainstorm the kinds of data visualizations best suited to a specific question they had about the data.
We also had a module on machine learning, in which students used a RAP to build machine learning models for both regression and classification problems.
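The regression-vs.-classification distinction can be sketched without any ML library: fit a line by least squares for a continuous outcome, then threshold its output for a discrete label. This is a toy illustration on made-up numbers (the 5.0 cutoff is an arbitrary assumption), not a reproduction of the models students actually built.

```python
# Regression: predict a continuous outcome from one predictor,
# using ordinary least squares on illustrative data (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    """Continuous prediction from the fitted line."""
    return intercept + slope * x

def classify(x, cutoff=5.0):
    """Classification: map the continuous score to a discrete label
    by thresholding (cutoff is an assumed, arbitrary value)."""
    return "high" if predict(x) > cutoff else "low"

print(round(slope, 2), classify(1.0), classify(4.0))
```

The same fitted model serves both tasks, which made for a useful discussion point: the modeling step in a RAP changes less than the interpretation step does when the question shifts from "how much?" to "which group?".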
We also had an exercise in which students each presented demonstrations of data analysis packages/libraries that they had found meaningful for their work.
We also had a module related to data ethics, which included discussion of privacy and security, data and code sharing, and what it means for analyses to be robust and repeatable.
Although we covered machine learning, we did not cover other aspects of Artificial Intelligence (AI) in depth, though some students’ demonstrations illustrated AI-based analyses. Given the rapid development of AI technologies, I feel that an entirely separate course could be devoted to the effective use of AI, especially prompt design and debugging for Large Language Models, as well as deep learning methods and generative technologies for other media.