BE/Bi 103 a: Introduction to Data Analysis in the Biological Sciences
Modern biology is a quantitative science, and biological scientists need to be equipped with tools to analyze quantitative data. This course takes a hands-on approach to developing these tools. Together, we will analyze real data. We will learn how to organize, preserve, and share data sets, create informative interactive graphical displays of data, process images to extract actionable data, and perform basic resampling-based statistical inferences.
Importantly, biological data is often “messy,” and there is no one right way to perform an analysis or make a plot. As we work with data, we will discuss various approaches to get a feel for the art of biological data analysis.
The sequel to this course goes deeper into statistical modeling, mostly from a Bayesian perspective. This course is foundational for the sequel and for further study in the analysis of biological data.
If you are enrolled in the course, please read the Course policies. We will not go over them in detail in class, and it is your responsibility to understand them.
Useful links
Ed (used for course communications)
Homework solutions (password protected)
People
Instructor
Justin Bois (bois at caltech dot edu)
TAs
David Goertsen (dgoertsen at caltech dot edu)
Kayla Jackson (kaylajac at caltech dot edu)
Grace Liu (graceliu at caltech dot edu)
Zack Martinez (zmartine at caltech dot edu)
Anastasiya Oguienko (oguienko at caltech dot edu)
Lessons
- 0. Preparing computing resources for the course
- 1. The cycle of science
- 2. Version control with Git
- 3. Introduction to Python
- 4. Style
- 5. Test-driven development
- 6. Exploratory data analysis, part 1
- E1. To be completed after lesson 6
- 7. Exploratory data analysis, part 2
- 8. Data file formats
- 9. Data storage and sharing
- 10. Data wrangling
- E2. To be completed after lesson 10
- 11. Intro to probability
- 12. Random number generation
- 13. Probability distributions
- E3. To be completed after lesson 13
- 14. Plug-in estimates and confidence intervals
- 15. Nonparametric inference with hacker stats
- E4. To be completed after lesson 15
- 16. Null hypothesis significance testing
- 17. Hacker’s approach to NHST
- E5. To be completed after lesson 17
- 18. Parametric inference
- 19. Numerical MLE
- E6. To be completed after lesson 19
- 20. Variate-covariate modeling
- 21. Confidence intervals of MLEs
- E7. To be completed after lesson 21
- 22. Reproducible workflows
- 23. The paper of the future
- 24. Implementation of MLE for variate-covariate models
- E8. To be completed after lesson 24
- 25. Mixture models
- 26. Model assessment
- 27. Implementation of model assessment
- E9. To be completed after lesson 27
- 28. Statistical watchouts
Recitations
- R1. The command line and Git
- R2. Intro to image processing
- R3. Manipulating data frames
- R4. Probability review
- R5. Overplotting
- R6. Dashboards
- R7. Topics in bootstrapping
- R8. Review of maximum likelihood estimation
- R9. Best practices when using the Resnick High Performance Computing Center and other related topics
Homework
- 0. Configuring your team
- 1. Practice with Python tools and EDA I
- 2. Exploratory data analysis II
- 3. Wrangling, EDA III, and Normal approximations
- 4. Working with probability distributions
- 5. Nonparametric hacker stats
- 6. Maximum likelihood estimation I
- 7. Maximum likelihood estimation II
- 8. Maximum likelihood estimation III
- 9. Model assessment
- 10. Course feedback