Statistical machine learning
Academic year 2020/2021
- Course ID
- Silvia Montagna
- 2nd year
- Teaching period: To be defined
- D.M. 270 TAF C - Related or integrative
- Course disciplinary sector (SSD): SECS-S/01 - Statistics
- Formal authority
- Type of examination
- MAT0035 Statistical Inference
MAT0041 Multivariate Statistical Analysis
Good knowledge of R is required
Course summary
The course introduces methods and models for extracting important patterns and trends from large amounts of data, and presents basic concepts of machine learning and data mining from a statistical perspective. Topics covered include modern regression, classification, cross-validation, model selection and regularisation, and tree-based methods, among others. The course emphasises the selection of appropriate methods and the justification of that choice, the use of programming to implement the methods, and the evaluation and effective communication of results in data analysis reports.
Learning outcomes
Knowledge and understanding
- Advanced knowledge of parametric and nonparametric models for prediction and classification
Applying knowledge and understanding
- Ability to translate a variety of problems and data into statistical models for prediction and classification
- Students will be able to discern the different aspects of statistical learning in modern settings
- Students will use statistical language properly to communicate their findings
- The acquired skills will give students the opportunity to improve and deepen their knowledge of statistical modelling
Introduction to Statistical Learning:
- Context and motivations
- Trade-off between goodness-of-fit and model complexity (i.e., the bias-variance trade-off)
- Training and test set
- Exploratory data analysis
- Simple & multiple linear regression
- Residual analysis & model checking
Classification: Logistic regression; Multinomial logit/probit regression
Resampling methods: Cross-validation, bootstrap
- Shrinkage methods
- Dimension reduction methods
- Polynomial regression
- Step functions
- Splines, smoothing splines, thin-plate splines
- Generalised additive models
- Regression & classification trees
- Bagging, boosting, random forests
Support vector machines
Introduction to neural networks & deep learning:
- (Single- and) multi-hidden layers back-propagation networks
- Issues in training neural networks
- Convolutional neural networks
The course consists of 48 hours of lectures, which (for AY 2020/2021) will be held remotely, either live-streamed or pre-recorded. All lectures will be recorded and made available on Moodle in due time, together with slides and other course material.
Two thirds of the lectures are devoted to the methodological and theoretical aspects of statistical machine learning, with reproducible examples given to support understanding of the methods. The remaining lectures are devoted to their practical implementation in R. Students are free to use any programming language; however, R is the officially supported language for this course. Students will be able to use R Markdown to create HTML and PDF documents.
Learning assessment methods
Until the end of the Covid-19 emergency (including the September 2020 session), the written exam will be held remotely via Webex with video surveillance. More specific instructions will be sent to students registered for the exam at their institutional email addresses.
1) Winter (January/February) exam session: your final grade will be a weighted average of a data analysis project (60%) and a closed-book written exam on theory (40%). For the data analysis, students can work individually or in teams (max 3 people) on a project of their choosing. Each student or team must submit a project report by Friday, December 18. Each report will be reviewed by another group. The review process will be double-blind, meaning that both the project report and the review report will be anonymous.
For students who have failed to submit their data analysis (in whole or in part), case (2) below applies.
2) Summer and Fall exam sessions: the final exam consists of a long written test (4 hours) on theory (50%) and data analysis in R (50%).
Suggested readings and bibliography
The main books used during the course are:
- HASTIE, TIBSHIRANI AND FRIEDMAN. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
- JAMES, WITTEN, HASTIE AND TIBSHIRANI. An Introduction to Statistical Learning with Applications in R. Springer. Provides a nice introduction to the field of statistical machine learning for non-mathematical sciences.
Slides for the course will be provided. If you see any typos in my notes (no matter how small), please tell me about them! Doing so will benefit not only you, but also me, your classmates and any future students of this course.
There are many books written on machine learning, and new books keep appearing all the time. These books can approach the field from different perspectives (e.g., statistics, computer science, probability). Here are links to a few additional resources:
Electronic communication: I will occasionally send e-mails to the class (to the account listed for you in the SDS directory), so please check that account regularly.
- Enrollment opening date: 01/09/2020 at 00:00
- Enrollment closing date: 30/06/2021 at 00:00