Course syllabus: Introduction to Machine Learning

From Research management course

Practical introduction classes

1. Mathematics and Python for data analysis

Data analysis and machine learning rely heavily on results from calculus, linear algebra, optimization methods, and probability theory. To successfully apply data analysis methods, you need to be able to program. This course introduces the fundamental mathematical concepts required for data analysis and the basic skills of Python programming.

Machine learning and data analysis. What is Python and why was it chosen? Installing Python. What are notebooks and how do you use them? Introduction to IPython Notebook. Data types. Loops, functions, generators, list comprehensions. Reading data from files. Writing and modifying files. Working with files in Python. Functions and their properties. Limits and derivatives. The geometric meaning of the derivative. The derivative of a composite function. The problem of finding an extremum. Functions and extrema. The second derivative and convexity.
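The Python constructs listed above can be sketched in a few lines. This is an illustrative example, not course material: it shows a list comprehension, a generator expression, and a central-difference estimate of a derivative that mirrors the limit definition covered this week.

```python
# List comprehension: squares of the even numbers below 10
squares = [x**2 for x in range(10) if x % 2 == 0]
print(squares)  # [0, 4, 16, 36, 64]

# Generator expression: values are produced lazily, one at a time
total = sum(x**2 for x in range(10))

# Central-difference estimate of f'(x), illustrating the limit definition
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(derivative(lambda x: x**2, 3.0), 4))  # 6.0, since (x^2)' = 2x
```

The generator expression computes the same sum as the list comprehension would, but without materializing the intermediate list in memory.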

2. Python Libraries and Linear Algebra

This week we'll take a look at Python libraries that contain a wide range of useful tools, from fast operations on multidimensional arrays to visualization and implementation of various mathematical methods. In addition, we will master linear algebra, the basic mathematical apparatus for working with data: in most problems, data can be represented as vectors or matrices.

Pandas data frames. Indexing and selection in Pandas. First introduction to NumPy, SciPy, and Matplotlib. Solving optimization problems in SciPy. Basic concepts of linear algebra. Vector spaces. Linear independence and dimension. Operations in vector spaces. Matrix operations. Rank and determinant. Systems of linear equations. Solvability of systems of linear equations and rank. Special types of matrices. Eigenvalues and eigenvectors. NumPy: matrices and operations on them. Linear algebra: text similarity and function approximation.
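As a sketch of how these linear-algebra topics look in NumPy (the matrix values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Matrix operations, rank and determinant, solving a linear system,
# eigenvalues and eigenvectors -- the week's topics in NumPy.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

print(np.linalg.matrix_rank(A))  # rank of A: 2
print(np.linalg.det(A))          # determinant: 3*2 - 1*1 = 5
x = np.linalg.solve(A, b)        # solve the system A @ x = b
print(x)                         # [2. 3.]
w, v = np.linalg.eig(A)          # eigenvalues and eigenvectors of A
print(np.sort(w))
```

Note that `np.linalg.solve` is preferred over computing an explicit inverse with `np.linalg.inv`: it is both faster and numerically more stable.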

3. Optimization and matrix decompositions

Techniques for optimizing model parameters to minimize cost or maximize prediction accuracy. Matrix decompositions are used to build regression models, to reduce the dimensionality of data, and in recommender systems and text analysis.

Partial derivatives and the gradient. Applying the gradient. Directional derivatives. The tangent plane and linear approximation. The direction of fastest growth. Optimization of non-smooth functions. Smoothness and gradient descent. Simulated annealing. Genetic algorithms and differential evolution. Decompositions of matrices into products; the singular value decomposition. Linear algebra review. Approximation by a matrix of lower rank. The relation between the singular value decomposition and lower-rank approximation. Optimization in Python: global and non-smooth function optimization.
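The connection between the singular value decomposition and lower-rank approximation can be sketched directly in NumPy (the data matrix below is made up for illustration):

```python
import numpy as np

# Truncating the SVD gives the best approximation of a given rank in the
# Frobenius norm (the Eckart-Young theorem).
M = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 6.0]])           # nearly rank-1 data

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 1                                # target rank
M_k = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction

print(np.linalg.matrix_rank(M_k))    # 1
print(np.linalg.norm(M - M_k))       # residual equals the dropped singular value
```

Because `M` is almost rank one, discarding the second singular value loses very little: the approximation error equals that discarded singular value.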

4. Probability

The basic concepts of probability theory and statistics are necessary to understand how almost all data analysis methods work. Popular distributions, statistics, parameter estimation, and the construction of confidence intervals.

Randomness in probability theory and statistics. Properties of probability. Conditional probability. Discrete random variables. Continuous random variables. Estimating a distribution from a sample. Important characteristics of distributions. Distributions, parameters, and estimates. The central limit theorem. Confidence intervals.
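A small sketch of the last two topics, using only the standard library and simulated data (the sample size and distribution are arbitrary choices for illustration): by the central limit theorem, the sample mean is approximately normal, which justifies the normal-approximation confidence interval below.

```python
import math
import random

random.seed(0)
sample = [random.uniform(0, 1) for _ in range(10_000)]  # true mean is 0.5

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)    # unbiased variance
se = math.sqrt(var / n)                                 # standard error of the mean

# 95% confidence interval: mean +/- 1.96 standard errors
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With 10,000 observations the standard error is small, so the interval is tight around the sample mean.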

5. Training on labeled data

Training on labeled data, or supervised learning, is the most common class of machine learning problems. It includes tasks where you need to learn to predict a certain value for any object given a finite number of examples. The focus is on classification and regression algorithms successfully applied in practice: linear models, neural networks, decision trees, and so on. Building predictive algorithms. Evaluating the generalization ability of algorithms, selecting model parameters, and choosing and computing quality metrics.

Basic terms in machine learning. Learning on labeled data. Unsupervised learning. Types of problems in machine learning. Features in machine learning. Machine learning: tasks and features. Linear models in regression problems. Gradient descent for linear regression. Stochastic gradient descent. Linear classification. Loss functions in classification problems. Linear regression and the core Python libraries for data analysis and scientific computing. Linear regression and stochastic gradient descent.
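Gradient descent for linear regression can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the learning rate, step count, and the true coefficients 2 and 1 are arbitrary choices), not the course's reference implementation:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + rng.normal(0, 0.05, size=100)

w, b = 0.0, 0.0          # model: y_hat = w * x + b
lr = 0.1                 # learning rate
for _ in range(1000):
    y_hat = w * X + b
    grad_w = 2 * np.mean((y_hat - y) * X)   # d(MSE)/dw
    grad_b = 2 * np.mean(y_hat - y)         # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to the true values 2.0 and 1.0
```

Stochastic gradient descent differs only in that each step uses the gradient on a single example (or a small batch) rather than on the whole dataset.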

6. Combating overfitting and assessing quality

We consider the problem of overfitting: why it arises, how it can be detected, and how to deal with it. Introduction to cross-validation, which can be used to evaluate an algorithm's ability to make good predictions on new data. Next, we discuss quality metrics: without them, it is impossible to tell whether an algorithm is suitable for a particular problem. Introduction to the scikit-learn library, one of the main Python tools for machine learning.
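Cross-validation in scikit-learn fits in a few lines. This sketch uses the built-in Iris dataset and a logistic-regression classifier purely as an example; the dataset and model are illustrative choices, not prescribed by the course:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the model is trained on 4 folds and
# scored on the held-out fold, five times.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy across the folds
```

Because each fold's score is computed on data the model never saw during training, the average gives a less optimistic (and more honest) quality estimate than accuracy on the training set.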