Research Statement

Vadim, 2023

My research focuses on model selection in machine learning and draws on methods of applied mathematics and computer science. The central problem is to select, from a set of candidate models, the most accurate and robust forecasting model with the simplest structure. To define the algebraic structure of this set according to the application and the origin of the data, I use tools ranging from tensor algebra to differential geometry. I use multivariate statistics, Bayesian inference, and graph probability to derive the quality criteria for selection. My work joins theory and practical applications. I believe multi-model decoding problems for heterogeneous data are the most promising. Forecasting a variable with a complex structure requires several models to recover dependencies in the source and target spaces and to produce a consistent forecast. The examples to investigate are various spatio-temporal measurements. The practical applications are brain-computer interfaces, health monitoring with wearable devices, and other signals in biology and physics.

Model complexity and dimensionality reduction

Model selection is a fundamental problem of machine learning. Lasso and LARS [1,2] significantly impacted statistical forecasting methods. Time series data have multiple dependencies between components, which leads to redundancy in the design space. Our work shows that quadratic programming feature selection algorithms deliver less redundant and more stable models [3,4]. This approach considers the mutual information of the features and selects a model according to relevance and similarity criteria. The main idea is to minimize mutual dependence among the selected features and maximize approximation quality by varying an indicator function that defines the model structure. For linear models, we use the quadratic programming problem statement. For neural networks, we use Newton-type methods [5].
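
As an illustration only, the quadratic programming statement for a linear model can be sketched as follows. The sketch makes several assumptions that are not taken from the cited works: squared Pearson correlation stands in for mutual information, the binary indicator is relaxed to non-negative weights that sum to one, and a plain projected gradient loop replaces a dedicated QP solver. The names qpfs_weights, project_to_simplex, and alpha are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)


def qpfs_weights(X, y, alpha=0.5, n_iter=500):
    """Minimise (1 - alpha) * a^T Q a - alpha * b^T a over the simplex."""
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    Q = corr[:-1, :-1] ** 2   # feature-feature similarity (assumed measure)
    b = corr[:-1, -1] ** 2    # feature-target relevance (assumed measure)
    # Step size 1 / L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (2 * (1 - alpha) * np.linalg.eigvalsh(Q)[-1])
    a = np.full(X.shape[1], 1.0 / X.shape[1])
    for _ in range(n_iter):
        grad = 2 * (1 - alpha) * Q @ a - alpha * b
        a = project_to_simplex(a - step * grad)
    return a


# Synthetic data: x1 nearly duplicates x0, x2 is relevant, x3 is pure noise.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.01 * rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = x0 + x2 + 0.1 * rng.normal(size=n)

weights = qpfs_weights(X, y)
```

In this toy setting the two near-duplicate features have to share their weight, while the noise feature receives less weight than the relevant independent one. Features would then be selected by thresholding the weights.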

Tensor decomposition for model selection

We extend this approach to tensor decomposition [6] to reduce the dimensionality of brain signals [7]. We use a multi-way structure to predict hand trajectories from the spatial time series of cortical activity [8]. These data reside in the spatial, spectral, and temporal domains. Since the data are highly correlated across these domains, the redundancy and dimensionality of the feature space become a major obstacle to a robust solution of the regression problem in both the multi-way and flat cases. The proposed method extends the quadratic programming feature selection approach [9]. We do not have to optimize the model’s parameters to get an optimal model structure.
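
To make the multi-way setting concrete, the following minimal sketch reduces a three-way array with a truncated higher-order SVD. This is a generic Tucker-type decomposition chosen for illustration, not necessarily the decomposition used in [8]; the tensor shape (channels x frequencies x time) and all function names are assumptions.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricise the tensor: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply the tensor by `matrix` along axis `mode`."""
    product = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(product, 0, mode)


def truncated_hosvd(tensor, ranks):
    """Return a small core tensor and one orthonormal factor per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Map the core back to the original space."""
    tensor = core
    for mode, u in enumerate(factors):
        tensor = mode_product(tensor, u, mode)
    return tensor


rng = np.random.default_rng(0)
# Synthetic (channels x frequencies x time) tensor of multilinear rank (2, 2, 2).
true_core = rng.normal(size=(2, 2, 2))
true_factors = [rng.normal(size=(dim, 2)) for dim in (8, 5, 20)]
signal = reconstruct(true_core, true_factors)

core, factors = truncated_hosvd(signal, ranks=(2, 2, 2))
approximation = reconstruct(core, factors)
```

Here the 8 x 5 x 20 array is represented by a 2 x 2 x 2 core, and the reconstruction is exact because the synthetic signal has low multilinear rank by construction. Real recordings would only be approximated.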

Bayesian model selection

Bayesian model selection relies on the analysis of the model parameters [10]. To select models, we use several types of parameters, including structure parameters and hyperparameters [11]. The former define the structure of the model, that is, its computational graph. The latter describe the distribution of the model parameters and yield the criterion for selecting the optimal model. Proper hyperparameter values prevent the model from overfitting. Optimizing neural networks with a large number of hyperparameters is computationally expensive. We proposed modifications of various gradient-based methods based on the evidence lower bound [12].
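
For reference, the model evidence and its lower bound can be written in standard notation. The symbols are assumptions introduced here, not taken from the cited works: D is the data, w the model parameters, h the hyperparameters, and q(w) a variational distribution.

```latex
\log p(D \mid h) = \log \int p(D \mid w)\, p(w \mid h)\, dw
\;\ge\; \mathbb{E}_{q(w)}\bigl[\log p(D \mid w)\bigr]
      - \mathrm{KL}\bigl(q(w) \,\|\, p(w \mid h)\bigr)
```

The gap between the two sides equals the Kullback-Leibler divergence between q(w) and the true posterior, so the bound is tight when q(w) matches the posterior.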

Hyperparameter optimization

A large part of my research is devoted to methods of estimating the parameter covariance matrix. These estimates form a criterion for selecting a particular parameter of the model or a subset of parameters, such as a neuron or a layer of a neural network. The hyperparameters here are the expectation and the covariance matrix. For some types of models, they are estimated directly; otherwise, the following three ways give decent results. The first is likelihood function optimization, for example with the model evidence or the evidence lower bound as the objective. The second is bootstrapping, which provides empirical estimates for given models [13]. The third, which I consider the most productive way to obtain the hyperparameters, is stochastic processes, starting from Metropolis-Hastings Bayesian sampling.
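
The sampling route can be illustrated with a minimal random-walk Metropolis-Hastings sketch that estimates the expectation and covariance matrix of the parameters of a linear model. Everything specific here is an assumption made for the example: the Gaussian likelihood and prior, the variances, the step size, the burn-in length, and the function names.

```python
import numpy as np


def log_posterior(w, X, y, noise_var=0.25, prior_var=10.0):
    """Unnormalised log posterior: Gaussian likelihood times Gaussian prior."""
    residual = y - X @ w
    return -0.5 * residual @ residual / noise_var - 0.5 * w @ w / prior_var


def metropolis_hastings(log_prob, w0, n_samples=6000, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    logp = log_prob(w)
    samples = np.empty((n_samples, w.size))
    for t in range(n_samples):
        proposal = w + step * rng.normal(size=w.size)
        logp_new = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < logp_new - logp:
            w, logp = proposal, logp_new
        samples[t] = w
    return samples


# Synthetic regression data with known parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.5 * rng.normal(size=100)

samples = metropolis_hastings(lambda w: log_posterior(w, X, y), w0=np.zeros(2))
kept = samples[1000:]                  # discard burn-in
w_mean = kept.mean(axis=0)             # estimate of the expectation
w_cov = np.cov(kept, rowvar=False)     # estimate of the covariance matrix
```

A parameter whose posterior mean is small relative to its estimated standard deviation would be a candidate for removal under a criterion of this kind.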

Multimodeling and knowledge transfer

To reduce the complexity of deep learning models, we transfer information about the structure, parameters, and distribution from the teacher to the student model [14]. This method is called distillation or privileged learning. We assume the student has fewer parameters than the teacher [15]. Our method of Bayesian model selection treats the prior distribution of the student parameters as the posterior distribution of the teacher parameters [16]. We align the model structures by omitting non-informative parts of the teacher model for specific data.
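
As a point of reference, the classical distillation objective of [14] can be sketched as below. This is the standard soft-target formulation, not the Bayesian method described above; the temperature, the mixing weight, and the function names are assumptions chosen for the example.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits scaled by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, weight=0.5):
    """Weighted sum of hard-label and soft-target cross-entropies."""
    n = len(labels)
    # Cross-entropy of the student with the true labels.
    hard = -np.log(softmax(student_logits)[np.arange(n), labels]).mean()
    # Cross-entropy of the student with the softened teacher distribution.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    soft = -(p_teacher * log_p_student).sum(axis=1).mean()
    # The temperature**2 factor keeps the soft-term gradients comparable in scale.
    return weight * hard + (1 - weight) * temperature**2 * soft


teacher_logits = np.array([[4.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
labels = np.array([0, 1])
loss_matched = distillation_loss(teacher_logits, teacher_logits, labels)
loss_mismatched = distillation_loss(-teacher_logits, teacher_logits, labels)
```

A student that reproduces the teacher's logits attains a lower loss than one that contradicts them, which is the signal the student is trained on.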

Spatio-temporal series and manifold learning

Multiple heterogeneous data sources require sophisticated forecasting models. These models stem from canonical correlation analysis [17]. To obtain a forecast, we build several models. First, we reconstruct the embedding manifolds of all sources to reduce the dimensionality of all data spaces. Second, we align these sources in the latent space. Third, we construct the forecasting model as a superposition of particular models for each source [9]. We use convergent cross-mapping to check that all these sources refer to the same dynamic system we model [18]. This leads us to Takens' theorem on delay embedding.
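
The delay embedding behind Takens' theorem can be sketched in a few lines. The function name and the parameters dim and lag are assumptions made for this illustration; choosing them for real signals is a separate problem.

```python
import numpy as np


def delay_embedding(series, dim, lag=1):
    """Stack lagged copies of a scalar series into trajectory points.

    Row t of the result is (x[t], x[t + lag], ..., x[t + (dim - 1) * lag]).
    """
    series = np.asarray(series, dtype=float)
    n_points = len(series) - (dim - 1) * lag
    if n_points <= 0:
        raise ValueError("series is too short for this dim and lag")
    columns = [series[i * lag : i * lag + n_points] for i in range(dim)]
    return np.column_stack(columns)


trajectory = delay_embedding(np.arange(6), dim=3, lag=2)
# trajectory is [[0, 2, 4], [1, 3, 5]] as floats
```

Convergent cross-mapping then checks whether neighbours on the trajectory reconstructed from one source predict the values of another source.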

To reconstruct embedding manifolds, we use discrete and continuous representations of time and space. In the discrete case, we use tensor decomposition [8]. Namely, tensor-train decomposition brings decent reconstruction quality and extends easily toward deep neural networks. For the continuous case, we use controlled differential equations. They reflect the continuous topology of the neural network structure and involve automatic differentiation methods. We expect a future combination of these two cases to significantly impact the modeling of heterogeneous signals.

Some practical applications, like brain-computer interfaces, require a graph representation of data. The graph describes functional groups of brain signals. It connects us to the metric structure of the space and allows using methods from Riemannian geometry, which relate the curvature of the space to the graph structure.

To align time and dimensionality in the latent space, we use dynamic time warping and the more promising dynamic barycenter analysis. This brings optimal transport theory to the research frontier. We use flows to represent and align heterogeneous measurements of a stochastic nature.
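
Dynamic time warping itself is a short dynamic program. The sketch below is the textbook version for two scalar sequences with an absolute-difference cost; the function name is an assumption, and practical use would add a window constraint and vector-valued samples.

```python
def dtw_distance(a, b):
    """Minimal cumulative cost of a monotone alignment of a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # a[i-1] matched again
                                 cost[i][j - 1],       # b[j-1] matched again
                                 cost[i - 1][j - 1])   # both advance
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # 0.0: only the timing differs
shifted = dtw_distance([0, 0], [1, 1])               # 2.0: levels differ by one
```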

References

[1] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
[2] Bradley Efron et al. Least angle regression. The Annals of Statistics, 32(2), 2004.
[3] Alexandr Katrutsa and Vadim Strijov. Comprehensive study of feature selection methods to solve multicollinearity problem according to evaluation criteria. Expert Systems with Applications, 76:1–11, 2017.
[4] Alexandr Katrutsa and Vadim Strijov. Stress test procedure for feature selection algorithms. Chemometrics and Intelligent Laboratory Systems, 142:172–183, 2015.
[5] Roman Isachenko and Vadim Strijov. Quadratic programming optimization with feature selection for non-linear models. Lobachevskii Journal of Mathematics, 39(9):1179–1187, 2018.
[6] Tamara Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[7] Julia Berezutskaya et al. Open multimodal iEEG-fMRI dataset from naturalistic stimulation with a short audiovisual film. Scientific Data, 9(1), 2022.
[8] Anastasia Motrenko and Vadim Strijov. Multi-way feature selection for ECoG-based brain-computer interface. Expert Systems with Applications, 114(30):402–413, 2018.
[9] Roman Isachenko and Vadim Strijov. Quadratic programming feature selection for multi-correlated signal decoding with partial least squares. Expert Systems with Applications, 207:117967, 2022.
[10] Christoph Mark et al. Bayesian model selection for complex dynamic systems. Nature Communications, 9(1), 2018.
[11] Oleg Bakhteev and Vadim Strijov. Deep learning model selection of suboptimal complexity. Automation and Remote Control, 79(8):1474–1488, 2018.
[12] Oleg Bakhteev and Vadim Strijov. Comprehensive analysis of gradient-based hyperparameter optimization algorithms. Annals of Operations Research, 1–15, 2020.
[13] Kuznetsov, Tokmakova, and Vadim Strijov. Analytic and stochastic methods of structure parameter estimation. Informatica, 27(3):607–624, 2016.
[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv, 2015.
[15] Andrey Grabovoy and Vadim Strijov. Probabilistic interpretation of the distillation problem. Automation and Remote Control, 83(1):123–137, 2022.
[16] Andrey Grabovoy and Vadim Strijov. Prior distribution selection for a mixture of experts. Computational Mathematics and Mathematical Physics, 61(7):1149–1161, 2021.
[17] Wolfgang Hardle and Leopold Simar. Canonical correlation analysis. Applied Multivariate Statistical Analysis, 2007.
[18] George Sugihara et al. Detecting causality in complex ecosystems. Science, 338(6106):496–500, 2012.
