Research Statement
Vadim, 2023
My research focuses on the problem of model selection in machine learning and draws on methods of applied mathematics and computer science. The central problem is to select, from a set of candidate models, the most accurate and robust forecasting model with the simplest structure. To define the algebraic structure of this set according to the application and the origin of the data, I use various tools, from tensor algebras to differential geometry. I use multivariate statistics, Bayesian inference, and graph probability to derive the quality criteria for selection. My work joins theory and practical applications. I believe multi-model decoding problems for heterogeneous data are the most promising. Forecasting a variable with a complex structure requires several models to recover dependencies in the source and target spaces and to settle the forecast. The examples to investigate are various spatio-temporal measurements. The practical applications are brain-computer interfaces, health monitoring with wearable devices, and other signals in biology and physics.
Model complexity and dimensionality reduction
Model selection is a fundamental problem of machine learning. Lasso and LARS [1,2] significantly impacted the statistical methods of forecasting. Time series data have multiple dependencies between components, which leads to redundancy in the design space. Our work shows that quadratic programming feature selection algorithms deliver less redundant and more stable models [3,4]. This approach considers the mutual information of the features and selects a model according to relevance and similarity criteria. The main idea is to minimize mutual dependence and maximize approximation quality by varying an indicator function that encodes the model structure. For linear models, we use the quadratic programming problem statement. For neural networks, we use Newton-type methods [5].
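The relevance-versus-similarity trade-off can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from [3,4]: squared Pearson correlation stands in for mutual information, the binary indicator vector is relaxed to non-negative weights that sum to one, and a simple Frank-Wolfe loop replaces a dedicated quadratic programming solver. The name qpfs_weights, the trade-off parameter alpha, and the toy data are hypothetical.

```python
import numpy as np

def qpfs_weights(X, y, alpha=0.5, n_iter=5000):
    """Minimize (1 - alpha) * a^T Q a - alpha * b^T a over the simplex.

    Q measures pairwise feature similarity, b measures feature relevance.
    Both use squared Pearson correlation (an assumption of this sketch).
    """
    n_features = X.shape[1]
    # The elementwise square of a correlation matrix stays positive semidefinite.
    Q = np.corrcoef(X, rowvar=False) ** 2
    b = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(n_features)])
    a = np.full(n_features, 1.0 / n_features)  # start at the simplex center
    for k in range(n_iter):
        grad = 2.0 * (1.0 - alpha) * Q @ a - alpha * b
        vertex = np.zeros(n_features)
        vertex[np.argmin(grad)] = 1.0           # best simplex vertex for this gradient
        a += 2.0 / (k + 2.0) * (vertex - a)     # standard Frank-Wolfe step size
    return a

# Toy data: x1 nearly duplicates x0, x2 is independent and relevant, x3 is noise.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + 0.01 * rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = x0 + x2 + 0.1 * rng.normal(size=n)

weights = qpfs_weights(X, y)
print(np.round(weights, 3))
```

In this toy setting the redundant pair x0, x1 has to share weight, while the irrelevant feature x3 receives less weight than the independent relevant feature x2. Thresholding the weights would give one possible indicator of the model structure.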
Tensor decomposition for model selection
We extend this approach to tensor decomposition [6] to reduce the dimensionality of brain signals [7]. We use a multi-way structure to predict hand trajectories from the spatial time series of cortical activity [8]. These data reside in the spatial, spectral, and temporal domains. Since they are highly correlated across the domains, the redundancy and dimensionality of the feature space become a major obstacle to a robust solution of the regression problem, both in the multi-way and in the flat case. The proposed method extends the quadratic programming feature selection approach [9]. We do not have to optimize the model's parameters to get an optimal model structure.
Bayesian model selection
Bayesian model selection relies on the analysis of the model parameters [10]. To select models, we use several types of parameters, including structure parameters and hyperparameters [11]. The former define the structure of the model, that is, its computational graph. The latter describe the distribution of the model parameters and yield the criterion for selecting the optimal model. Proper hyperparameter values prevent the model from overfitting. Optimization of neural networks with a large number of hyperparameters is computationally expensive. We proposed modifications of various gradient-based methods based on the evidence lower bound [12].
Hyperparameter optimization
A large part of my research is devoted to methods of estimating the parameter covariance matrix. These estimates form a criterion to select a particular parameter of the model or a subset of parameters, such as a neuron or a layer of a neural network. The hyperparameters here are the expectation and the covariance matrix. For some types of models, they are estimated directly; otherwise, the following three approaches give decent results. The first is likelihood function optimization, for example, maximizing the model evidence or the evidence lower bound. The second is bootstrapping, which provides empirical estimates for a given model [13]. The third, which I consider the most productive way to get the hyperparameters, is stochastic sampling, starting from Metropolis-Hastings Bayesian sampling.
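As an illustration of the sampling route, here is a minimal sketch of random-walk Metropolis-Hastings that estimates the expectation and covariance matrix of the parameters of a linear model. The Gaussian prior, the known noise variance, the step size, the burn-in length, and the toy data are all assumptions made for this example. It is not the specific procedure used in the cited work.

```python
import numpy as np

def log_posterior(w, X, y, noise_var=0.25, prior_var=10.0):
    """Unnormalized log posterior: Gaussian likelihood times Gaussian prior."""
    resid = y - X @ w
    return -0.5 * resid @ resid / noise_var - 0.5 * w @ w / prior_var

def metropolis_hastings(log_prob, w0, n_samples=20000, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    current = log_prob(w)
    samples = np.empty((n_samples, w.size))
    for i in range(n_samples):
        proposal = w + step * rng.normal(size=w.size)
        candidate = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < candidate - current:
            w, current = proposal, candidate
        samples[i] = w
    return samples

# Toy regression data with known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -0.7])
y = X @ w_true + 0.5 * rng.normal(size=200)

samples = metropolis_hastings(lambda w: log_posterior(w, X, y), np.zeros(2))
kept = samples[5000:]                  # discard burn-in
post_mean = kept.mean(axis=0)          # estimate of the expectation
post_cov = np.cov(kept, rowvar=False)  # estimate of the covariance matrix
print(post_mean)
print(post_cov)
```

For this conjugate toy model the posterior is available in closed form, so the sampled mean can be checked against the analytic one. The sampler becomes useful when no such closed form exists.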
Multimodeling and knowledge transfer
To reduce the complexity of deep learning models, we transfer information about the structure, parameters, and distribution from the teacher to the student model [14]. This method is called distillation or privileged learning. We assume the student has fewer parameters than the teacher [15]. Our method of Bayesian model selection treats the prior distribution of the student parameters as the posterior distribution of the teacher parameters [16]. We align the model structures by omitting non-informative parts of the teacher model for specific data.
Spatio-temporal series and manifold learning
Multiple heterogeneous data sources require sophisticated forecasting models. These models originate from canonical correlation analysis. To obtain a forecast, we build several models. First, we reconstruct the embedding manifolds of all sources to reduce the dimensionality of all data spaces. Second, we align these sources in the latent space. Third, we construct the forecasting model as a superposition of the individual models for each source. We use convergent cross-mapping to guarantee that all these sources refer to the same dynamic system we model. This brings us to Takens' theorem on delay embedding.
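The delay embedding behind Takens' theorem can be sketched in a few lines. The function name delay_embedding, the embedding dimension, the lag, and the sine-wave signal are illustrative choices for this example only.

```python
import numpy as np

def delay_embedding(series, dim, lag):
    """Map a scalar series x_t to vectors (x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag})."""
    series = np.asarray(series, dtype=float)
    n_vectors = series.size - (dim - 1) * lag
    if n_vectors <= 0:
        raise ValueError("series is too short for this dim and lag")
    # Column i holds the series shifted by i * lag samples.
    return np.column_stack(
        [series[i * lag : i * lag + n_vectors] for i in range(dim)]
    )

# A sine wave observed over ten periods; a lag of about a quarter period
# turns the scalar signal into points lying close to the unit circle.
t = np.linspace(0.0, 20.0 * np.pi, 2000)
signal = np.sin(t)
embedded = delay_embedding(signal, dim=2, lag=50)
print(embedded.shape)
```

In this toy case the reconstructed trajectory is a closed curve, which reflects the periodic dynamics of the underlying system. For real measurements, the dimension and the lag have to be chosen from the data.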
To reconstruct embedding manifolds, we use discrete and continuous representations of time and space. In the discrete case, we use tensor decomposition. Namely, tensor-train decomposition brings decent reconstruction quality and easily extends toward deep neural networks. For the continuous case, we use controlled differential equations. They reflect the continuous topology of the neural network structure and involve automatic differentiation methods. I expect that combining these two cases will significantly impact the modeling of heterogeneous signals.
Some practical applications, like brain-computer interfaces, require a graph representation of data. The graph describes functional groups of brain signals. This leads to the metric structure of the space and allows using methods from Riemannian geometry, which connect the curvature of the space with the graph structure.
To align time and dimensionality in the latent space, we use dynamic time warping and the more promising dynamic barycenter analysis. This brings optimal transport theory to the research frontier. We use flows to represent and align heterogeneous measurements of a stochastic nature.
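For reference, the basic dynamic time warping recurrence can be sketched as follows. This is the textbook dynamic-programming form with the absolute difference as the local cost and no window constraint. The name dtw_distance is illustrative, and the sketch does not cover barycenter analysis or optimal transport.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] is the best alignment cost of a[:i] and b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i, j] = local + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[n, m]

same_shape = dtw_distance([0, 1, 2], [0, 1, 1, 2])  # second series repeats one sample
shifted = dtw_distance([0, 0], [1, 1])              # constant offset of one
print(same_shape, shifted)
```

The first pair differs only in timing, so warping aligns it at zero cost. The second pair differs in amplitude, which warping cannot remove.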