# Optimizers

## Overview

The scikit-learn library provides functionality for training linear models and a large number of related tools. The present module provides simplified interfaces for various linear model regression methods. These methods are set up to work out of the box for typical problems in cluster expansion and force constant potential construction, which includes slight adjustments to scikit-learn default values. If you need more flexibility, extended functionality, or the ability to fine-tune parameters that are not included in this interface, you can use scikit-learn directly.

The most commonly used fit methods in the present context are LASSO, automatic relevance determination regression (ARDR), recursive feature elimination with $$\ell_2$$-fitting (RFE-L2) as well as ordinary least-squares optimization (OLS). Below follows a short summary of the main algorithms. More information about the available linear models can be found in the scikit-learn documentation.

### Least-squares

Ordinary least-squares (OLS) optimization provides a solution to the linear problem

$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y},$

where $$\boldsymbol{A}$$ is the sensing matrix, $$\boldsymbol{y}$$ is the vector of target values, and $$\boldsymbol{x}$$ is the solution (parameter vector) that one seeks to obtain. The objective is given by

$\left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2_2$

The OLS method is chosen by setting the fit_method keyword to least-squares.
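The objective above can be illustrated with plain NumPy. This is a minimal sketch on made-up toy data, not the code path used by the module itself; the variable names are chosen for illustration only.

```python
import numpy as np

# Toy overdetermined problem (assumed data): 5 target values, 2 parameters.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
x_true = np.array([0.5, 2.0])
y = A @ x_true  # noise-free targets

# np.linalg.lstsq returns the x that minimizes ||A x - y||_2^2.
x_ols, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# Value of the OLS objective at the solution.
objective = np.sum((A @ x_ols - y) ** 2)
```

Because the targets are noise-free and the matrix has full column rank, the fitted parameters should coincide with `x_true` up to rounding.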

### LASSO

The least absolute shrinkage and selection operator (LASSO) is a method for performing variable selection and regularization in problems in statistics and machine learning. The optimization objective is given by

$\frac{1}{2 n_\text{samples}} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2_2 + \alpha \Vert\boldsymbol{x}\Vert_1.$

While the first term ensures that $$\boldsymbol{x}$$ is a solution to the linear problem at hand, the second term introduces regularization and guides the algorithm toward finding sparse solutions, in the spirit of compressive sensing. In general, LASSO is suited for solving strongly underdetermined problems.

The LASSO optimizer is chosen by setting the fit_method keyword to lasso. The $$\alpha$$ parameter is set via the alpha keyword. If no value is specified a line scan will be carried out automatically to determine the optimal value.

| Parameter | Type | Description | Default |
|---|---|---|---|
| alpha | float | controls the sparsity of the solution vector | None |
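To make the role of $$\alpha$$ concrete, the sketch below minimizes the LASSO objective stated above with a basic proximal gradient (ISTA) loop. This is an illustrative stand-in, not the solver used by scikit-learn or by this module; the function names and the toy data are assumptions.

```python
import numpy as np


def lasso_objective(A, x, y, alpha):
    """Evaluate 1/(2 n_samples) * ||A x - y||_2^2 + alpha * ||x||_1."""
    n_samples = A.shape[0]
    return np.sum((A @ x - y) ** 2) / (2 * n_samples) + alpha * np.sum(np.abs(x))


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks entries toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(A, y, alpha, n_iters=5000):
    """Minimize the LASSO objective by iterative soft thresholding."""
    n_samples, n_params = A.shape
    # Lipschitz constant of the gradient of the quadratic term.
    lipschitz = np.linalg.norm(A, 2) ** 2 / n_samples
    x = np.zeros(n_params)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y) / n_samples
        x = soft_threshold(x - grad / lipschitz, alpha / lipschitz)
    return x


# Underdetermined toy problem (assumed data): 20 target values, 50 parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
y = A @ x_true

x_lasso = lasso_ista(A, y, alpha=0.05)
```

Larger values of `alpha` push more entries of the solution to exactly zero; once `alpha` exceeds `max(|A.T @ y|) / n_samples`, the all-zero vector is the minimizer.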

### Automatic relevance determination regression (ARDR)

Automatic relevance determination regression (ARDR) is an optimization algorithm provided by scikit-learn that is similar to Bayesian Ridge Regression, which provides a probabilistic model of the regression problem at hand. The method is also known as Sparse Bayesian Learning and Relevance Vector Machine.

The ARDR optimizer is chosen by setting the fit_method keyword to ardr. The threshold lambda parameter, which controls the sparsity of the solution vector, is set via the threshold_lambda keyword (default: 1e6).

| Parameter | Type | Description | Default |
|---|---|---|---|
| threshold_lambda | float | controls the sparsity of the solution vector | 1e6 |

### split-Bregman

The split-Bregman method [GolOsh09] is designed to solve a broad class of $$\ell_1$$-regularized problems. The solution vector $$\boldsymbol{x}$$ is given by

$\boldsymbol{x} = \arg\min_{\boldsymbol{x}, \boldsymbol{d}} \left\Vert\boldsymbol{d}\right\Vert_1 + \frac{1}{2} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2 + \frac{\lambda}{2} \left\Vert\boldsymbol{d} - \mu \boldsymbol{x} \right\Vert^2,$

where $$\boldsymbol{d}$$ is an auxiliary quantity, while $$\mu$$ and $$\lambda$$ are hyperparameters that control the sparseness of the solution and the efficiency of the algorithm.

The split-Bregman implementation supports the following additional keywords.

| Parameter | Type | Description | Default |
|---|---|---|---|
| mu | float | sparseness parameter | 1e-3 |
| lmbda | float | weight of additional L2-norm in split-Bregman | 100 |
| n_iters | int | maximal number of split-Bregman iterations | 1000 |
| tol | float | convergence criterion for the iterative minimization | 1e-6 |
| verbose | bool | print additional information to stdout | False |

### Recursive feature elimination

Recursive feature elimination (RFE) is a feature selection algorithm that obtains the optimal features by carrying out a series of fits, starting with the full set of parameters and then iteratively eliminating the less important ones. RFE needs to be combined with a specific fit method. Since RFE may require many hundreds of single fits, it is often advisable to use ordinary least-squares as the training method, which is the default behavior. The present implementation is based on the implementation of feature selection in scikit-learn.

The RFE optimizer is chosen by setting the fit_method keyword to rfe. The n_features keyword allows one to specify the number of features to select. If this parameter is left unspecified RFE with cross-validation will be used to determine the optimal number of features.

After the optimal number of features has been determined the final model is trained. The fit method for the final fit can be controlled via final_estimator. Here, estimator and final_estimator can be set to any of the fit methods described in this section. For example, estimator='lasso' implies that a LASSO-CV scan is carried out for each fit in the RFE algorithm.

| Parameter | Type | Description | Default |
|---|---|---|---|
| n_features | int | number of features to select | None |
| step | int or float | if int, number of parameters to eliminate; if float, percentage of parameters to eliminate | 0.04 |
| cv_splits | int | number of CV splits (90/10) used when optimizing n_features | 5 |
| estimator | str | fit method to be used in RFE algorithm | 'least-squares' |
| final_estimator | str | fit method to be used in the final fit | = estimator |
| estimator_kwargs | dict | keyword arguments for fit method defined by estimator | {} |
| final_estimator_kwargs | dict | keyword arguments for fit method defined by final_estimator | {} |
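The elimination loop described in this section can be sketched with ordinary least-squares as the inner fit method. This is a simplified illustration rather than the scikit-learn implementation: the function name is hypothetical, `step` is treated as an integer count, and features are ranked by the absolute value of their coefficient, which is an assumption.

```python
import numpy as np


def rfe_least_squares(A, y, n_features, step=1):
    """Drop the `step` least important columns per round until n_features remain."""
    selected = np.arange(A.shape[1])
    while len(selected) > n_features:
        # Fit using only the currently selected columns.
        x, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
        n_drop = min(step, len(selected) - n_features)
        # Keep everything except the n_drop smallest |coefficients|.
        keep = np.argsort(np.abs(x))[n_drop:]
        selected = np.sort(selected[keep])
    # Final fit on the surviving features; eliminated parameters stay zero.
    x_final = np.zeros(A.shape[1])
    x_selected, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
    x_final[selected] = x_selected
    return selected, x_final


# Toy data (assumed): only parameters 1, 4 and 6 contribute to the targets.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 8))
x_true = np.zeros(8)
x_true[[1, 4, 6]] = [1.5, -2.0, 1.0]
y = A @ x_true

selected, x_rfe = rfe_least_squares(A, y, n_features=3, step=2)
```

In this noise-free example the irrelevant parameters have coefficients close to zero in every round, so they are the ones removed.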

Note

When running on multi-core systems please be mindful of memory consumption. By default all CPUs will be used (n_jobs=-1), which will duplicate data and can require a lot of memory, potentially giving rise to errors. To prevent this behavior you can set the [n_jobs parameter](https://scikit-learn.org/stable/glossary.html#term-n-jobs) explicitly, which is handed over directly to scikit-learn.

### Other methods

The optimizers furthermore support the ridge method (ridge), the elastic net method (elasticnet) as well as Bayesian ridge regression (bayesian-ridge).

## Optimizer

class icet.fitting.Optimizer(fit_data: Tuple[numpy.ndarray, numpy.ndarray], fit_method: str = 'least-squares', standardize: bool = True, train_size: Union[int, float] = 0.75, test_size: Optional[Union[int, float]] = None, train_set: Optional[Union[Tuple[int], List[int]]] = None, test_set: Optional[Union[Tuple[int], List[int]]] = None, check_condition: bool = True, seed: int = 42, **kwargs)

This optimizer finds a solution to the linear $$\boldsymbol{A}\boldsymbol{x}=\boldsymbol{y}$$ problem.

One has to specify either train_size/test_size or train_set/test_set. If either train_set or test_set (or both) is specified, the fractions will be ignored.

Warning

Repeatedly setting up an Optimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple Optimizer instances.

Parameters
• fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

• fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe”, “split-bregman”

• standardize (bool) – if True the fit matrix and target values are standardized before fitting, meaning columns in the fit matrix and the target values are rescaled to have a standard deviation of 1.0.

• train_size (float or int) – if float, represents the fraction of fit_data (rows) to be used for training; if int, represents the absolute number of rows to be used for training

• test_size (float or int) – if float, represents the fraction of fit_data (rows) to be used for testing; if int, represents the absolute number of rows to be used for testing

• train_set (tuple or list(int)) – indices of rows of A/y to be used for training

• test_set (tuple or list(int)) – indices of rows of A/y to be used for testing

• check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)

• seed (int) – seed for pseudo random number generator
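The interplay of train_size, seed, and standardize can be sketched in NumPy as follows. The helper names are hypothetical and the actual implementation may differ in detail (for example in how rows are shuffled); the sketch only mirrors the behavior described in the parameter list above.

```python
import numpy as np


def split_rows(n_rows, train_size=0.75, seed=42):
    """Seeded random split of row indices into a training and a test set."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_rows)
    # A float is interpreted as a fraction, an int as an absolute row count.
    if isinstance(train_size, float):
        n_train = int(round(train_size * n_rows))
    else:
        n_train = train_size
    return np.sort(permutation[:n_train]), np.sort(permutation[n_train:])


def standardize(A, y):
    """Rescale the columns of A and the targets y to unit standard deviation."""
    col_std = A.std(axis=0)
    col_std[col_std == 0] = 1.0  # leave constant columns unscaled
    y_std = y.std()
    return A / col_std, y / y_std, col_std, y_std


# Toy data (assumed) with columns of very different magnitude.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3)) * np.array([1.0, 10.0, 0.1])
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true

train_rows, test_rows = split_rows(len(y), train_size=0.75, seed=42)

# Fit in standardized units, then transform the parameters back.
A_s, y_s, col_std, y_std = standardize(A[train_rows], y[train_rows])
x_s, *_ = np.linalg.lstsq(A_s, y_s, rcond=None)
x_fit = x_s * y_std / col_std
```

Calling `split_rows` twice with the same seed returns the same split, which is the reason for the warning above about reusing a seed across instances.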

train_scatter_data

target and predicted value for each row in the training set

Type

ScatterData

test_scatter_data

target and predicted value for each row in the test set

Type

ScatterData

compute_rmse(A: numpy.ndarray, y: numpy.ndarray) → float

Returns the root mean squared error (RMSE) using $$\boldsymbol{A}$$, $$\boldsymbol{y}$$, and the vector of fitted parameters $$\boldsymbol{x}$$, corresponding to $$\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2 / \sqrt{N}$$, where $$N$$ is the number of target values.

Parameters
• A – fit matrix (N,M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)

• y – vector of target values
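As a quick numerical illustration, the RMSE is the square root of the mean squared residual, i.e., the 2-norm of the residual vector divided by the square root of the number of target values. The standalone function below is a sketch that takes the parameter vector explicitly, unlike the method documented here, which uses the fitted parameters of the optimizer.

```python
import numpy as np


def compute_rmse(A, x, y):
    """Root mean squared error of the prediction A x with respect to y."""
    delta = A @ x - y
    return np.sqrt(np.mean(delta ** 2))


# Toy data (assumed): every prediction is off by exactly one.
A = np.eye(4)
x = np.zeros(4)
y = np.ones(4)

rmse = compute_rmse(A, x, y)
residual_norm = np.linalg.norm(A @ x - y)
```

Here `rmse` evaluates to 1.0 while `residual_norm` is 2.0, which differ by the factor `sqrt(4)`.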

property contributions_test: numpy.ndarray

average contribution to the predicted values for the test set from each parameter

property contributions_train: numpy.ndarray

average contribution to the predicted values for the train set from each parameter

property fit_method: str

fit method

get_contributions(A: numpy.ndarray) → numpy.ndarray

Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.

Parameters

A – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

property n_nonzero_parameters: int

number of non-zero parameters

property n_parameters: int

number of parameters (=columns in A matrix)

property n_target_values: int

number of target values (=rows in A matrix)

property parameters: numpy.ndarray

copy of parameter vector

property parameters_norm: float

the norm of the parameters

predict(A: numpy.ndarray) → Union[numpy.ndarray, float]

Predicts data given an input matrix $$\boldsymbol{A}$$, i.e., $$\boldsymbol{A}\boldsymbol{x}$$, where $$\boldsymbol{x}$$ is the vector of the fitted parameters. The method returns the vector of predicted values or a float if a single row is provided as input.

Parameters

A – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

property rmse_test: float

root mean squared error for test set

property rmse_train: float

root mean squared error for training set

property seed: int

seed used to initialize pseudo random number generator

property standardize: bool

if True standardize the fit matrix before fitting

property summary: Dict[str, Any]

comprehensive information about the optimizer

property test_fraction: float

fraction of rows included in test set

property test_set: List[int]

indices of rows included in the test set

property test_size: int

number of rows included in test set

train() → None

Carries out training.

property train_fraction: float

fraction of rows included in training set

property train_set: List[int]

indices of rows included in the training set

property train_size: int

number of rows included in training set

write_summary(fname: str)

Writes summary dict to file

## EnsembleOptimizer

class icet.fitting.EnsembleOptimizer(fit_data: Tuple[numpy.ndarray, numpy.ndarray], fit_method: str = 'least-squares', standardize: bool = True, ensemble_size: int = 50, train_size: Union[int, float] = 1.0, bootstrap: bool = True, check_condition: bool = True, seed: int = 42, **kwargs)

The ensemble optimizer carries out a series of single optimization runs using the Optimizer class in order to solve the linear $$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y}$$ problem. Subsequently, it provides access to various ensemble averaged quantities such as errors and parameters.

Warning

Repeatedly setting up an EnsembleOptimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple EnsembleOptimizer instances.

Parameters
• fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

• fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe”, “split-bregman”

• standardize (bool) – if True the fit matrix and target values are standardized before fitting, meaning columns in the fit matrix and the target values are rescaled to have a standard deviation of 1.0.

• ensemble_size (int) – number of fits in the ensemble

• train_size (float or int) – if float represents the fraction of fit_data (rows) to be used for training; if int, represents the absolute number of rows to be used for training

• bootstrap (bool) – if True sampling will be carried out with replacement

• check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)

• seed (int) – seed for pseudo random number generator
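The ensemble procedure controlled by ensemble_size, train_size, and bootstrap can be sketched as a loop over resampled least-squares fits. The function name is hypothetical and the sketch uses plain NumPy fits in place of the Optimizer class; it is meant to convey the idea, not to reproduce the implementation.

```python
import numpy as np


def train_ensemble(A, y, ensemble_size=50, train_size=1.0, bootstrap=True, seed=42):
    """Fit one least-squares model per resampled training set."""
    rng = np.random.default_rng(seed)
    n_rows = A.shape[0]
    # A float is interpreted as a fraction, an int as an absolute row count.
    if isinstance(train_size, float):
        n_train = int(round(train_size * n_rows))
    else:
        n_train = train_size
    parameter_vectors = []
    for _ in range(ensemble_size):
        # With bootstrap=True rows are drawn with replacement.
        rows = rng.choice(n_rows, size=n_train, replace=bootstrap)
        x, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
        parameter_vectors.append(x)
    parameter_vectors = np.array(parameter_vectors)
    # Final model: average over the ensemble; spread: per-parameter std.
    return parameter_vectors.mean(axis=0), parameter_vectors.std(axis=0), parameter_vectors


# Toy data (assumed) with a small amount of noise on the targets.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.normal(size=40)

parameters, parameters_std, parameter_vectors = train_ensemble(A, y)
```

The per-parameter standard deviation gives a rough measure of how sensitive each parameter is to the choice of training rows.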

property bootstrap: bool

True if sampling is carried out with replacement

compute_rmse(A: numpy.ndarray, y: numpy.ndarray) → float

Returns the root mean squared error (RMSE) using $$\boldsymbol{A}$$, $$\boldsymbol{y}$$, and the vector of fitted parameters $$\boldsymbol{x}$$, corresponding to $$\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2 / \sqrt{N}$$, where $$N$$ is the number of target values.

Parameters
• A – fit matrix (N,M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)

• y – vector of target values

property ensemble_size: int

number of train rounds

property error_matrix: numpy.ndarray

matrix of fit errors of shape (N, M), where N is the number of target values and M is the number of fits (i.e., the size of the ensemble)

property fit_method: str

fit method

get_contributions(A: numpy.ndarray) → numpy.ndarray

Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.

Parameters

A – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

property n_nonzero_parameters: int

number of non-zero parameters

property n_parameters: int

number of parameters (=columns in A matrix)

property n_target_values: int

number of target values (=rows in A matrix)

property parameter_vectors: List[numpy.ndarray]

all parameter vectors in the ensemble

property parameters: numpy.ndarray

copy of parameter vector

property parameters_norm: float

the norm of the parameters

property parameters_std: numpy.ndarray

standard deviation for each parameter

predict(A: numpy.ndarray, return_std: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]

Predicts data given an input matrix $$\boldsymbol{A}$$, i.e., $$\boldsymbol{A}\boldsymbol{x}$$, where $$\boldsymbol{x}$$ is the vector of the fitted parameters. The method returns the vector of predicted values and optionally also the vector of standard deviations.

By using all parameter vectors in the ensemble a standard deviation of the prediction can be obtained.

Parameters
• A – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

• return_std – whether or not to return the standard deviation of the prediction
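How a standard deviation of the prediction follows from the parameter vectors in the ensemble can be sketched as below. The function name and the toy numbers are assumptions, and the use of the population standard deviation (NumPy's default) is a choice made for this sketch.

```python
import numpy as np


def predict_with_std(A, parameter_vectors):
    """Mean and standard deviation of A x over all x in the ensemble."""
    # One column of predictions per ensemble member: shape (N, ensemble_size).
    predictions = A @ parameter_vectors.T
    return predictions.mean(axis=1), predictions.std(axis=1)


# Toy ensemble (assumed) consisting of two parameter vectors.
parameter_vectors = np.array([[1.0, 0.0],
                              [3.0, 0.0]])
A = np.array([[1.0, 5.0],
              [2.0, 7.0]])

mean_prediction, std_prediction = predict_with_std(A, parameter_vectors)
```

Because the model is linear, the mean prediction equals the prediction obtained from the ensemble-averaged parameter vector.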

property rmse_test: float

ensemble average of root mean squared error over test sets

property rmse_test_ensemble: numpy.ndarray

root mean squared test errors obtained for each fit in the ensemble

property rmse_train: float

ensemble average of root mean squared error over train sets

property rmse_train_ensemble: numpy.ndarray

root mean squared train errors obtained for each fit in the ensemble

property seed: int

seed used to initialize pseudo random number generator

property standardize: bool

if True standardize the fit matrix before fitting

property summary: Dict[str, Any]

comprehensive information about the optimizer

train() → None

Carries out ensemble training and constructs the final model by averaging over all models in the ensemble.

property train_fraction: float

fraction of input data used for training; this value can differ slightly from the value set during initialization due to rounding

property train_size: int

number of rows included in train sets; note that this will be different from the number of unique rows if bootstrapping

write_summary(fname: str)

Writes summary dict to file

## CrossValidationEstimator

class icet.fitting.CrossValidationEstimator(fit_data: Tuple[numpy.ndarray, numpy.ndarray], fit_method: str = 'least-squares', standardize: bool = True, validation_method: str = 'k-fold', n_splits: int = 10, check_condition: bool = True, seed: int = 42, **kwargs)

This class provides an optimizer with cross validation for solving the linear $$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y}$$ problem. Cross-validation (CV) scores are calculated by splitting the available reference data in multiple different ways. It also produces the finalized model (using the full input data) for which the CV score is an estimation of its performance.

Warning

Repeatedly setting up a CrossValidationEstimator and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple CrossValidationEstimator instances.

Parameters
• fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

• fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe”, “split-bregman”

• standardize (bool) – if True the fit matrix and target values are standardized before fitting, meaning columns in the fit matrix and the target values are rescaled to have a standard deviation of 1.0.

• validation_method (str) – method to use for cross-validation; possible choices are “shuffle-split”, “k-fold”

• n_splits (int) – number of times the fit data set will be split for the cross-validation

• check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)

• seed (int) – seed for pseudo random number generator
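The k-fold procedure can be sketched as follows: the rows are split into n_splits folds, each fold serves once as validation set while the remaining rows are used for training, and the final model is fitted to all rows. The function name is hypothetical, least-squares stands in for the chosen fit method, and averaging the per-split errors with a plain arithmetic mean is an assumption of this sketch.

```python
import numpy as np


def k_fold_cv_rmse(A, y, n_splits=10, seed=42):
    """Per-split validation RMSE plus the final model trained on all rows."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(A.shape[0]), n_splits)
    rmse_validation_splits = []
    for i, validation_rows in enumerate(folds):
        # Train on all folds except the i-th one.
        train_rows = np.concatenate([f for j, f in enumerate(folds) if j != i])
        x, *_ = np.linalg.lstsq(A[train_rows], y[train_rows], rcond=None)
        delta = A[validation_rows] @ x - y[validation_rows]
        rmse_validation_splits.append(np.sqrt(np.mean(delta ** 2)))
    # Final model uses the full input data.
    x_final, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.array(rmse_validation_splits), x_final


# Toy data (assumed) with a small amount of noise on the targets.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))
x_true = np.array([1.0, -0.5, 2.0, 0.0])
y = A @ x_true + 0.01 * rng.normal(size=50)

rmse_validation_splits, x_final = k_fold_cv_rmse(A, y, n_splits=10)
rmse_validation = rmse_validation_splits.mean()  # assumed averaging convention
```

The averaged validation error serves as an estimate of how well the final model, which has seen all rows, performs on unseen data.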

train_scatter_data

contains target and predicted values from each individual training set in the cross-validation split; ScatterData is a namedtuple.

Type

ScatterData

validation_scatter_data

contains target and predicted values from each individual validation set in the cross-validation split; ScatterData is a namedtuple.

Type

ScatterData

compute_rmse(A: numpy.ndarray, y: numpy.ndarray) → float

Returns the root mean squared error (RMSE) using $$\boldsymbol{A}$$, $$\boldsymbol{y}$$, and the vector of fitted parameters $$\boldsymbol{x}$$, corresponding to $$\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2 / \sqrt{N}$$, where $$N$$ is the number of target values.

Parameters
• A – fit matrix (N,M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)

• y – vector of target values

property fit_method: str

fit method

get_contributions(A: numpy.ndarray) → numpy.ndarray

Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.

Parameters

A – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

property n_nonzero_parameters: int

number of non-zero parameters

property n_nonzero_parameters_splits: numpy.ndarray

number of non-zero parameters for each split

property n_parameters: int

number of parameters (=columns in A matrix)

property n_splits: int

number of splits (folds) used for cross-validation

property n_target_values: int

number of target values (=rows in A matrix)

property parameters: numpy.ndarray

copy of parameter vector

property parameters_norm: float

the norm of the parameters

property parameters_splits: numpy.ndarray

all parameters obtained during cross-validation

predict(A: numpy.ndarray) → Union[numpy.ndarray, float]

Predicts data given an input matrix $$\boldsymbol{A}$$, i.e., $$\boldsymbol{A}\boldsymbol{x}$$, where $$\boldsymbol{x}$$ is the vector of the fitted parameters. The method returns the vector of predicted values or a float if a single row is provided as input.

Parameters

A – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

property rmse_train: float

average root mean squared training error obtained during cross-validation

property rmse_train_final: float

root mean squared error when using the full set of input data

property rmse_train_splits: numpy.ndarray

root mean squared training errors obtained during cross-validation

property rmse_validation: float

average root mean squared cross-validation error

property rmse_validation_splits: numpy.ndarray

root mean squared validation errors obtained during cross-validation

property seed: int

seed used to initialize pseudo random number generator

property standardize: bool

if True standardize the fit matrix before fitting

property summary: Dict[str, Any]

comprehensive information about the optimizer

train() → None

Constructs the final model using all input data available.

validate() → None

Runs validation.

property validation_method: str

validation method name

write_summary(fname: str)

Writes summary dict to file