Optimizers¶
Overview¶
The scikit-learn library provides functionality for training linear models and a large number of related tools. The present module provides simplified interfaces for various linear model regression methods. These methods are set up so that they work out of the box for typical problems in cluster expansion and force constant potential construction, including slight adjustments to scikit-learn default values. If you need more flexibility, extended functionality or the ability to fine-tune parameters that are not included in this interface, it is possible to use scikit-learn directly.
The most commonly used fit methods in the present context are LASSO, automatic relevance determination regression (ARDR), recursive feature elimination with \(\ell_2\)-fitting (RFE-L2) as well as ordinary least-squares optimization (OLS). Below follows a short summary of the main algorithms. More information about the available linear models can be found in the scikit-learn documentation.
Least-squares¶
Ordinary least-squares (OLS) optimization provides a solution to the linear problem
\[\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y},\]
where \(\boldsymbol{A}\) is the sensing matrix, \(\boldsymbol{y}\) is the vector of target values, and \(\boldsymbol{x}\) is the solution (parameter vector) that one seeks to obtain. The objective is given by
\[\left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert_2^2.\]
The OLS method is chosen by setting the fit_method keyword to least-squares.
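As a minimal sketch, independent of the interface described here, minimizing \(\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\Vert_2^2\) can be illustrated with NumPy's `numpy.linalg.lstsq`. The sensing matrix and target values below are synthetic and serve only as an illustration.

```python
import numpy as np

# Synthetic, illustrative data: 20 target values, 3 parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true  # noise-free target values

# Minimize ||A x - y||_2^2.
x_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Value of the objective at the solution.
objective = np.sum((A @ x_ols - y) ** 2)
```

Since the target values are noise-free and the problem is overdetermined, the fitted parameters reproduce the parameters used to generate the data.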
LASSO¶
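The least absolute shrinkage and selection operator (LASSO) is a method for performing variable selection and regularization in problems in statistics and machine learning. The optimization objective is given by
\[\frac{1}{2 n_\text{samples}} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert_2^2 + \alpha \left\Vert\boldsymbol{x}\right\Vert_1.\]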
While the first term ensures that \(\boldsymbol{x}\) is a solution to the linear problem at hand, the second term introduces regularization and guides the algorithm toward finding sparse solutions, in the spirit of compressive sensing. In general, LASSO is suited for solving strongly underdetermined problems.
The LASSO optimizer is chosen by setting the fit_method keyword to lasso. The \(\alpha\) parameter is set via the alpha keyword. If no value is specified, a line scan will be carried out automatically to determine the optimal value.
| Parameter | Type | Description | Default |
|---|---|---|---|
| alpha | float | controls the sparsity of the solution vector | None |
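The effect of \(\alpha\) on the sparsity of the solution can be illustrated with a small proximal-gradient (ISTA) solver for the objective \(\frac{1}{2N}\Vert\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\Vert_2^2 + \alpha\Vert\boldsymbol{x}\Vert_1\), where \(N\) is the number of rows of \(\boldsymbol{A}\). This is a sketch for illustration only, not the solver used by scikit-learn or by this module; the function names `soft_threshold` and `lasso_ista` as well as the data are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(A, y, alpha, n_iters=5000):
    """Minimize 1/(2N) ||A x - y||_2^2 + alpha ||x||_1 by proximal gradient."""
    n_rows, n_cols = A.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth term.
    step = n_rows / np.linalg.norm(A, 2) ** 2
    x = np.zeros(n_cols)
    for _ in range(n_iters):
        gradient = A.T @ (A @ x - y) / n_rows
        x = soft_threshold(x - step * gradient, step * alpha)
    return x


# Synthetic, illustrative problem with a sparse solution.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
y = A @ x_true

x_weak = lasso_ista(A, y, alpha=1e-3)   # weak regularization
x_strong = lasso_ista(A, y, alpha=1e3)  # strong regularization
```

With weak regularization the solution stays close to the generating parameters, whereas a sufficiently large \(\alpha\) drives all parameters to zero. A line scan over \(\alpha\) would repeat such fits for a range of values and select the one with the best validation error.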
Automatic relevance determination regression (ARDR)¶
Automatic relevance determination regression (ARDR) is an optimization algorithm provided by scikit-learn that is similar to Bayesian Ridge Regression, which provides a probabilistic model of the regression problem at hand. The method is also known as Sparse Bayesian Learning and Relevance Vector Machine.
The ARDR optimizer is chosen by setting the fit_method keyword to ardr. The threshold lambda parameter, which controls the sparsity of the solution vector, is set via the threshold_lambda keyword (default: 1e6).
| Parameter | Type | Description | Default |
|---|---|---|---|
| threshold_lambda | float | controls the sparsity of the solution vector | 1e6 |
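split-Bregman¶
The split-Bregman method [GolOsh09] is designed to solve a broad class of \(\ell_1\)-regularized problems. The solution vector \(\boldsymbol{x}\) is given by
\[\boldsymbol{x} = \arg\min_{\boldsymbol{x},\boldsymbol{d}} \left\Vert\boldsymbol{d}\right\Vert_1 + \frac{1}{2} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2 + \frac{\lambda}{2} \left\Vert\boldsymbol{d} - \mu\boldsymbol{x}\right\Vert^2,\]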
where \(\boldsymbol{d}\) is an auxiliary quantity, while \(\mu\) and \(\lambda\) are hyperparameters that control the sparseness of the solution and the efficiency of the algorithm.
The split-Bregman implementation supports the following additional keywords.

| Parameter | Type | Description | Default |
|---|---|---|---|
| mu | float | sparseness parameter | 1e-3 |
| lmbda | float | weight of additional L2-norm in split-Bregman | 100 |
| n_iters | int | maximal number of split-Bregman iterations | 1000 |
| tol | float | convergence criterion iterative minimization | 1e-6 |
| verbose | bool | print additional information to stdout | False |
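To make the roles of the auxiliary quantity \(\boldsymbol{d}\) and the hyperparameters \(\mu\) and \(\lambda\) more concrete, the following is a minimal NumPy sketch of a textbook split-Bregman iteration. It is not the implementation used by this module: the closed-form update for \(\boldsymbol{x}\), the stopping criterion, the default values, and the function names `shrink` and `split_bregman` are assumptions made for illustration.

```python
import numpy as np


def shrink(v, t):
    """Soft-thresholding (shrinkage) operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def split_bregman(A, y, mu=1e-3, lmbda=100.0, n_iters=1000, tol=1e-6):
    """Sketch of a split-Bregman iteration for an l1-regularized fit."""
    n_cols = A.shape[1]
    d = np.zeros(n_cols)  # auxiliary variable
    b = np.zeros(n_cols)  # Bregman variable
    x = np.zeros(n_cols)
    # Quantities that do not change between iterations.
    lhs = A.T @ A + lmbda * mu**2 * np.eye(n_cols)
    Aty = A.T @ y
    old_norm = 0.0
    for _ in range(n_iters):
        # x-update: minimize 1/2 ||A x - y||^2 + lmbda/2 ||d - mu x - b||^2.
        x = np.linalg.solve(lhs, Aty + lmbda * mu * (d - b))
        # d-update: shrinkage with threshold 1/lmbda.
        d = shrink(mu * x + b, 1.0 / lmbda)
        # Stop once the norm of the solution no longer changes.
        new_norm = np.linalg.norm(x)
        if abs(new_norm - old_norm) < tol:
            break
        old_norm = new_norm
        # Bregman update.
        b = b + mu * x - d
    return x


# Synthetic, illustrative data.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = A @ x_true
x_sb = split_bregman(A, y)
```

For a small value of mu the regularization is weak, and the result remains close to the least-squares solution of this overdetermined example.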
Recursive feature elimination¶
Recursive feature elimination (RFE) is a feature selection algorithm that obtains the optimal features by carrying out a series of fits, starting with the full set of parameters and then iteratively eliminating the less important ones. RFE needs to be combined with a specific fit method. Since RFE may require many hundreds of single fits, it is often advisable to use ordinary least-squares as the training method, which is the default behavior. The present implementation is based on the implementation of feature selection in scikit-learn.
The RFE optimizer is chosen by setting the fit_method keyword to rfe. The n_features keyword allows one to specify the number of features to select. If this parameter is left unspecified, RFE with cross-validation will be used to determine the optimal number of features. After the optimal number of features has been determined, the final model is trained. The fit method for the final fit can be controlled via final_estimator. Here, estimator and final_estimator can be set to any of the fit methods described in this section. For example, estimator='lasso' implies that a LASSO-CV scan is carried out for each fit in the RFE algorithm.
| Parameter | Type | Description | Default |
|---|---|---|---|
| n_features | int | number of features to select | None |
| step | int or float | number of parameters to eliminate (int) or percentage of parameters to eliminate (float) | 0.04 |
| cv_splits | int | number of CV splits (90/10) used when optimizing n_features | 5 |
| estimator | str | fit method to be used in RFE algorithm | 'least-squares' |
| final_estimator | str | fit method to be used in the final fit | = estimator |
| estimator_kwargs | dict | keyword arguments for fit method defined by estimator | {} |
| final_estimator_kwargs | dict | keyword arguments for fit method defined by final_estimator | {} |
Note
When running on multi-core systems please be mindful of memory consumption. By default all CPUs will be used (n_jobs=-1), which will duplicate data and can require a lot of memory, potentially giving rise to errors. To prevent this behavior you can set the [n_jobs parameter](https://scikit-learn.org/stable/glossary.html#term-n_jobs) explicitly, which is handed over directly to scikit-learn.
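The elimination loop of RFE with ordinary least-squares as fit method can be sketched in a few lines of NumPy. This is an illustration of the general idea, not the scikit-learn or icet implementation; the function name `rfe_least_squares`, the ranking by absolute parameter value, and the handling of the step argument are assumptions of this sketch.

```python
import numpy as np


def rfe_least_squares(A, y, n_features, step=0.04):
    """Sketch of RFE with OLS: repeatedly drop the smallest parameters."""
    selected = np.arange(A.shape[1])  # indices of currently retained columns
    while len(selected) > n_features:
        x, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
        # Interpret a float step as a fraction of all parameters and an
        # int step as an absolute number (an assumption of this sketch).
        if isinstance(step, float):
            n_drop = max(1, int(step * A.shape[1]))
        else:
            n_drop = step
        n_drop = min(n_drop, len(selected) - n_features)
        # Keep the parameters with the largest magnitude.
        order = np.argsort(np.abs(x))
        selected = np.sort(selected[order[n_drop:]])
    # Final fit on the selected features only.
    x_final = np.zeros(A.shape[1])
    x_sel, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
    x_final[selected] = x_sel
    return x_final, selected


# Synthetic, illustrative data with three relevant features.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 8))
x_true = np.zeros(8)
x_true[[1, 4, 6]] = [2.0, -1.0, 3.0]
y = A @ x_true

x_rfe, features = rfe_least_squares(A, y, n_features=3)
```

When n_features is not known in advance, the same loop can be repeated for several candidate values and the one with the lowest cross-validation error retained.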
Other methods¶
The optimizers furthermore support the ridge method (ridge), the elastic net method (elasticnet) as well as Bayesian ridge regression (bayesian-ridge).
Optimizer¶

class icet.fitting.Optimizer(fit_data, fit_method='least-squares', standardize=True, train_size=0.75, test_size=None, train_set=None, test_set=None, check_condition=True, seed=42, **kwargs)¶
This optimizer finds a solution to the linear \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{y}\) problem.
One has to specify either train_size/test_size or train_set/test_set. If either train_set or test_set (or both) is specified, the fractions will be ignored.
Warning
Repeatedly setting up an Optimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple Optimizer instances.
Parameters:  fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
 standardize (bool) – if True the fit matrix is standardized before fitting
 train_size (float or int) – If float represents the fraction of fit_data (rows) to be used for training. If int, represents the absolute number of rows to be used for training.
 test_size (float or int) – If float represents the fraction of fit_data (rows) to be used for testing. If int, represents the absolute number of rows to be used for testing.
 train_set (tuple or list(int)) – indices of rows of A/y to be used for training
 test_set (tuple or list(int)) – indices of rows of A/y to be used for testing
 check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)
 seed (int) – seed for pseudo random number generator
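The interplay of train_size, seed and the resulting training and test errors can be illustrated with a NumPy-only sketch. It mimics the workflow described above (random split of the rows, fit on the training rows, RMSE on both sets) but does not use the Optimizer class itself; the helper functions `split_rows` and `rmse`, the rounding rule and the data are hypothetical.

```python
import numpy as np


def split_rows(n_rows, train_size=0.75, seed=42):
    """Randomly split row indices into a training and a test set."""
    rng = np.random.default_rng(seed)
    # A float is read as a fraction of the rows, an int as a row count.
    if isinstance(train_size, float):
        n_train = int(round(train_size * n_rows))
    else:
        n_train = train_size
    permutation = rng.permutation(n_rows)
    return np.sort(permutation[:n_train]), np.sort(permutation[n_train:])


def rmse(A, y, x):
    """Root mean squared error of the prediction A x with respect to y."""
    return np.sqrt(np.mean((A @ x - y) ** 2))


# Synthetic, illustrative fit data.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 4))
y = A @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.01 * rng.normal(size=40)

train_set, test_set = split_rows(len(y), train_size=0.75, seed=42)
x, *_ = np.linalg.lstsq(A[train_set], y[train_set], rcond=None)
rmse_train = rmse(A[train_set], y[train_set], x)
rmse_test = rmse(A[test_set], y[test_set], x)
```

Calling `split_rows` twice with the same seed returns the same split, which is why repeated runs with an unchanged seed give identical results.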

train_scatter_data
¶ target and predicted value for each row in the training set
Type: ScatterData

test_scatter_data
¶ target and predicted value for each row in the test set
Type: ScatterData

compute_rmse
(A, y)¶ Returns the root mean squared error (RMSE) using \(\boldsymbol{A}\), \(\boldsymbol{y}\), and the vector of fitted parameters \(\boldsymbol{x}\), corresponding to \(\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2\).
Return type: float

contributions_test
¶ average contribution to the predicted values for the test set from each parameter
Return type: ndarray

contributions_train
¶ average contribution to the predicted values for the train set from each parameter
Return type: ndarray

fit_method
¶ fit method
Return type: str

get_contributions
(A)¶ Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.
Parameters: A (ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Return type: ndarray

n_nonzero_parameters
¶ number of nonzero parameters
Return type: int

n_parameters
¶ number of parameters (=columns in A matrix)
Return type: int

n_target_values
¶ number of target values (=rows in A matrix)
Return type: int

predict
(A)¶ Predicts data given an input matrix \(\boldsymbol{A}\), i.e., \(\boldsymbol{A}\boldsymbol{x}\), where \(\boldsymbol{x}\) is the vector of the fitted parameters. The method returns the vector of predicted values or a float if a single row is provided as input.
Parameters: A (ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Return type: Union[ndarray, float]

rmse_test
¶ root mean squared error for test set
Return type: float

rmse_train
¶ root mean squared error for training set
Return type: float

seed
¶ seed used to initialize pseudo random number generator
Return type: int

standardize
¶ if True standardize the fit matrix before fitting
Return type: bool

summary
¶ comprehensive information about the optimizer
Return type: Dict[str, Any]

test_fraction
¶ fraction of rows included in test set
Return type: float

test_set
¶ indices of rows included in the test set
Return type: List[int]

test_size
¶ number of rows included in test set
Return type: int

train_fraction
¶ fraction of rows included in training set
Return type: float

train_set
¶ indices of rows included in the training set
Return type: List[int]

train_size
¶ number of rows included in training set
Return type: int
EnsembleOptimizer¶

class icet.fitting.EnsembleOptimizer(fit_data, fit_method='least-squares', standardize=True, ensemble_size=50, train_size=1.0, bootstrap=True, check_condition=True, seed=42, **kwargs)¶
The ensemble optimizer carries out a series of single optimization runs using the Optimizer class in order to solve the linear \(\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y}\) problem. Subsequently, it provides access to various ensemble averaged quantities such as errors and parameters.
Warning
Repeatedly setting up an EnsembleOptimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple EnsembleOptimizer instances.
Parameters:  fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
 standardize (bool) – if True the fit matrix is standardized before fitting
 ensemble_size (int) – number of fits in the ensemble
 train_size (float or int) – if float represents the fraction of fit_data (rows) to be used for training; if int, represents the absolute number of rows to be used for training
 bootstrap (bool) – if True sampling will be carried out with replacement
 check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)
 seed (int) – seed for pseudo random number generator
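The idea behind ensemble training with bootstrap sampling can be sketched with NumPy alone. The example below draws rows with replacement, fits each resampled data set by ordinary least-squares, and then averages the parameter vectors; it is an illustration under these assumptions and does not reproduce the internals of the EnsembleOptimizer class. All variable names and the data are hypothetical.

```python
import numpy as np

# Synthetic, illustrative fit data.
rng = np.random.default_rng(42)
A = rng.normal(size=(50, 3))
y = A @ np.array([1.0, 0.0, -0.5]) + 0.05 * rng.normal(size=50)

ensemble_size = 50
n_rows = len(y)
parameter_vectors = []
for _ in range(ensemble_size):
    # Bootstrap: draw rows with replacement.
    rows = rng.choice(n_rows, size=n_rows, replace=True)
    x, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
    parameter_vectors.append(x)
parameter_vectors = np.array(parameter_vectors)  # shape (ensemble_size, M)

# Final model: average over all models in the ensemble.
x_mean = parameter_vectors.mean(axis=0)

# Ensemble predictions: one column per model, spread taken over the models.
predictions = A @ parameter_vectors.T  # shape (N, ensemble_size)
prediction_std = predictions.std(axis=1)
```

Because the prediction is linear in the parameters, predicting with the averaged parameter vector is equivalent to averaging the predictions of the individual models, while the spread over the ensemble provides an estimate of the standard deviation of each prediction.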

bootstrap
¶ True if sampling is carried out with replacement
Return type: bool

compute_rmse
(A, y)¶ Returns the root mean squared error (RMSE) using \(\boldsymbol{A}\), \(\boldsymbol{y}\), and the vector of fitted parameters \(\boldsymbol{x}\), corresponding to \(\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2\).
Return type: float

ensemble_size
¶ number of train rounds
Return type: int

error_matrix
¶ matrix of fit errors where N is the number of target values and M is the number of fits (i.e., the size of the ensemble)
Return type: ndarray

fit_method
¶ fit method
Return type: str

get_contributions
(A)¶ Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.
Parameters: A (ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Return type: ndarray

n_nonzero_parameters
¶ number of nonzero parameters
Return type: int

n_parameters
¶ number of parameters (=columns in A matrix)
Return type: int

n_target_values
¶ number of target values (=rows in A matrix)
Return type: int

predict
(A, return_std=False)¶ Predicts data given an input matrix \(\boldsymbol{A}\), i.e., \(\boldsymbol{A}\boldsymbol{x}\), where \(\boldsymbol{x}\) is the vector of the fitted parameters. The method returns the vector of predicted values and optionally also the vector of standard deviations.
By using all parameter vectors in the ensemble a standard deviation of the prediction can be obtained.
Parameters:  A (ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 return_std (bool) – whether or not to return the standard deviation of the prediction

rmse_test
¶ ensemble average of root mean squared error over test sets
Return type: float

rmse_test_ensemble
¶ root mean squared test errors obtained for each fit in the ensemble
Return type: ndarray

rmse_train
¶ ensemble average of root mean squared error over train sets
Return type: float

rmse_train_ensemble
¶ root mean squared train errors obtained for each fit in the ensemble
Return type: ndarray

seed
¶ seed used to initialize pseudo random number generator
Return type: int

standardize
¶ if True standardize the fit matrix before fitting
Return type: bool

summary
¶ comprehensive information about the optimizer
Return type: Dict[str, Any]

train
()¶ Carries out ensemble training and constructs the final model by averaging over all models in the ensemble.
Return type: None

train_fraction
¶ fraction of input data used for training; this value can differ slightly from the value set during initialization due to rounding
Return type: float

train_size
¶ number of rows included in train sets; note that this will be different from the number of unique rows if bootstrapping
Return type: int
CrossValidationEstimator¶

class icet.fitting.CrossValidationEstimator(fit_data, fit_method='least-squares', standardize=True, validation_method='k-fold', n_splits=10, check_condition=True, seed=42, **kwargs)¶
This class provides an optimizer with cross-validation for solving the linear \(\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y}\) problem. Cross-validation (CV) scores are calculated by splitting the available reference data in multiple different ways. It also produces the finalized model (using the full input data) for which the CV score is an estimation of its performance.
Warning
Repeatedly setting up a CrossValidationEstimator and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple CrossValidationEstimator instances.
Parameters:  fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
 standardize (bool) – if True the fit matrix is standardized before fitting
 validation_method (str) – method to use for cross-validation; possible choices are “shuffle-split”, “k-fold”
 n_splits (int) – number of times the fit data set will be split for the cross-validation
 check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)
 seed (int) – seed for pseudo random number generator
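The k-fold procedure described above can be sketched with NumPy as follows. The rows are shuffled once and divided into n_splits folds; each fold serves as validation set exactly once, and the final model is trained on the full data. This is an illustration only and does not use the CrossValidationEstimator class; the variable names mirror the properties listed below merely for readability, and the data are synthetic.

```python
import numpy as np

# Synthetic, illustrative fit data.
rng = np.random.default_rng(42)
A = rng.normal(size=(60, 4))
y = A @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.05 * rng.normal(size=60)

n_splits = 10
# Shuffle the rows once and divide them into n_splits folds.
folds = np.array_split(rng.permutation(len(y)), n_splits)

rmse_validation_splits = []
for k in range(n_splits):
    validation_rows = folds[k]
    train_rows = np.concatenate([folds[j] for j in range(n_splits) if j != k])
    x, *_ = np.linalg.lstsq(A[train_rows], y[train_rows], rcond=None)
    residual = A[validation_rows] @ x - y[validation_rows]
    rmse_validation_splits.append(np.sqrt(np.mean(residual**2)))

# CV score: average validation error over all splits.
rmse_validation = np.mean(rmse_validation_splits)

# Final model: trained on the full set of input data.
x_final, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The CV score computed in this way serves as an estimate of the performance of the final model, which itself is never evaluated on held-out data.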

train_scatter_data
¶ contains target and predicted values from each individual training set in the cross-validation split; ScatterData is a namedtuple.
Type: ScatterData

validation_scatter_data
¶ contains target and predicted values from each individual validation set in the cross-validation split; ScatterData is a namedtuple.
Type: ScatterData

compute_rmse
(A, y)¶ Returns the root mean squared error (RMSE) using \(\boldsymbol{A}\), \(\boldsymbol{y}\), and the vector of fitted parameters \(\boldsymbol{x}\), corresponding to \(\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2\).
Return type: float

fit_method
¶ fit method
Return type: str

get_contributions
(A)¶ Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.
Parameters: A (ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Return type: ndarray

n_nonzero_parameters
¶ number of nonzero parameters
Return type: int

n_nonzero_parameters_splits
¶ number of nonzero parameters for each split
Return type: list

n_parameters
¶ number of parameters (=columns in A matrix)
Return type: int

n_splits
¶ number of splits (folds) used for cross-validation
Return type: int

n_target_values
¶ number of target values (=rows in A matrix)
Return type: int

predict
(A)¶ Predicts data given an input matrix \(\boldsymbol{A}\), i.e., \(\boldsymbol{A}\boldsymbol{x}\), where \(\boldsymbol{x}\) is the vector of the fitted parameters. The method returns the vector of predicted values or a float if a single row is provided as input.
Parameters: A (ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Return type: Union[ndarray, float]

rmse_train
¶ average root mean squared training error obtained during cross-validation
Return type: float

rmse_train_final
¶ root mean squared error when using the full set of input data
Return type: float

rmse_train_splits
¶ root mean squared training errors obtained during cross-validation
Return type: ndarray

rmse_validation
¶ average root mean squared cross-validation error
Return type: float

rmse_validation_splits
¶ root mean squared validation errors obtained during cross-validation
Return type: ndarray

seed
¶ seed used to initialize pseudo random number generator
Return type: int

standardize
¶ if True standardize the fit matrix before fitting
Return type: bool

summary
¶ comprehensive information about the optimizer
Return type: Dict[str, Any]

validation_method
¶ validation method name
Return type: str