Constructing a cluster expansion¶
In this step we will construct a cluster expansion using a dataset of Ag-Pd structures, which have been relaxed using density functional theory calculations.
General preparations¶
A number of ASE and icet functions are needed in order to set up and train the cluster expansion.
Since the reference data is provided in the form of an ASE database we require the ase.db.connect()
function.
Furthermore, the icet classes ClusterSpace
, StructureContainer
, CrossValidationEstimator
and ClusterExpansion
are used, in sequence, during preparation, compilation and training of the cluster expansion followed by the extraction of information in the form of predicted energies from the latter.
# This scripts runs in about 16 seconds on an i7-6700K CPU.
from ase.db import connect
from icet import ClusterSpace, StructureContainer, ClusterExpansion
from trainstation import CrossValidationEstimator
Then we open a connection to the reference database and use the first structure in the database as the primitive structure as we happen to have prepared the database in this way.
db = connect('reference_data.db')
primitive_structure = db.get(id=1).toatoms() # primitive structure
Preparing a cluster space¶
In order to be able to build a cluster expansion, it is first necessary to create a ClusterSpace
object based on a prototype structure.
When initiating the former, one must also provide cutoffs and the chemical elements that are allowed to occupy different lattice sites.
The cutoffs are specified in the form of a list with the different elements corresponding to clusters of increasing order (pairs, triplets, quadruplets, etc). The values then specify the longest distance allowed between any two atoms in clusters of the respective order. In the example below, the cluster space will contain all pairs of atoms that are 13.5 Å or closer to each other, all triplets among which the longest distance is 6.5 Å or less, and all quadruplets among which the longest distance is 6.0 Å or less. If higher-order clusters (quintuplets etc.) are to be included, one would simply extend the list. Note here that one should typically select cutoffs with some care based on how cluster expansions accuracy changes with cutoffs.
The allowed chemical elements are specified as a list. Two formats are
possible. If all sites in the structure are to be occupied identically, it
suffices to provide a simple list of chemical symbols, e.g., ['Ag', 'Pd']
in the case below. If there are multiple sites that are to be occupied in
different fashion, one has to provide instead a list of lists, where the outer
list must contain as many elements as there are sites in the primitive
structure and each “inner” list specifies the occupation for the respective
site. Examples for this approach can be found in the ClusterSpace
documentation.
cs = ClusterSpace(structure=primitive_structure,
cutoffs=[13.5, 6.5, 6.0],
chemical_symbols=['Ag', 'Pd'])
print(cs)
Note
Here, we include all structures from the database with six or less atoms in the unit cell. This approach has been adopted in this basic tutorial for the sake of simplicity. In general this is not the preferred approach to assembling a data set.
Tip
As with many other icet objects, it is possible to print core
information in a tabular format by simply calling the print()
function
with the instance of interest as input argument. When using jupyter notebooks
you can also use the display()
command which renders the output in a
visually more appealing format.
For the case at hand, the output of the print()
command should look as follows:
====================================== Cluster Space ======================================
space group : Fm-3m (225)
chemical species : ['Ag', 'Pd'] (sublattice A)
cutoffs : 13.5000 6.5000 6.0000
total number of parameters : 82
number of parameters by order : 0= 1 1= 1 2= 25 3= 20 4= 35
fractional_position_tolerance : 2e-06
position_tolerance : 1e-05
symprec : 1e-05
-------------------------------------------------------------------------------------------
index | order | radius | multiplicity | orbit_index | multicomponent_vector | sublattices
-------------------------------------------------------------------------------------------
0 | 0 | 0.0000 | 1 | -1 | . | .
1 | 1 | 0.0000 | 1 | 0 | [0] | A
2 | 2 | 1.4460 | 6 | 1 | [0, 0] | A-A
3 | 2 | 2.0450 | 3 | 2 | [0, 0] | A-A
4 | 2 | 2.5046 | 12 | 3 | [0, 0] | A-A
5 | 2 | 2.8921 | 6 | 4 | [0, 0] | A-A
6 | 2 | 3.2334 | 12 | 5 | [0, 0] | A-A
7 | 2 | 3.5420 | 4 | 6 | [0, 0] | A-A
8 | 2 | 3.8258 | 24 | 7 | [0, 0] | A-A
9 | 2 | 4.0900 | 3 | 8 | [0, 0] | A-A
...
72 | 4 | 2.7940 | 12 | 71 | [0, 0, 0, 0] | A-A-A-A
73 | 4 | 2.7940 | 24 | 72 | [0, 0, 0, 0] | A-A-A-A
74 | 4 | 2.8281 | 24 | 73 | [0, 0, 0, 0] | A-A-A-A
75 | 4 | 2.8921 | 3 | 74 | [0, 0, 0, 0] | A-A-A-A
76 | 4 | 2.8921 | 6 | 75 | [0, 0, 0, 0] | A-A-A-A
77 | 4 | 2.8921 | 12 | 76 | [0, 0, 0, 0] | A-A-A-A
78 | 4 | 2.9632 | 24 | 77 | [0, 0, 0, 0] | A-A-A-A
79 | 4 | 2.9862 | 8 | 78 | [0, 0, 0, 0] | A-A-A-A
80 | 4 | 3.0951 | 24 | 79 | [0, 0, 0, 0] | A-A-A-A
81 | 4 | 3.5420 | 2 | 80 | [0, 0, 0, 0] | A-A-A-A
===========================================================================================
Note
The radius
is not the same as the longest distance between atoms in
the cluster (which was the measure used for initialization via cutoffs
),
but is the average distance from all atoms to the geometric center of mass (assuming
all atoms have the same mass).
Compiling a structure container¶
Once a ClusterSpace
has been prepared, the next
step is to compile a StructureContainer
. To
this end, we first initialize an empty StructureContainer
and then add the structures from the database
prepared previously including for each structure the mixing energy in the
property dictionary.
sc = StructureContainer(cluster_space=cs)
for row in db.select():
sc.add_structure(structure=row.toatoms(),
user_tag=row.tag,
properties={'mixing_energy': row.mixing_energy})
print(sc)
Note
While here we add only one property, the StructureContainer
allows the addition of several properties. This
can be useful e.g., when constructing (and sampling) CEs for a couple of
different properties.
By calling the print()
function with the StructureContainer
as input argument, one obtains the following
result:
====================== Structure Container ======================
Total number of structures: 625
-----------------------------------------------------------------
index | user_tag | n_atoms | chemical formula | mixing_energy
-----------------------------------------------------------------
0 | Ag | 1 | Ag | 0.0000
1 | Pd | 1 | Pd | 0.0000
2 | AgPd_0002 | 2 | AgPd | -0.0398
3 | AgPd_0003 | 3 | AgPd2 | -0.0286
4 | AgPd_0004 | 3 | Ag2Pd | -0.0485
5 | AgPd_0005 | 3 | AgPd2 | -0.0178
6 | AgPd_0006 | 3 | Ag2Pd | -0.0557
7 | AgPd_0007 | 3 | AgPd2 | -0.0297
8 | AgPd_0008 | 3 | Ag2Pd | -0.0477
9 | AgPd_0009 | 4 | AgPd3 | -0.0173
10 | AgPd_0010 | 4 | Ag3Pd | -0.0356
11 | AgPd_0011 | 4 | Ag2Pd2 | -0.0331
12 | AgPd_0012 | 4 | AgPd3 | -0.0139
13 | AgPd_0013 | 4 | Ag3Pd | -0.0459
14 | AgPd_0014 | 4 | Ag2Pd2 | -0.0478
15 | AgPd_0015 | 4 | AgPd3 | -0.0201
16 | AgPd_0016 | 4 | Ag3Pd | -0.0465
17 | AgPd_0017 | 4 | Ag2Pd2 | -0.0498
18 | AgPd_0018 | 4 | AgPd3 | -0.0162
19 | AgPd_0019 | 4 | Ag3Pd | -0.0369
20 | AgPd_0020 | 4 | Ag2Pd2 | -0.0428
21 | AgPd_0021 | 4 | AgPd3 | -0.0151
22 | AgPd_0022 | 4 | Ag3Pd | -0.0527
23 | AgPd_0023 | 4 | Ag2Pd2 | -0.0520
24 | AgPd_0024 | 4 | AgPd3 | -0.0292
25 | AgPd_0025 | 4 | Ag3Pd | -0.0470
26 | AgPd_0026 | 4 | AgPd3 | -0.0238
27 | AgPd_0027 | 4 | Ag3Pd | -0.0526
28 | AgPd_0028 | 5 | AgPd4 | -0.0152
29 | AgPd_0029 | 5 | Ag2Pd3 | -0.0180
30 | AgPd_0030 | 5 | Ag2Pd3 | -0.0417
31 | AgPd_0031 | 5 | Ag4Pd | -0.0283
32 | AgPd_0032 | 5 | Ag3Pd2 | -0.0236
33 | AgPd_0033 | 5 | Ag3Pd2 | -0.0547
34 | AgPd_0034 | 5 | AgPd4 | -0.0089
35 | AgPd_0035 | 5 | Ag2Pd3 | -0.0324
36 | AgPd_0036 | 5 | Ag2Pd3 | -0.0341
37 | AgPd_0037 | 5 | Ag4Pd | -0.0362
38 | AgPd_0038 | 5 | Ag3Pd2 | -0.0471
39 | AgPd_0039 | 5 | Ag3Pd2 | -0.0527
40 | AgPd_0040 | 5 | AgPd4 | -0.0128
41 | AgPd_0041 | 5 | Ag2Pd3 | -0.0379
42 | AgPd_0042 | 5 | Ag2Pd3 | -0.0403
43 | AgPd_0043 | 5 | Ag4Pd | -0.0396
44 | AgPd_0044 | 5 | Ag3Pd2 | -0.0603
45 | AgPd_0045 | 5 | Ag3Pd2 | -0.0546
46 | AgPd_0046 | 5 | AgPd4 | -0.0135
47 | AgPd_0047 | 5 | Ag2Pd3 | -0.0281
48 | AgPd_0048 | 5 | Ag2Pd3 | -0.0345
...
624 | AgPd_0505 | 8 | Ag2Pd6 | -0.0165
=================================================================
Training CE parameters¶
Since the StructureContainer
object created
in the previous section, contains all the information required for
constructing a cluster expansion, the next step is to train the parameters,
i.e. to fit the effective cluster interactions (ECIs) using the
target mixing energies. More precisely, the goal is to achieve the best possible
agreement with a set of training structures, which represent a subset of all
the structures in the StructureContainer
. In practice, this is a two step process
that involves the initiation of an optimizer object (here a
CrossValidationEstimator
)
with a list of target properties produced by the get_fit_data()
method of the
StructureContainer
as input argument.
opt = CrossValidationEstimator(fit_data=sc.get_fit_data(key='mixing_energy'), fit_method='ardr')
opt.validate()
opt.train()
print(opt)
The CrossValidationEstimator
optimizer used here is intended to provide a reliable estimate for the cross validation score.
This is achieved by calling the validate
method.
With the default settings used here the CrossValidationEstimator
then randomly splits the available data into a training and a validation set.
Here, the effective cluster interactions (ECIs) are obtained using the
LASSO method (fit_method
).
This procedure is repeated 10 times (the default value for number_of_splits
).
Note
The optimized parameters returned by the optimizer are actually not the ECIs but the ECIs times the multiplicity of the respective orbit. The distinction is handled internally but it is something to be aware of when inspecting the parameters directly.
Note
You will likely see a few “Convergence errors” when executing this command. For the present purpose these can be ignored. They arise since the internal scikit-learn optimization loops have exceeded the number of iterations without reaching the default target accuracy. It is possible to extend the iterations but this increases the run time without a marked improvement in the quality of the cluster expansion.
The “final” CE is ultimately constructed using all available data by calling the train
method.
Once it is finished, the results can be displayed by providing the CrossValidationEstimator
object to the print()
function, which gives the output shown below:
============== CrossValidationEstimator ==============
seed : 42
fit_method : ardr
standardize : True
n_target_values : 625
n_parameters : 82
n_nonzero_parameters : 44
parameters_norm : 0.07314093
target_values_std : 0.01486413
rmse_train : 0.001632753
R2_train : 0.988
AIC : -7937.285
BIC : -7742.024
validation_method : k-fold
n_splits : 10
rmse_validation : 0.001889973
R2_validation : 0.9829637
shuffle : True
======================================================
We have thus constructed a CE with an average root mean square error (RMSE,
rmse_validation
) for the validation set of only 1.8 meV/atom. The original
cluster space included 82 parameters (number_of_parameters
), 44 of which
are non-zero (number_of_nonzero_parameters
) in the final CE. The efficiency
of the ARDR method for finding sparse solutions is evident from the
number of non-zero parameters (44) being much smaller than the total number of
parameters (82). The performance and application area of different optimization
algorithms are analyzed and compared in the advanced tutorial section.
Note
Here we have used the CrossValidationEstimator
from the trainstation package.
More information can be found in the documentation of this package.
Finalizing the cluster expansion¶
At this point, the task of constructing the cluster expansion is almost
complete. The only step that remains is to tie the parameter values obtained
from the optimization to the cluster space. This is achieved through the
initiation of a ClusterExpansion
object using
the previously created ClusterSpace
instance and
the list of parameters, available via the parameters
attribute of the optimizer, as input arguments.
ce = ClusterExpansion(cluster_space=cs, parameters=opt.parameters, metadata=opt.summary)
print(ce)
ce.write('mixing_energy.ce')
Information regarding the parameters and associated cluster space
can be displayed by using the print()
function with the
ClusterExpansion
object as input argument:
================================================ Cluster Expansion ================================================
space group : Fm-3m (225)
chemical species : ['Ag', 'Pd'] (sublattice A)
cutoffs : 13.5000 6.5000 6.0000
total number of parameters : 82
number of parameters by order : 0= 1 1= 1 2= 25 3= 20 4= 35
fractional_position_tolerance : 2e-06
position_tolerance : 1e-05
symprec : 1e-05
total number of nonzero parameters : 44
number of nonzero parameters by order : 0= 1 1= 1 2= 14 3= 13 4= 15
-------------------------------------------------------------------------------------------------------------------
index | order | radius | multiplicity | orbit_index | multicomponent_vector | sublattices | parameter | ECI
-------------------------------------------------------------------------------------------------------------------
0 | 0 | 0.0000 | 1 | -1 | . | . | -0.046 | -0.046
1 | 1 | 0.0000 | 1 | 0 | [0] | A | -0.0366 | -0.0366
2 | 2 | 1.4460 | 6 | 1 | [0, 0] | A-A | 0.0306 | 0.0051
3 | 2 | 2.0450 | 3 | 2 | [0, 0] | A-A | 0.0141 | 0.0047
4 | 2 | 2.5046 | 12 | 3 | [0, 0] | A-A | 0.0206 | 0.00171
5 | 2 | 2.8921 | 6 | 4 | [0, 0] | A-A | 0.00163 | 0.000272
6 | 2 | 3.2334 | 12 | 5 | [0, 0] | A-A | 0 | 0
7 | 2 | 3.5420 | 4 | 6 | [0, 0] | A-A | 0.00328 | 0.000819
8 | 2 | 3.8258 | 24 | 7 | [0, 0] | A-A | 0 | 0
9 | 2 | 4.0900 | 3 | 8 | [0, 0] | A-A | 0.000574 | 0.000191
...
72 | 4 | 2.7940 | 12 | 71 | [0, 0, 0, 0] | A-A-A-A | 0.000911 | 7.59e-05
73 | 4 | 2.7940 | 24 | 72 | [0, 0, 0, 0] | A-A-A-A | 0 | 0
74 | 4 | 2.8281 | 24 | 73 | [0, 0, 0, 0] | A-A-A-A | -0.00278 | -0.000116
75 | 4 | 2.8921 | 3 | 74 | [0, 0, 0, 0] | A-A-A-A | 0 | 0
76 | 4 | 2.8921 | 6 | 75 | [0, 0, 0, 0] | A-A-A-A | -0.00067 | -0.000112
77 | 4 | 2.8921 | 12 | 76 | [0, 0, 0, 0] | A-A-A-A | -0.00054 | -4.5e-05
78 | 4 | 2.9632 | 24 | 77 | [0, 0, 0, 0] | A-A-A-A | 0 | 0
79 | 4 | 2.9862 | 8 | 78 | [0, 0, 0, 0] | A-A-A-A | 0.00121 | 0.000151
80 | 4 | 3.0951 | 24 | 79 | [0, 0, 0, 0] | A-A-A-A | 0 | 0
81 | 4 | 3.5420 | 2 | 80 | [0, 0, 0, 0] | A-A-A-A | -0.000936 | -0.000468
===================================================================================================================
Note that in the table above the parameters obtained from the optimizer and the ECIs are shown separately, with the multiplication factor being the multiplicity of the respective orbit.
Finally, the CE is written to file in order to be reused in the following steps of the tutorial.
Source code¶
The complete source code is available in
examples/tutorial/1_construct_cluster_expansion.py
# This scripts runs in about 16 seconds on an i7-6700K CPU.
from ase.db import connect
from icet import ClusterSpace, StructureContainer, ClusterExpansion
from trainstation import CrossValidationEstimator
# step 1: Basic setup
db = connect('reference_data.db')
primitive_structure = db.get(id=1).toatoms() # primitive structure
# step 2: Set up the basic structure and a cluster space
cs = ClusterSpace(structure=primitive_structure,
cutoffs=[13.5, 6.5, 6.0],
chemical_symbols=['Ag', 'Pd'])
print(cs)
# step 3: Parse the input structures and set up a structure container
sc = StructureContainer(cluster_space=cs)
for row in db.select():
sc.add_structure(structure=row.toatoms(),
user_tag=row.tag,
properties={'mixing_energy': row.mixing_energy})
print(sc)
# step 4: Train parameters
opt = CrossValidationEstimator(fit_data=sc.get_fit_data(key='mixing_energy'), fit_method='ardr')
opt.validate()
opt.train()
print(opt)
# step 5: Construct cluster expansion and write it to file
ce = ClusterExpansion(cluster_space=cs, parameters=opt.parameters, metadata=opt.summary)
print(ce)
ce.write('mixing_energy.ce')