# Generating training structures

Here, we introduce one way of generating a pool of structures that can be used for training. There are of course other ways of doing this, for example the structure enumeration introduced above. The approach described here is, however, robust and works even for systems with large primitive cells, for which, e.g., enumeration will fail.

The idea behind this structure generation scheme is to select, from a large pool of randomized structures, a small subset with a fixed number of structures that minimizes the condition number of the fit matrix. To select this subset, a Markov chain Monte Carlo (MCMC) scheme is employed, in which structures are swapped in and out of the training set using the change in the metric as the acceptance criterion. The effective temperature in the MC simulation is lowered in a simulated annealing fashion in order to approach the optimal subset of structures. This structure selection approach can, e.g., help with the large condition number warnings that can appear during the training procedure.
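The selection loop described above can be sketched with plain numpy. This is only an illustration and not the icet implementation: the function name `select_rows_by_annealing`, the exponential cooling schedule, the temperature values, and the random stand-in for the pool of cluster vectors are all assumptions made for this example.

```python
import numpy as np


def select_rows_by_annealing(pool, n_select, n_steps=2000,
                             t_start=5.0, t_stop=1e-3, seed=0):
    """Toy simulated annealing: pick n_select rows with a small condition number."""
    rng = np.random.default_rng(seed)
    n_pool = pool.shape[0]
    # Start from a random subset of the pool.
    selected = list(rng.choice(n_pool, size=n_select, replace=False))
    metric = np.linalg.cond(pool[selected])
    trajectory = [metric]
    best_selected, best_metric = list(selected), metric

    for step in range(n_steps):
        # Exponentially decaying temperature (an assumed cooling schedule).
        temperature = t_start * (t_stop / t_start) ** (step / n_steps)
        # Trial move: swap one selected row for one currently unselected row.
        unselected = np.setdiff1d(np.arange(n_pool), selected)
        trial = list(selected)
        trial[rng.integers(n_select)] = rng.choice(unselected)
        trial_metric = np.linalg.cond(pool[trial])
        # Metropolis criterion based on the change in the metric.
        delta = trial_metric - metric
        if delta <= 0 or rng.random() < np.exp(-delta / temperature):
            selected, metric = trial, trial_metric
            trajectory.append(metric)
            if metric < best_metric:
                best_selected, best_metric = list(selected), metric
    return best_selected, best_metric, trajectory


# Random stand-in for a pool of 200 cluster vectors with 6 parameters each.
rng = np.random.default_rng(1)
pool = rng.uniform(-1.0, 1.0, size=(200, 6))
indices, final_metric, traj = select_rows_by_annealing(pool, n_select=12)
print(f'random start: {traj[0]:.2f}, after annealing: {final_metric:.2f}')
```

The first entry of the trajectory corresponds to a purely random selection, so comparing it with the final value gives a feel for what the annealing gains. The sketch returns the best subset encountered, which is a design choice of this example.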

This algorithm will fail, however, if you have linearly dependent parameters, for example due to a concentration restriction that makes one atomic type depend on another, since the condition number is always infinite for a system with linearly dependent samples. This can easily be fixed by merging the two linearly dependent orbits, see `merge_orbits`.
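To illustrate why linear dependence is a problem, the following numpy sketch builds a small hypothetical fit matrix in which one column is constant because the concentration is fixed. Summing the two dependent columns is used here only as a loose analogy for merging orbits; it is not what `merge_orbits` does internally, and the column names are made up for this example.

```python
import numpy as np

# Hypothetical fit matrix: column 0 is the constant term, column 1 is a
# singlet-like term that is identical in every structure because the
# concentration is fixed, and column 2 varies freely between structures.
rng = np.random.default_rng(0)
n_structures = 10
constant = np.ones(n_structures)
singlet = np.full(n_structures, 0.5)
pair = rng.uniform(-1.0, 1.0, size=n_structures)
fit_matrix = np.column_stack([constant, singlet, pair])

# Columns 0 and 1 are proportional, so the matrix is rank deficient.
rank = np.linalg.matrix_rank(fit_matrix)
cond_dependent = np.linalg.cond(fit_matrix)

# Combine the two dependent columns into a single column (here by summing).
merged_matrix = np.column_stack([constant + singlet, pair])
cond_merged = np.linalg.cond(merged_matrix)

print(f'rank: {rank} of {fit_matrix.shape[1]} columns')
print(f'condition number before merging: {cond_dependent:.3e}')
print(f'condition number after merging:  {cond_merged:.3f}')
```

In floating point arithmetic the condition number of the dependent matrix comes out either as infinity or as a very large finite number, so no selection of structures can improve it until the dependence is removed.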

## Import modules

The `structure_selection_annealing` and `occupy_structure_randomly` functions and the `ClusterSpace` class need to be imported together with some additional functions from ase and numpy.

```
import numpy as np
from ase.build import bulk
from icet import ClusterSpace
from icet.tools.structure_generation import occupy_structure_randomly
from icet.tools.training_set_generation import structure_selection_annealing
```

## Set up the cluster space

First, you need to select cutoffs. This can be difficult if you do not have any prior knowledge of the system. However, if you realize after the first training iteration that the cutoffs were not sufficiently long, you can easily append more structures by repeating the process with longer cutoffs.

```
primitive_structure = bulk('Au', 'fcc', 4.0)
cutoffs = [10.0, 6.0, 4.0]
subelements = ['Au', 'Pd']
cluster_space = ClusterSpace(primitive_structure, cutoffs, subelements)
```

## Generate a pool of random structures

Next, we generate a pool of random structures to be used in the annealing process. Random structures can be generated in many different ways, e.g., with respect to which supercells to consider or which concentration distribution to draw from. Here, we simply repeat the primitive face-centered cubic (FCC) cell using three random integers, chosen such that the number of atoms in the resulting cell is smaller than 50. Next, a concentration is drawn randomly from a uniform distribution between 0 and 1, and the supercell is randomly occupied in accordance with that concentration.

```
n_random_structures = 10000
max_repeat = 8
max_atoms = 50

structures = []
for _ in range(n_random_structures):
    # Create random supercell.
    supercell = get_random_supercell_size(max_repeat, max_atoms, len(primitive_structure))
    structure = primitive_structure.repeat(supercell)

    # Randomize concentrations in the supercell
    n_atoms = len(structure)
    n_Au = np.random.randint(0, n_atoms)
    n_Pd = n_atoms - n_Au
    concentration = {'Au': n_Au / n_atoms, 'Pd': n_Pd / n_atoms}

    # Occupy the structure randomly and store it.
    occupy_structure_randomly(structure, cluster_space, concentration)
    structures.append(structure)
```
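One detail of the loop above is worth knowing: the upper bound of `np.random.randint(low, high)` is exclusive, so `n_Au` never equals `n_atoms` and the pure Au end of the concentration range is never drawn. Whether this matters depends on your application. The following sketch, using a hypothetical cell with four atoms, shows the difference between the two bounds.

```python
import numpy as np

np.random.seed(0)
n_atoms = 4  # hypothetical cell size, chosen for illustration only

# The upper bound is exclusive: values range over 0, ..., n_atoms - 1.
exclusive = np.random.randint(0, n_atoms, size=1000)

# Passing n_atoms + 1 also allows n_Au == n_atoms (pure Au).
inclusive = np.random.randint(0, n_atoms + 1, size=1000)

print(sorted(set(exclusive.tolist())))
print(sorted(set(inclusive.tolist())))
```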

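Note that there are many ways to generate randomized structures, and which methods work well depends on your system. Here, we use a very simple approach for choosing the supercell size, implemented in the helper function `get_random_supercell_size`, which has to be defined before the loop above is executed.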

```
def get_random_supercell_size(max_repeat, max_atoms, n_atoms_in_prim):
    while True:
        nx, ny, nz = np.random.randint(1, max_repeat + 1, size=3)
        if nx * ny * nz * n_atoms_in_prim < max_atoms:
            break
    return nx, ny, nz
```
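The helper above is a rejection sampler: it keeps drawing triples of repetitions until the resulting cell is small enough. A quick way to check what it produces is to draw many sizes and inspect the resulting atom counts. The sketch below repeats the helper so that it runs on its own and assumes one atom per primitive cell, as in the FCC example.

```python
import numpy as np


def get_random_supercell_size(max_repeat, max_atoms, n_atoms_in_prim):
    """Draw (nx, ny, nz) until the cell has fewer than max_atoms atoms."""
    while True:
        nx, ny, nz = np.random.randint(1, max_repeat + 1, size=3)
        if nx * ny * nz * n_atoms_in_prim < max_atoms:
            break
    return nx, ny, nz


np.random.seed(42)
sizes = [get_random_supercell_size(8, 50, 1) for _ in range(1000)]
n_atoms = np.array([nx * ny * nz for nx, ny, nz in sizes])

print(f'smallest cell: {n_atoms.min()} atoms, largest cell: {n_atoms.max()} atoms')
print(f'mean cell size: {n_atoms.mean():.1f} atoms')
```

Since accepted triples are drawn uniformly, the distribution of atom counts need not be uniform. If your system calls for a different balance between small and large cells, this is the place to change it.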

## Running the structure annealing

We are now going to run the annealing procedure.

```
# We want to add 2 times the number of parameters structures and we want the
# annealing to run for 1e4 steps.
n_structures_to_add = 2 * len(cluster_space)
n_steps = 10000

# Start the annealing procedure to minimize the condition number of the fit matrix.
indices, traj = structure_selection_annealing(cluster_space, structures,
                                              n_structures_to_add, n_steps)
```

The selected training structures can then be collected as follows:

```
training_structures = [structures[ind] for ind in indices]
```

It is instructive to plot the condition number as a function of the accepted trial steps in order to see the benefit of the MCMC selection approach compared to selecting structures at random. This is particularly clear for a small number of structures in the training set, for which randomized selection yields a condition number of about 1e16 (i.e., an ill-conditioned linear problem), whereas the MCMC selection generates a reasonable condition number.

We can also compare with the condition number for a random selection of structures. This is taken as the first entry of the metric trajectory, since the structures are drawn randomly from the structure pool if no starting indices are given:

```
==================== Condition number for annealing and random structures ====================
annealing structures: 31.17173090869272
random structures: 226.45827088170168
```
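To put the two numbers into perspective, one can compute their ratio. The values below are copied from the output above; the variable names are chosen for this illustration only.

```python
# Values copied from the printed output above.
cond_annealing = 31.17173090869272
cond_random = 226.45827088170168

# Ratio between the random and the annealed condition number.
ratio = cond_random / cond_annealing
print(f'The annealed set is better conditioned by a factor of {ratio:.1f}')
```

In this particular run the annealed selection is thus better conditioned by roughly a factor of seven. The exact numbers will differ between runs since both the pool and the annealing are stochastic.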

Figure: condition number for random structures and for structures generated by the annealing approach, as a function of the number of training structures.

## Source code

The complete source code is available in `examples/training_set_generation.py`:

```
"""
This example demonstrates how one can generate training structures.
"""

# Import modules
import numpy as np
from ase.build import bulk
from icet import ClusterSpace
from icet.tools.structure_generation import occupy_structure_randomly
from icet.tools.training_set_generation import structure_selection_annealing

# For plotting
import matplotlib.pyplot as plt


# Convenience function for supercell size generation
def get_random_supercell_size(max_repeat, max_atoms, n_atoms_in_prim):
    while True:
        nx, ny, nz = np.random.randint(1, max_repeat + 1, size=3)
        if nx * ny * nz * n_atoms_in_prim < max_atoms:
            break
    return nx, ny, nz


# Create the primitive structure and cluster space.
# The possible occupations are Au and Pd
primitive_structure = bulk('Au', 'fcc', 4.0)
cutoffs = [10.0, 6.0, 4.0]
subelements = ['Au', 'Pd']
cluster_space = ClusterSpace(primitive_structure, cutoffs, subelements)

# Create a random structure pool
n_random_structures = 10000
max_repeat = 8
max_atoms = 50

structures = []
for _ in range(n_random_structures):
    # Create random supercell.
    supercell = get_random_supercell_size(max_repeat, max_atoms, len(primitive_structure))
    structure = primitive_structure.repeat(supercell)

    # Randomize concentrations in the supercell
    n_atoms = len(structure)
    n_Au = np.random.randint(0, n_atoms)
    n_Pd = n_atoms - n_Au
    concentration = {'Au': n_Au / n_atoms, 'Pd': n_Pd / n_atoms}

    # Occupy the structure randomly and store it.
    occupy_structure_randomly(structure, cluster_space, concentration)
    structures.append(structure)

# We want to add 2 times the number of parameters structures and we want the annealing to run
# for 1e4 steps.
n_structures_to_add = 2 * len(cluster_space)
n_steps = 10000

# Start the annealing procedure to minimize the condition number of
# the fit matrix.
indices, traj = structure_selection_annealing(cluster_space, structures, n_structures_to_add,
                                              n_steps)
condition_number_annealing = traj[-1]

# Since we start with random structures from the pool
# this represents choosing structures at random
condition_number_random_structures = traj[0]

# Collect the structures that were found to be good training structure candidates.
training_structures = [structures[ind] for ind in indices]

# Plot the metric vs accepted trials.
fig, ax = plt.subplots()
ax.plot(traj)
ax.set_xlabel('Accepted trials')
ax.set_ylabel('Condition number')
fig.tight_layout()
fig.savefig('training_set_generation_cond_traj.svg')

print('='*20 + ' Condition number for annealing and random structures ' + '='*20 + '\n'
      f'annealing structures: {condition_number_annealing}\n'
      f'random structures: {condition_number_random_structures}')
```

You can use this method even if you already have a set of training structures, e.g., if you carry out the procedure iteratively. The source code for this case is available in `examples/training_set_generation_with_base.py`:

```
"""
This example demonstrates how one can generate training structures when a
base of structures is already available.
"""

# Import modules
import numpy as np
from ase.build import bulk
from icet import ClusterSpace
from icet.tools.structure_generation import occupy_structure_randomly
from icet.tools.training_set_generation import structure_selection_annealing


# Convenience function for supercell size generation
def get_random_supercell_size(max_repeat, max_atoms, n_atoms_in_prim):
    while True:
        nx, ny, nz = np.random.randint(1, max_repeat + 1, size=3)
        if nx * ny * nz * n_atoms_in_prim < max_atoms:
            break
    return nx, ny, nz


# Create the primitive structure and cluster space.
# The possible occupations are Au and Pd
primitive_structure = bulk('Au', 'fcc', 4.0)
subelements = ['Au', 'Pd']
cutoffs = [10.0, 6.0, 4.0]
cluster_space = ClusterSpace(primitive_structure, cutoffs, subelements)

# Create a random structure pool
n_random_structures = 10000
max_repeat = 8
max_atoms = 50

structures = []
for _ in range(n_random_structures):
    # Create random supercell.
    supercell = get_random_supercell_size(max_repeat, max_atoms, len(primitive_structure))
    structure = primitive_structure.repeat(supercell)

    # Randomize concentrations in the supercell
    n_atoms = len(structure)
    n_Au = np.random.randint(0, n_atoms)
    n_Pd = n_atoms - n_Au
    concentration = {'Au': n_Au / n_atoms, 'Pd': n_Pd / n_atoms}

    # Occupy the structure randomly and store it.
    occupy_structure_randomly(structure, cluster_space, concentration)
    structures.append(structure)

# We take the first 5 randomly generated structures above and assume they
# were the base structures that we already have done calculations for.
base_structures = structures[0:5]

# We want to add 2 times the number of parameters structures and we want the annealing to run
# for 1e4 steps.
n_structures_to_add = 2 * len(cluster_space)
n_steps = 10000

# Start the annealing procedure to minimize the condition number of the fit matrix,
# the base_structures are always included.
indices, traj = structure_selection_annealing(cluster_space, structures[5:], n_structures_to_add,
                                              n_steps, base_structures=base_structures)
condition_number_base_structures = traj[-1]

# Collect the extra structures
training_structures_extra = [structures[ind + 5] for ind in indices]
```
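The offset `ind + 5` in the last line of the example deserves a short explanation. Assuming, as that line implies, that the returned indices refer to positions in the list that was passed in (here `structures[5:]`), they have to be shifted by the number of base structures to index the full list. The sketch below illustrates this with plain strings instead of structures; the index values are hypothetical.

```python
# Stand-in for the structure list: plain labels instead of ase Atoms objects.
structures = [f'structure_{i}' for i in range(12)]
n_base = 5
base_structures = structures[:n_base]
pool = structures[n_base:]

# Hypothetical indices, as they might be returned for the sliced pool.
indices = [0, 3, 6]

# Indexing the pool directly and indexing the full list with an offset agree.
from_pool = [pool[ind] for ind in indices]
from_full_list = [structures[ind + n_base] for ind in indices]
print(from_pool)
```

Keeping the sliced pool in its own variable, as done here with `pool`, avoids having to remember the offset later on.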