# Cluster vector correlations¶

The purpose of this example is to demonstrate how to test the correlation between cluster vectors.

## Import modules¶

In the present case it is necessary to import two **icet** classes,
namely `ClusterSpace`

and
`Structure`

. The corresponding
objects are used to store information regarding a specific cluster space and
structure, respectively. Additionally, the ASE function
ase.db.connect
will be used to extract structures from a previously prepared database.

```
import random
import numpy as np
from ase.db import connect
from icet import ClusterSpace
```

## Function: Generate random structure¶

Since the intention is to calculate the correlation between the cluster vectors
for a number of structures in a database, first several functions are defined
that will be called in the loop over strucures described in the last section.
The first function, `generate_random_structure()`

, takes an ```
ASE
Atoms
```

object, `atoms_prim`

, creates a supercell with the dimensions
specified by `repeat`

and populates it randomly with species from the
`subelements`

list. In the end, the corresponding
`ASE Atoms`

object is returned.

```
def generate_random_structure(atoms_prim, subelements, repeat=8):
"""
Generate a random structure with atoms_prim as a base
and fill it randomly with symbols from subelements
"""
atoms = atoms_prim.copy().repeat(repeat)
for at in atoms:
element = random.choice(subelements)
at.symbol = element
return atoms
```

## Function: Generate CV set¶

The second function, `generate_cv_set()`

, generates cluster vectors for `n`

supercells, which have been constructed from the `ASE Atoms`

object,
`atoms_prim`

and randomly populated with the elements in the `subelements`

list, based on the `ClusterSpace`

object `cluster_space`

. Specifically,
this is achieved by first calling the `generate_random_structure()`

function,
defined in the previous section, with the `ASE Atoms`

object,
`atoms_prim`

, the `subelements`

list and size of the supercell,
`repeat`

, as input arguments. The resulting configuration is subsequently
converted to a `Structure`

object, using
the `Structure.from_atoms()`

method. This structure is, thereafter,
provided as an input argument to the `get_cluster_vector`

method to extract the cluster vectors.

```
def generate_cv_set(n, atoms_prim, subelements, clusterspace, repeat=8):
"""
Generate a set of cluster vectors from a cluster space
"""
clustervectors = []
for i in range(n):
conf = generate_random_structure(atoms_prim, subelements, repeat)
cv = clusterspace.get_cluster_vector(conf)
clustervectors.append(cv)
return clustervectors
```

## Function: Get column correlation¶

The third function, `get_column_correlation()`

, calculates the correlation
between the columns `i`

and `j`

in the matrix `cv_matrix`

, which should
be a `numpy.array`

object.

```
def get_column_correlation(i, j, cv_matrix):
"""
Returns the correlation between column i and j
cv_matrix: numpy matrix
"""
col_i = cv_matrix[:, i]
col_j = cv_matrix[:, j]
corr = np.dot(col_i, col_j) / \
(np.linalg.norm(col_i) * np.linalg.norm(col_j))
return corr
```

## Function: Check correlation¶

The fourth function, `assert_no_correlation()`

, checks that the correlations
between the columns in the matrix `cvs`

are smaller than the tolerance,
`tol`

. This is achieved by calculating the correlations between all pairs of
columns with help of the `get_column_correlation()`

, which was defined
earlier. If this value is not smaller than the tolerance, the number of the
columns and the correlation is printed to the standard output.

```
def assert_no_correlation(cvs, tol=0.99):
"""
Check that no column in cvs are above tolerance
"""
cvs_matrix = np.array(cvs)
for i in range(len(cvs[0])):
# Do not loop over zerolet since this is always ones
if i == 0:
continue
for j in range(len(cvs[0])):
if j <= i:
continue
corr = get_column_correlation(i, j, cvs_matrix)
assert corr < tol, 'columns {} and {} were correlated with'\
' {}'.format(i, j, corr)
```

## Setup¶

Before starting the test a list of the relevant subelements, namely Pd, H, and V, of which the latter is used for representing vacancies, is defined. Also, the cutoff for pairs is set to 2.0 Å and it is decided that the cluster vectors be calculated for \(8\times8\times8\) supercells.

```
subelements = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8
```

## Test the correlation¶

The functions defined above are now employed to test the cluster vector
correlations for the first ten structures in the database `PdHVac-fcc.db`

,
which should already have been created using the script
`enumerate_structures.py`

. At each step of the loop, some basic information
regarding the structure is first extracted and printed. Thereafter, a
`ClusterSpace`

object,
`cluster_space`

, is initiated using the `ASE Atoms`

object
`atoms_row`

as well as the `cutoffs`

list and the `subelements`

list, specified in the previous section, as input arguments.
Subsequently the previously defined function is called with `n=20`

and
`atoms_row`

, `subelements`

and `cluster_space`

as additional input
arguments. The list of cluster vectors thus obtained is then fed into
the `assert_no_correlation()`

function, that was described earlier, to check
if correlations between the columns are smaller than the tolerance specified
previously.

```
db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
atoms_row = row.toatoms()
atoms_id = row.id
atoms_form = row.formula
print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
atoms_id,
cutoffs))
atoms_row.wrap() # Wrap all atoms into the unit cell
cluster_space = ClusterSpace(atoms_row, cutoffs, subelements)
cvs = generate_cv_set(20, atoms_row, subelements, cluster_space, repeat)
assert_no_correlation(cvs)
print('Number of atoms: {} Length of cluster vector: {}'.format(
len(atoms_row.repeat(repeat)), len(cvs[0])))
```

The result of running the script should be the following:

```
Testing structure: PdH (id=1) with cutoffs [2.0]
Number of atoms: 1024 Length of cluster vector: 6
Testing structure: PdV (id=2) with cutoffs [2.0]
Number of atoms: 1024 Length of cluster vector: 6
Testing structure: Pd2VH (id=3) with cutoffs [2.0]
Number of atoms: 2048 Length of cluster vector: 6
Testing structure: Pd2VH (id=4) with cutoffs [2.0]
Number of atoms: 2048 Length of cluster vector: 6
Testing structure: Pd3VH2 (id=5) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3V2H (id=6) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3VH2 (id=7) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3V2H (id=8) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3VH2 (id=9) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3V2H (id=10) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
```

## Source code¶

`examples/test_cluster_vector_correlation.py`

```
"""
This example demonstrates how to checks the column correlation for a set of
clustervectors and asserts that none of the columns are highly correlated
"""
# Import modules
import random
import numpy as np
from ase.db import connect
from icet import ClusterSpace
# Function for generating random structures
def generate_random_structure(atoms_prim, subelements, repeat=8):
"""
Generate a random structure with atoms_prim as a base
and fill it randomly with symbols from subelements
"""
atoms = atoms_prim.copy().repeat(repeat)
for at in atoms:
element = random.choice(subelements)
at.symbol = element
return atoms
# Function for generating cluster vectors
def generate_cv_set(n, atoms_prim, subelements, clusterspace, repeat=8):
"""
Generate a set of cluster vectors from a cluster space
"""
clustervectors = []
for i in range(n):
conf = generate_random_structure(atoms_prim, subelements, repeat)
cv = clusterspace.get_cluster_vector(conf)
clustervectors.append(cv)
return clustervectors
# Function for calculating column correlations
def get_column_correlation(i, j, cv_matrix):
"""
Returns the correlation between column i and j
cv_matrix: numpy matrix
"""
col_i = cv_matrix[:, i]
col_j = cv_matrix[:, j]
corr = np.dot(col_i, col_j) / \
(np.linalg.norm(col_i) * np.linalg.norm(col_j))
return corr
# Function for asserting that columns are not correlated
def assert_no_correlation(cvs, tol=0.99):
"""
Check that no column in cvs are above tolerance
"""
cvs_matrix = np.array(cvs)
for i in range(len(cvs[0])):
# Do not loop over zerolet since this is always ones
if i == 0:
continue
for j in range(len(cvs[0])):
if j <= i:
continue
corr = get_column_correlation(i, j, cvs_matrix)
assert corr < tol, 'columns {} and {} were correlated with'\
' {}'.format(i, j, corr)
# Create a list of the subelements that shall be considered and set the
# cutoff distance for singlets to 2.0 Å.
subelements = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8
# Test the correlation between columns for a set of structures in a
# previously generated database.
db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
atoms_row = row.toatoms()
atoms_id = row.id
atoms_form = row.formula
print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
atoms_id,
cutoffs))
atoms_row.wrap() # Wrap all atoms into the unit cell
cluster_space = ClusterSpace(atoms_row, cutoffs, subelements)
cvs = generate_cv_set(20, atoms_row, subelements, cluster_space, repeat)
assert_no_correlation(cvs)
print('Number of atoms: {} Length of cluster vector: {}'.format(
len(atoms_row.repeat(repeat)), len(cvs[0])))
```