# Cluster vector correlations¶

The purpose of this example is to demonstrate how one can test the correlation between structures via their cluster vectors.

## Import modules¶

We need to import two **icet** classes, namely
`ClusterSpace`

and `Structure`

. The corresponding
objects are used to store information regarding a specific cluster space and
structure, respectively. Additionally, the ASE function `db.connect`

will be
used to extract structures from a previously prepared database.

```
import random
import numpy as np
from ase.db import connect
from icet import ClusterSpace
```

## Helper functions¶

Since the intention is to calculate the correlation between the cluster vectors for a number of structures in a database, first several functions are defined that will be called in the loop over strucures described in the last section.

### Random structure generation¶

The first function, `generate_random_structure()`

, takes an
`Atoms`

object creates a supercell with the specified
dimensions, and populates it randomly with the allowed chemical
elements. In the end, a new `Atoms`

object is returned.

```
def generate_random_structure(atoms_prim, chemical_species, repeat=8):
"""
Generate a random structure with atoms_prim as a base
and fill it randomly with symbols from chemical_species
"""
atoms = atoms_prim.copy().repeat(repeat)
for at in atoms:
at.symbol = random.choice(chemical_species)
return atoms
```

### Calculation of cluster vectors¶

The second function, `generate_cv_set()`

, generates cluster vectors for a
number of supercells, which have been constructed from an `Atoms`

object and randomly populated with different species, based on a given
`ClusterSpace`

object. This is
achieved by first calling the `generate_random_structure()`

function,
defined in the previous section. The thus obtained configuration is
subsequently converted to a `Structure`

object using the
`from_atoms()`

method, which is forwarded as an input
argument to the `get_cluster_vector()`

method to extract
the cluster vector.

```
def generate_cv_set(n, atoms_prim, chemical_species, clusterspace, repeat=8):
"""
Generate a set of cluster vectors from a cluster space
"""
clustervectors = []
for i in range(n):
conf = generate_random_structure(atoms_prim, chemical_species, repeat)
cv = clusterspace.get_cluster_vector(conf)
clustervectors.append(cv)
return clustervectors
```

### Column correlation¶

The third function, `get_column_correlation()`

, calculates the correlation
between columns `i`

and `j`

in `cv_matrix`

.

```
def get_column_correlation(i, j, cv_matrix):
"""
Returns the correlation between column i and j
cv_matrix: numpy matrix
"""
col_i = cv_matrix[:, i]
col_j = cv_matrix[:, j]
corr = np.dot(col_i, col_j) / \
(np.linalg.norm(col_i) * np.linalg.norm(col_j))
return corr
```

### Check correlation¶

The fourth function, `assert_no_correlation()`

, checks that the
correlations between the columns in the matrix `cvs`

are smaller
than the tolerance, `tol`

. This is achieved by calculating the
correlation between all pairs of columns with help of the
`get_column_correlation()`

defined earlier. If this value is not
smaller than the tolerance, the number of the columns and the
correlation is written to stdout.

```
def assert_no_correlation(cvs, tol=0.99):
"""
Check that no column in cvs are above tolerance
"""
cvs_matrix = np.array(cvs)
for i in range(len(cvs[0])):
# Do not loop over zerolet since this is always ones
if i == 0:
continue
for j in range(len(cvs[0])):
if j <= i:
continue
corr = get_column_correlation(i, j, cvs_matrix)
assert corr < tol, 'columns {} and {} were correlated with'\
' {}'.format(i, j, corr)
```

## Setup¶

Before starting the test, a list of the relevant chemical symbols, namely Pd, H, and V, of which the latter is used for representing vacancies, is defined. Also, the cutoff for pairs is set to 2.0 Å and it is decided that the cluster vectors be calculated for \(8\times8\times8\) supercells.

```
chemical_species = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8
```

## Test correlation¶

The functions defined above are now employed to test the cluster
vector correlations for the first ten structures in the database
`PdHVac-fcc.db`

, created in a previous example. At each step of the loop, some basic
information regarding the structure is extracted. Thereafter, a
`ClusterSpace`

object is initiated. Subsequently a list
of cluster vectors is fed into the `assert_no_correlation()`

function to check if correlations between the columns are smaller than
the tolerance specified previously.

```
db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
atoms_row = row.toatoms()
atoms_id = row.id
atoms_form = row.formula
print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
atoms_id,
cutoffs))
atoms_row.wrap() # Wrap all atoms into the unit cell
cluster_space = ClusterSpace(atoms_row, cutoffs, chemical_species)
cvs = generate_cv_set(20, atoms_row, chemical_species,
cluster_space, repeat)
assert_no_correlation(cvs)
print('Number of atoms: {} Length of cluster vector: {}'.format(
len(atoms_row.repeat(repeat)), len(cvs[0])))
```

The script produces the following output:

```
Testing structure: PdH (id=1) with cutoffs [2.0]
Number of atoms: 1024 Length of cluster vector: 6
Testing structure: PdV (id=2) with cutoffs [2.0]
Number of atoms: 1024 Length of cluster vector: 6
Testing structure: Pd2VH (id=3) with cutoffs [2.0]
Number of atoms: 2048 Length of cluster vector: 6
Testing structure: Pd2VH (id=4) with cutoffs [2.0]
Number of atoms: 2048 Length of cluster vector: 6
Testing structure: Pd3VH2 (id=5) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3V2H (id=6) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3VH2 (id=7) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3V2H (id=8) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3VH2 (id=9) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
Testing structure: Pd3V2H (id=10) with cutoffs [2.0]
Number of atoms: 3072 Length of cluster vector: 6
```

## Source code¶

`examples/test_cluster_vector_correlation.py`

```
"""
This example demonstrates how to checks the column correlation for a set of
clustervectors and asserts that none of the columns are highly correlated
"""
# Import modules
import random
import numpy as np
from ase.db import connect
from icet import ClusterSpace
# Function for generating random structures
def generate_random_structure(atoms_prim, chemical_species, repeat=8):
"""
Generate a random structure with atoms_prim as a base
and fill it randomly with symbols from chemical_species
"""
atoms = atoms_prim.copy().repeat(repeat)
for at in atoms:
at.symbol = random.choice(chemical_species)
return atoms
# Function for generating cluster vectors
def generate_cv_set(n, atoms_prim, chemical_species, clusterspace, repeat=8):
"""
Generate a set of cluster vectors from a cluster space
"""
clustervectors = []
for i in range(n):
conf = generate_random_structure(atoms_prim, chemical_species, repeat)
cv = clusterspace.get_cluster_vector(conf)
clustervectors.append(cv)
return clustervectors
# Function for calculating column correlations
def get_column_correlation(i, j, cv_matrix):
"""
Returns the correlation between column i and j
cv_matrix: numpy matrix
"""
col_i = cv_matrix[:, i]
col_j = cv_matrix[:, j]
corr = np.dot(col_i, col_j) / \
(np.linalg.norm(col_i) * np.linalg.norm(col_j))
return corr
# Function for asserting that columns are not correlated
def assert_no_correlation(cvs, tol=0.99):
"""
Check that no column in cvs are above tolerance
"""
cvs_matrix = np.array(cvs)
for i in range(len(cvs[0])):
# Do not loop over zerolet since this is always ones
if i == 0:
continue
for j in range(len(cvs[0])):
if j <= i:
continue
corr = get_column_correlation(i, j, cvs_matrix)
assert corr < tol, 'columns {} and {} were correlated with'\
' {}'.format(i, j, corr)
# Create a list of the chemical_species that shall be considered and set the
# cutoff distance for singlets to 2.0 Å.
chemical_species = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8
# Test the correlation between columns for a set of structures in a
# previously generated database.
db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
atoms_row = row.toatoms()
atoms_id = row.id
atoms_form = row.formula
print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
atoms_id,
cutoffs))
atoms_row.wrap() # Wrap all atoms into the unit cell
cluster_space = ClusterSpace(atoms_row, cutoffs, chemical_species)
cvs = generate_cv_set(20, atoms_row, chemical_species,
cluster_space, repeat)
assert_no_correlation(cvs)
print('Number of atoms: {} Length of cluster vector: {}'.format(
len(atoms_row.repeat(repeat)), len(cvs[0])))
```