Cluster vector correlations

The purpose of this example is to demonstrate how one can test the correlation between structures via their cluster vectors.

Import modules

We need to import two icet classes, namely ClusterSpace and Structure. The corresponding objects are used to store information regarding a specific cluster space and structure, respectively. Additionally, the ASE function db.connect will be used to extract structures from a previously prepared database.

import random
import numpy as np

from ase.db import connect
from icet import ClusterSpace


Helper functions

Since the intention is to calculate the correlation between the cluster vectors for a number of structures in a database, first several functions are defined that will be called in the loop over strucures described in the last section.

Random structure generation

The first function, generate_random_structure(), takes an Atoms object creates a supercell with the specified dimensions, and populates it randomly with the allowed chemical elements. In the end, a new Atoms object is returned.

def generate_random_structure(atoms_prim, chemical_species, repeat=8):
    """
    Generate a random structure with atoms_prim as a base
    and fill it randomly with symbols from chemical_species
    """

    atoms = atoms_prim.copy().repeat(repeat)

    for at in atoms:
        at.symbol = random.choice(chemical_species)

    return atoms


Calculation of cluster vectors

The second function, generate_cv_set(), generates cluster vectors for a number of supercells, which have been constructed from an Atoms object and randomly populated with different species, based on a given ClusterSpace object. This is achieved by first calling the generate_random_structure() function, defined in the previous section. The thus obtained configuration is subsequently converted to a Structure object using the from_atoms() method, which is forwarded as an input argument to the get_cluster_vector() method to extract the cluster vector.

def generate_cv_set(n, atoms_prim, chemical_species, clusterspace, repeat=8):
    """
    Generate a set of cluster vectors from a cluster space
    """
    clustervectors = []
    for i in range(n):
        conf = generate_random_structure(atoms_prim, chemical_species, repeat)
        cv = clusterspace.get_cluster_vector(conf)
        clustervectors.append(cv)

    return clustervectors


Column correlation

The third function, get_column_correlation(), calculates the correlation between columns i and j in cv_matrix.

def get_column_correlation(i, j, cv_matrix):
    """
    Returns the correlation between column i and j

    cv_matrix: numpy matrix
    """
    col_i = cv_matrix[:, i]
    col_j = cv_matrix[:, j]

    corr = np.dot(col_i, col_j) / \
        (np.linalg.norm(col_i) * np.linalg.norm(col_j))

    return corr


Check correlation

The fourth function, assert_no_correlation(), checks that the correlations between the columns in the matrix cvs are smaller than the tolerance, tol. This is achieved by calculating the correlation between all pairs of columns with help of the get_column_correlation() defined earlier. If this value is not smaller than the tolerance, the number of the columns and the correlation is written to stdout.

def assert_no_correlation(cvs, tol=0.99):
    """
    Check that no column in cvs are above tolerance
    """
    cvs_matrix = np.array(cvs)
    for i in range(len(cvs[0])):
        # Do not loop over zerolet since this is always ones
        if i == 0:
            continue
        for j in range(len(cvs[0])):
            if j <= i:
                continue
            corr = get_column_correlation(i, j, cvs_matrix)
            assert corr < tol, 'columns {} and {} were correlated with'\
                ' {}'.format(i, j, corr)


Setup

Before starting the test, a list of the relevant chemical symbols, namely Pd, H, and V, of which the latter is used for representing vacancies, is defined. Also, the cutoff for pairs is set to 2.0 Å and it is decided that the cluster vectors be calculated for \(8\times8\times8\) supercells.

chemical_species = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8

Test correlation

The functions defined above are now employed to test the cluster vector correlations for the first ten structures in the database PdHVac-fcc.db, created in a previous example. At each step of the loop, some basic information regarding the structure is extracted. Thereafter, a ClusterSpace object is initiated. Subsequently a list of cluster vectors is fed into the assert_no_correlation() function to check if correlations between the columns are smaller than the tolerance specified previously.

db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
    atoms_row = row.toatoms()
    atoms_id = row.id
    atoms_form = row.formula
    print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
                                                                 atoms_id,
                                                                 cutoffs))
    atoms_row.wrap()  # Wrap all atoms into the unit cell
    cluster_space = ClusterSpace(atoms_row, cutoffs, chemical_species)

    cvs = generate_cv_set(20, atoms_row, chemical_species,
                          cluster_space, repeat)
    assert_no_correlation(cvs)
    print('Number of atoms: {}    Length of cluster vector: {}'.format(
        len(atoms_row.repeat(repeat)), len(cvs[0])))

The script produces the following output:

Testing structure: PdH (id=1) with cutoffs [2.0]
Number of atoms: 1024    Length of cluster vector: 6
Testing structure: PdV (id=2) with cutoffs [2.0]
Number of atoms: 1024    Length of cluster vector: 6
Testing structure: Pd2VH (id=3) with cutoffs [2.0]
Number of atoms: 2048    Length of cluster vector: 6
Testing structure: Pd2VH (id=4) with cutoffs [2.0]
Number of atoms: 2048    Length of cluster vector: 6
Testing structure: Pd3VH2 (id=5) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3V2H (id=6) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3VH2 (id=7) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3V2H (id=8) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3VH2 (id=9) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3V2H (id=10) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6

Source code

The complete source code is available in examples/test_cluster_vector_correlation.py
"""
This example demonstrates how to checks the column correlation for a set of
clustervectors and asserts that none of the columns are highly correlated
"""

# Import modules
import random
import numpy as np

from ase.db import connect
from icet import ClusterSpace


# Function for generating random structures
def generate_random_structure(atoms_prim, chemical_species, repeat=8):
    """
    Generate a random structure with atoms_prim as a base
    and fill it randomly with symbols from chemical_species
    """

    atoms = atoms_prim.copy().repeat(repeat)

    for at in atoms:
        at.symbol = random.choice(chemical_species)

    return atoms


# Function for generating cluster vectors
def generate_cv_set(n, atoms_prim, chemical_species, clusterspace, repeat=8):
    """
    Generate a set of cluster vectors from a cluster space
    """
    clustervectors = []
    for i in range(n):
        conf = generate_random_structure(atoms_prim, chemical_species, repeat)
        cv = clusterspace.get_cluster_vector(conf)
        clustervectors.append(cv)

    return clustervectors


# Function for calculating column correlations
def get_column_correlation(i, j, cv_matrix):
    """
    Returns the correlation between column i and j

    cv_matrix: numpy matrix
    """
    col_i = cv_matrix[:, i]
    col_j = cv_matrix[:, j]

    corr = np.dot(col_i, col_j) / \
        (np.linalg.norm(col_i) * np.linalg.norm(col_j))

    return corr


# Function for asserting that columns are not correlated
def assert_no_correlation(cvs, tol=0.99):
    """
    Check that no column in cvs are above tolerance
    """
    cvs_matrix = np.array(cvs)
    for i in range(len(cvs[0])):
        # Do not loop over zerolet since this is always ones
        if i == 0:
            continue
        for j in range(len(cvs[0])):
            if j <= i:
                continue
            corr = get_column_correlation(i, j, cvs_matrix)
            assert corr < tol, 'columns {} and {} were correlated with'\
                ' {}'.format(i, j, corr)


# Create a list of the chemical_species that shall be considered and set the
# cutoff distance for singlets to 2.0 Å.
chemical_species = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8

# Test the correlation between columns for a set of structures in a
# previously generated database.
db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
    atoms_row = row.toatoms()
    atoms_id = row.id
    atoms_form = row.formula
    print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
                                                                 atoms_id,
                                                                 cutoffs))
    atoms_row.wrap()  # Wrap all atoms into the unit cell
    cluster_space = ClusterSpace(atoms_row, cutoffs, chemical_species)

    cvs = generate_cv_set(20, atoms_row, chemical_species,
                          cluster_space, repeat)
    assert_no_correlation(cvs)
    print('Number of atoms: {}    Length of cluster vector: {}'.format(
        len(atoms_row.repeat(repeat)), len(cvs[0])))