Cluster vector correlations

The purpose of this example is to demonstrate how to test the correlation between cluster vectors.

Import modules

In the present case it is necessary to import two icet classes, namely ClusterSpace and Structure. The corresponding objects are used to store information regarding a specific cluster space and structure, respectively. Additionally, the ASE function ase.db.connect will be used to extract structures from a previously prepared database.

import random
import numpy as np

from ase.db import connect
from icet import ClusterSpace


Function: Generate random structure

Since the intention is to calculate the correlation between the cluster vectors for a number of structures in a database, first several functions are defined that will be called in the loop over strucures described in the last section. The first function, generate_random_structure(), takes an ASE Atoms object, atoms_prim, creates a supercell with the dimensions specified by repeat and populates it randomly with species from the subelements list. In the end, the corresponding ASE Atoms object is returned.

def generate_random_structure(atoms_prim, subelements, repeat=8):
    """
    Generate a random structure with atoms_prim as a base
    and fill it randomly with symbols from subelements
    """

    atoms = atoms_prim.copy().repeat(repeat)

    for at in atoms:
        element = random.choice(subelements)
        at.symbol = element

    return atoms


Function: Generate CV set

The second function, generate_cv_set(), generates cluster vectors for n supercells, which have been constructed from the ASE Atoms object, atoms_prim and randomly populated with the elements in the subelements list, based on the ClusterSpace object cluster_space. Specifically, this is achieved by first calling the generate_random_structure() function, defined in the previous section, with the ASE Atoms object, atoms_prim, the subelements list and size of the supercell, repeat, as input arguments. The resulting configuration is subsequently converted to a Structure object, using the Structure.from_atoms() method. This structure is, thereafter, provided as an input argument to the get_cluster_vector method to extract the cluster vectors.

def generate_cv_set(n, atoms_prim, subelements, clusterspace, repeat=8):
    """
    Generate a set of cluster vectors from a cluster space
    """
    clustervectors = []
    for i in range(n):
        conf = generate_random_structure(atoms_prim, subelements, repeat)
        cv = clusterspace.get_cluster_vector(conf)
        clustervectors.append(cv)

    return clustervectors


Function: Get column correlation

The third function, get_column_correlation(), calculates the correlation between the columns i and j in the matrix cv_matrix, which should be a numpy.array object.

def get_column_correlation(i, j, cv_matrix):
    """
    Returns the correlation between column i and j

    cv_matrix: numpy matrix
    """
    col_i = cv_matrix[:, i]
    col_j = cv_matrix[:, j]

    corr = np.dot(col_i, col_j) / \
        (np.linalg.norm(col_i) * np.linalg.norm(col_j))

    return corr


Function: Check correlation

The fourth function, assert_no_correlation(), checks that the correlations between the columns in the matrix cvs are smaller than the tolerance, tol. This is achieved by calculating the correlations between all pairs of columns with help of the get_column_correlation(), which was defined earlier. If this value is not smaller than the tolerance, the number of the columns and the correlation is printed to the standard output.

def assert_no_correlation(cvs, tol=0.99):
    """
    Check that no column in cvs are above tolerance
    """
    cvs_matrix = np.array(cvs)
    for i in range(len(cvs[0])):
        # Do not loop over zerolet since this is always ones
        if i == 0:
            continue
        for j in range(len(cvs[0])):
            if j <= i:
                continue
            corr = get_column_correlation(i, j, cvs_matrix)
            assert corr < tol, 'columns {} and {} were correlated with'\
                ' {}'.format(i, j, corr)


Setup

Before starting the test a list of the relevant subelements, namely Pd, H, and V, of which the latter is used for representing vacancies, is defined. Also, the cutoff for pairs is set to 2.0 Å and it is decided that the cluster vectors be calculated for \(8\times8\times8\) supercells.

subelements = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8

Test the correlation

The functions defined above are now employed to test the cluster vector correlations for the first ten structures in the database PdHVac-fcc.db, which should already have been created using the script enumerate_structures.py. At each step of the loop, some basic information regarding the structure is first extracted and printed. Thereafter, a ClusterSpace object, cluster_space, is initiated using the ASE Atoms object atoms_row as well as the cutoffs list and the subelements list, specified in the previous section, as input arguments. Subsequently the previously defined function is called with n=20 and atoms_row, subelements and cluster_space as additional input arguments. The list of cluster vectors thus obtained is then fed into the assert_no_correlation() function, that was described earlier, to check if correlations between the columns are smaller than the tolerance specified previously.

db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
    atoms_row = row.toatoms()
    atoms_id = row.id
    atoms_form = row.formula
    print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
                                                                 atoms_id,
                                                                 cutoffs))
    atoms_row.wrap()  # Wrap all atoms into the unit cell
    cluster_space = ClusterSpace(atoms_row, cutoffs, subelements)

    cvs = generate_cv_set(20, atoms_row, subelements, cluster_space, repeat)
    assert_no_correlation(cvs)
    print('Number of atoms: {}    Length of cluster vector: {}'.format(
        len(atoms_row.repeat(repeat)), len(cvs[0])))

The result of running the script should be the following:

Testing structure: PdH (id=1) with cutoffs [2.0]
Number of atoms: 1024    Length of cluster vector: 6
Testing structure: PdV (id=2) with cutoffs [2.0]
Number of atoms: 1024    Length of cluster vector: 6
Testing structure: Pd2VH (id=3) with cutoffs [2.0]
Number of atoms: 2048    Length of cluster vector: 6
Testing structure: Pd2VH (id=4) with cutoffs [2.0]
Number of atoms: 2048    Length of cluster vector: 6
Testing structure: Pd3VH2 (id=5) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3V2H (id=6) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3VH2 (id=7) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3V2H (id=8) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3VH2 (id=9) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6
Testing structure: Pd3V2H (id=10) with cutoffs [2.0]
Number of atoms: 3072    Length of cluster vector: 6

Source code

The complete source code is available in examples/test_cluster_vector_correlation.py
"""
This example demonstrates how to checks the column correlation for a set of
clustervectors and asserts that none of the columns are highly correlated
"""

# Import modules
import random
import numpy as np

from ase.db import connect
from icet import ClusterSpace


# Function for generating random structures
def generate_random_structure(atoms_prim, subelements, repeat=8):
    """
    Generate a random structure with atoms_prim as a base
    and fill it randomly with symbols from subelements
    """

    atoms = atoms_prim.copy().repeat(repeat)

    for at in atoms:
        element = random.choice(subelements)
        at.symbol = element

    return atoms


# Function for generating cluster vectors
def generate_cv_set(n, atoms_prim, subelements, clusterspace, repeat=8):
    """
    Generate a set of cluster vectors from a cluster space
    """
    clustervectors = []
    for i in range(n):
        conf = generate_random_structure(atoms_prim, subelements, repeat)
        cv = clusterspace.get_cluster_vector(conf)
        clustervectors.append(cv)

    return clustervectors


# Function for calculating column correlations
def get_column_correlation(i, j, cv_matrix):
    """
    Returns the correlation between column i and j

    cv_matrix: numpy matrix
    """
    col_i = cv_matrix[:, i]
    col_j = cv_matrix[:, j]

    corr = np.dot(col_i, col_j) / \
        (np.linalg.norm(col_i) * np.linalg.norm(col_j))

    return corr


# Function for asserting that columns are not correlated
def assert_no_correlation(cvs, tol=0.99):
    """
    Check that no column in cvs are above tolerance
    """
    cvs_matrix = np.array(cvs)
    for i in range(len(cvs[0])):
        # Do not loop over zerolet since this is always ones
        if i == 0:
            continue
        for j in range(len(cvs[0])):
            if j <= i:
                continue
            corr = get_column_correlation(i, j, cvs_matrix)
            assert corr < tol, 'columns {} and {} were correlated with'\
                ' {}'.format(i, j, corr)


# Create a list of the subelements that shall be considered and set the
# cutoff distance for singlets to 2.0 Å.
subelements = ['Pd', 'H', 'V']
cutoffs = [2.0]
repeat = 8

# Test the correlation between columns for a set of structures in a
# previously generated database.
db = connect('PdHVac-fcc.db')
for row in db.select('id<=10'):
    atoms_row = row.toatoms()
    atoms_id = row.id
    atoms_form = row.formula
    print('Testing structure: {} (id={}) with cutoffs {}'.format(atoms_form,
                                                                 atoms_id,
                                                                 cutoffs))
    atoms_row.wrap()  # Wrap all atoms into the unit cell
    cluster_space = ClusterSpace(atoms_row, cutoffs, subelements)

    cvs = generate_cv_set(20, atoms_row, subelements, cluster_space, repeat)
    assert_no_correlation(cvs)
    print('Number of atoms: {}    Length of cluster vector: {}'.format(
        len(atoms_row.repeat(repeat)), len(cvs[0])))