# Cluster expansions¶

This page provides a short introduction to cluster expansions. For more extensive descriptions, please consult for example [SanDucGra84], [Wal09] or [AngMunRah19]. A simpler introduction to the subject can be found in the first part of this tutorial.

## Formalism¶

In the following, we are concerned with configurations corresponding to a distribution of \(M\) different species over \(N\) sites that can be written as a vector \(\boldsymbol{\sigma}\), whose elements can assume \(M\) different values, e.g., from \(S=\{-m, -m+1, \ldots m-1, m\}\) where \(M=2m\) (for even \(M\)) or \(M=2m+1\) (for odd \(M\)) [SanDucGra84]. One now seeks to represent a property \(Q\) of the system, such as the total energy, as a function of \(\boldsymbol{\sigma}\), i.e. \(Q = f(\boldsymbol{\sigma})\). To this end, one can construct a complete orthonormal basis of cluster functions \(\Gamma_{\alpha}(\boldsymbol{\sigma})\) [SanDucGra84] [San10], which allows one to express \(Q\) in the form [Wal09]

Here, the sum extends over all symmetry equivalent clusters (orbit)
\(\alpha\), \(m_{\alpha}\) denotes the multiplicity [1] whereas the
coefficients \(J_{\alpha}\) are the so-called effective cluster
interactions (ECIs). The last term in the above expression represents the
average over cluster functions \(\Gamma_{\alpha}(\boldsymbol{\sigma})\)
belonging to symmetry equivalent clusters (orbits). The cluster
functions themselves are obtained as a product over point functions
\(\Theta\). In **icet**, the point functions from [Wal09] are
used:

These point functions form an orthogonal set over all possible occupation numbers. For further details, please consult [Wal09] or [AngMunRah19].

In **icet**, the formalism is handled internally by the
`ClusterSpace`

class.

[1] | Note that some authors include \(m_{\alpha}\) in the symmetrized product over cluster functions \(\left<\Gamma_{\alpha'}(\boldsymbol{\sigma})\right>_{\alpha}\). |

## Cluster expansion construction¶

The task of training a CE can be formally written as a linear problem

where \(\boldsymbol{Q}\) is the vector of target properties, the vector \(\boldsymbol{J}\) represents the ECIs, and \(\mathbf{\bar{\Pi}}\) is a matrix that is obtained by stacking the vectors that represent each structure (also referred to in this documentation as cluster vectors) of the training set. This problem can be approached by choosing the number of structures \(n_{\boldsymbol{Q}}\) (and thus the dimensionality of \(\boldsymbol{Q}\)), to be (much) larger than the number of ECIs \(n_{\boldsymbol{J}}\) (and thus the dimensionality of \(\boldsymbol{J}\), (\(n_{\boldsymbol{Q}}>n_{\boldsymbol{J}}\)). The set of equations is thus overdetermined. The optimal set of ECIs for fixed training set and cluster function basis is then obtained by minimizing the \(l_2\)-norm of \(\mathbf{\bar{\Pi}} \boldsymbol{J} - \boldsymbol{Q}\)

Common algorithms [Wal09] then proceed by generating a series of CEs corresponding to different basis set choices, i.e. different values of \(n_{\boldsymbol{J}}\). By comparing the performance of each CE by means of a cross validation (CV) score the best performing CE is selected.

An alternative approach is to choose the number of ECIs larger than the number of structures (\(n_{\boldsymbol{J}}>n_{\boldsymbol{Q}}\)). The resulting problem is then underdetermined, but can be solved by means of compressive sensing [CanWak08] [NelHarZho13] [NelOzoRee13]. In practice, one obtains a sparse solution, i.e., a solution in which many ECIs are zero.

**icet** provides extensive functionality for the
present task, see further under Optimizers.