# Cluster expansions

This page provides a short introduction to cluster expansions. For more extensive descriptions, please consult, for example, [SanDucGra84], [Wal09], or [AngMunRah19]. A simpler introduction to the subject can be found in the first part of this tutorial.

## Formalism

In the following, we are concerned with configurations corresponding to a distribution of \(M\) different species over \(N\) sites that can be written as a vector \(\boldsymbol{\sigma}\), whose elements can assume \(M\) different values, e.g., from \(S=\{-m, -m+1, \ldots m-1, m\}\) where \(M=2m\) (for even \(M\)) or \(M=2m+1\) (for odd \(M\)) [SanDucGra84]. One now seeks to represent a property \(Q\) of the system, such as the total energy, as a function of \(\boldsymbol{\sigma}\), i.e., \(Q = f(\boldsymbol{\sigma})\). To this end, one can construct a complete orthonormal basis of cluster functions \(\Gamma_{\alpha}(\boldsymbol{\sigma})\) [SanDucGra84] [San10], which allows one to express \(Q\) in the form [Wal09]

\[
Q = \sum_{\alpha} m_{\alpha} J_{\alpha} \left\langle \Gamma_{\alpha}(\boldsymbol{\sigma}) \right\rangle_{\alpha}.
\]

Here, the sum extends over all symmetry equivalent clusters
(orbit) \(\alpha\), \(m_{\alpha}\) denotes the
multiplicity, whereas the coefficients \(J_{\alpha}\) are the
so-called effective cluster interactions (ECIs). The last term
in the above expression represents the average over cluster functions
\(\Gamma_{\alpha}(\boldsymbol{\sigma})\) belonging to symmetry
equivalent clusters. The cluster functions themselves are obtained as
a product over point functions \(\Theta\). In **icet**, the
point functions from [Wal09] are used:

\[
\Theta_{n}(\sigma) =
\begin{cases}
1 & n = 0 \\
-\cos\left(2\pi \lceil n/2 \rceil \sigma / M\right) & n~\text{odd} \\
-\sin\left(2\pi \lceil n/2 \rceil \sigma / M\right) & n~\text{even}
\end{cases}
\]

These point functions form an orthogonal set over all possible occupation numbers. For further details, please consult [Wal09] or [AngMunRah19].
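The orthogonality is straightforward to verify numerically. The sketch below assumes the trigonometric point functions of [Wal09] in the form given above, with occupation numbers relabeled as \(\sigma = 0, \ldots, M-1\); the helper `point_function` is hypothetical and not part of the icet API.

```python
import itertools
import math

def point_function(n, sigma, M):
    """Trigonometric point function Theta_n(sigma) for M species,
    following the form of [Wal09] (occupation numbers sigma = 0..M-1)."""
    if n == 0:
        return 1.0
    k = math.ceil(n / 2)
    arg = 2 * math.pi * k * sigma / M
    return -math.cos(arg) if n % 2 == 1 else -math.sin(arg)

# check pairwise orthogonality over all occupation numbers for M = 3:
# (1/M) * sum_sigma Theta_n(sigma) * Theta_n'(sigma) = 0 for n != n'
M = 3
for n, n_prime in itertools.combinations(range(M), 2):
    inner = sum(point_function(n, s, M) * point_function(n_prime, s, M)
                for s in range(M)) / M
    assert abs(inner) < 1e-12
```

Note that the set is orthogonal but not orthonormal: for odd \(M\) the diagonal inner products of the trigonometric functions equal \(1/2\) rather than \(1\).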

In **icet**, the formalism is handled internally by the `ClusterSpace` class.

## Cluster expansion construction

The task of training a cluster expansion (CE) can be formally written as a linear problem,

\[
\mathbf{\bar{\Pi}} \boldsymbol{J} = \boldsymbol{Q},
\]

where \(\boldsymbol{Q}\) is the vector of target properties, the vector \(\boldsymbol{J}\) represents the ECIs, and \(\mathbf{\bar{\Pi}}\) is a matrix that is obtained by stacking the vectors that represent each structure (also referred to in this documentation as cluster vectors) of the training set. Here, the multiplicities \(m_{\alpha}\) have been included in \(\mathbf{\bar{\Pi}}\).

This problem can be approached by choosing the number of structures \(n_{\boldsymbol{Q}}\) (and thus the dimensionality of \(\boldsymbol{Q}\)) to be (much) larger than the number of ECIs \(n_{\boldsymbol{J}}\) (and thus the dimensionality of \(\boldsymbol{J}\)), i.e., \(n_{\boldsymbol{Q}} > n_{\boldsymbol{J}}\). The set of equations is thus overdetermined. The optimal set of ECIs for a fixed training set and cluster function basis is then obtained by minimizing the \(l_2\)-norm of \(\mathbf{\bar{\Pi}} \boldsymbol{J} - \boldsymbol{Q}\),

\[
\boldsymbol{J}_{\text{opt}} = \underset{\boldsymbol{J}}{\operatorname{argmin}} \left\Vert \mathbf{\bar{\Pi}} \boldsymbol{J} - \boldsymbol{Q} \right\Vert_2.
\]
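As a numerical illustration of the overdetermined case, the sketch below solves the least-squares problem with NumPy. The matrix `Pi` merely stands in for \(\mathbf{\bar{\Pi}}\): it is filled with random numbers rather than real cluster vectors, and the "true" ECIs are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic fit matrix: 40 training structures, 10 cluster-vector
# elements (n_Q > n_J, i.e., the overdetermined case); the entries are
# purely illustrative, not actual cluster vectors
n_Q, n_J = 40, 10
Pi = rng.normal(size=(n_Q, n_J))
J_true = rng.normal(size=n_J)                    # ECIs used to generate data
Q = Pi @ J_true + 0.01 * rng.normal(size=n_Q)    # target properties + noise

# optimal ECIs in the l2 sense: minimize ||Pi @ J - Q||_2
J_fit, *_ = np.linalg.lstsq(Pi, Q, rcond=None)
```

With little noise and many more structures than parameters, the fitted ECIs closely reproduce the ones used to generate the data.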

Traditional algorithms [Wal09] then proceed by generating a series of CEs corresponding to different basis set choices, i.e., different values of \(n_{\boldsymbol{J}}\). By comparing the performance of these CEs by means of a cross-validation (CV) score, the best-performing CE is selected. An alternative approach is to choose the number of ECIs to be larger than the number of structures (\(n_{\boldsymbol{J}} > n_{\boldsymbol{Q}}\)). The resulting problem is then underdetermined but can be solved by means of compressive sensing [CanWak08] [NelHarZho13] [NelOzoRee13]. In practice, one obtains a sparse solution, i.e., a solution in which many ECIs are zero. It is also possible to employ Bayesian approaches [MueCed09] or genetic algorithms [BluHarWal05].
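The underdetermined, sparse case can be illustrated with iterative soft-thresholding (ISTA), one of the simplest solvers for the \(\ell_1\)-regularized (LASSO) problem; this is a self-contained sketch on synthetic data, not the solver used in the compressive-sensing references above.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1-norm: shrink each entry toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(Pi, Q, alpha=0.001, n_iter=5000):
    """Minimize 0.5 * ||Pi @ J - Q||^2 + alpha * ||J||_1 by iterative
    soft-thresholding (a basic compressive-sensing solver)."""
    L = np.linalg.norm(Pi, 2) ** 2   # Lipschitz constant of the gradient
    J = np.zeros(Pi.shape[1])
    for _ in range(n_iter):
        grad = Pi.T @ (Pi @ J - Q)
        J = soft_threshold(J - grad / L, alpha / L)
    return J

rng = np.random.default_rng(0)
n_Q, n_J = 20, 60                    # underdetermined: more ECIs than structures
Pi = rng.normal(size=(n_Q, n_J)) / np.sqrt(n_Q)
J_true = np.zeros(n_J)
J_true[[3, 17, 42]] = [1.0, -0.5, 0.8]   # only a few nonzero ECIs
Q = Pi @ J_true                      # noiseless synthetic targets
J_fit = ista(Pi, Q)
```

Despite having only 20 equations for 60 unknowns, the \(\ell_1\) penalty drives most fitted ECIs to zero and recovers the sparse underlying solution.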

**icet** is agnostic with respect to the optimization approach and can in principle be used in conjunction with any of these techniques.
For many applications, the **trainstation** package, whose documentation can be found here, provides a particularly simple interface; many of our examples use this package.