# Background¶

(expand and add more references; some references to topics such as cluster functions are intentionally left broken as a reminder to cover these things)

## Cluster expansions¶

In the following, we are concerned with configurations corresponding to a distribution of $$M$$ different species over $$N$$ sites that can be written as a vector $$\boldsymbol{\sigma}$$, whose elements can assume $$M$$ different values, e.g., from $$S=\{-m, -m+1, \ldots m-1, m\}$$ where $$M=2m$$ (for even $$M$$) or $$M=2m+1$$ (for odd $$M$$) [SanDucGra84]. One now seeks to represent a property $$Q$$ of the system, such as the total energy, as a function of $$\boldsymbol{\sigma}$$, i.e. $$Q = f(\boldsymbol{\sigma})$$. To this end, one can construct a complete orthonormal basis of cluster functions $$\Gamma_{\alpha}(\boldsymbol{\sigma})$$ [SanDucGra84] [San10], which allows one to express $$Q$$ in the form [Wal09]

$Q = \sum_\alpha m_\alpha J_\alpha \left<\Gamma_{\alpha'}(\boldsymbol{\sigma})\right>_{\alpha}.$

Here, the sum extends over all symmetry equivalent clusters (orbit) $$\alpha$$, $$m_{\alpha}$$ denotes the multiplicity [1] whereas the coefficients $$J_{\alpha}$$ are the so-called effective cluster interactions (ECIs). The last term in the above expression represents the average over cluster functions $$\Gamma_{\alpha}(\boldsymbol{\sigma})$$ belonging to symmetry equivalent clusters (orbits). The cluster functions themselves are obtained as a product over basis functions $$\Theta$$ as described in the section detailing the construction of cluster functions.

 [1] Note that some authors include $$m_{\alpha}$$ in the symmetrized product over cluster functions $$\left<\Gamma_{\alpha'}(\boldsymbol{\sigma})\right>_{\alpha}$$.

## CE construction¶

The task of training a CE can be formally written as a linear problem

$\mathbf{\bar{\Pi}} \boldsymbol{J} = \boldsymbol{Q}$

where $$\boldsymbol{Q}$$ is the vector of target properties, the vector $$\boldsymbol{J}$$ represents the ECIs, and $$\mathbf{\bar{\Pi}}$$ is a matrix that is obtained by stacking the vectors that represent each structure (also referred to in this documentation as cluster vectors) of the training set. This problem can be approached by choosing the number of structures $$n_{\boldsymbol{Q}}$$ (and thus the dimensionality of $$\boldsymbol{Q}$$), to be (much) larger than the number of ECIs $$n_{\boldsymbol{J}}$$ (and thus the dimensionality of $$\boldsymbol{J}$$, ($$n_{\boldsymbol{Q}}>n_{\boldsymbol{J}}$$). The set of equations is thus overdetermined. The optimal set of ECIs for fixed training set and cluster function basis is then obtained by minimizing the $$l_2$$-norm of $$\mathbf{\bar{\Pi}} \boldsymbol{J} - \boldsymbol{Q}$$

$\boldsymbol{J}_{\text{opt}} = \arg\min_{\boldsymbol{J}} \left\{ || \mathbf{\bar{\Pi}} \boldsymbol{J} - \boldsymbol{Q} ||_2 \right\}.$

Common algorithms [Wal09] then proceed by generating a series of CEs corresponding to different basis set choices, i.e. different values of $$n_{\boldsymbol{J}}$$. By comparing the performance of each CE by means of a cross validation (CV) score the best performing CE is selected.