E. de La Fortelle, J. Irwin

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH England

*eric@mrc-lmb.cam.ac.uk, ji10@mrc-lmb.cam.ac.uk
http://Lagrange.mrc-lmb.cam.ac.uk*

and G. Bricogne

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH England
and LURE, bât 209d, F-91405 Orsay Cedex, France

*gb10@mrc-lmb.cam.ac.uk http://gerard2.mrc-lmb.cam.ac.uk*

**Abstract**

*The problem of estimating heavy-atom parameters (esp. occupancies)
from acentric reflexions in the MIR method has a long history of difficulties,
and a conceptually satisfactory solution allowing bias-free refinement of all
parameters (including the lack of isomorphism) has only recently been obtained
by a recourse to the method of maximum-likelihood estimation. The situation is
essentially identical in the case of MAD phasing. The maximum-likelihood method
needs to be invoked to exploit incomplete phase information in a heavy-atom
parameter refinement while preventing that information from biasing the
results.*

*We have designed and written from scratch a computer program
- SHARP (Statistical Heavy-Atom Refinement and Phasing) - that fully
implements the maximum-likelihood approach. It can refine simultaneously scale,
a model for the lack of isomorphism and all heavy-atom parameters from MIR and
MAD data, or any mixture of them. The program has been systematically tested,
both on synthetic and on measured data, and compared to MLPHARE. The results
show the superiority of our approach, especially in cases of low
signal-to-noise ratio. The likelihood function has also been used as a
detection tool to plot residual Fourier maps and probe for minor sites, and for
the calculation of phase probability distributions encoded in
Hendrickson-Lattman coefficients.*

At the same time as the first attempts were being made to use phase estimates, an alternative refinement scheme was devised by Rossmann [7], based on a difference-Patterson correlation criterion, and evolved towards the "FHLE method" [8],[9], and finally the "origin-removed Patterson-correlation function" [10]. Here the use of acentric phase estimates is avoided altogether, but at the price of impoverishing the available information in the sense that multiple derivatives are not allowed to assist each other's refinement through the generation of phase information.

Sygusch [11] recognized that a middle-ground could perhaps
be found if the
acentric phases were no longer deemed to be "estimates", but were instead
treated as extra parameters and refined along with the others. Unfortunately,
the enormous increase in the number of variables dictated the use of a diagonal
approximation, which rather defeated the original purpose of accommodating the
correlations between phases and parameters. Bricogne [12],
[13] proposed a
solution that partially overcame these difficulties. The main idea was that
structure-factor estimates for acentric reflexions are *implicit
functions* of the parameters that are being refined. This dependence was
shown to result (*via* the chain rule) in a correction to the partial
derivatives from which the normal equations of the least-squares method are to
be formed. Many previously observed pathologies, such as the rapid divergence
of the site occupancies of good derivatives, were cured by this analysis, but
slower-moving instabilities were observed that resulted in divergent behaviour
of the estimates for the *lack of isomorphism* of the various derivatives.
Moreover, the problem of bimodality remained.

At this point, compliance with the first cardinal rule of the least-squares
method had been essentially restored, but attention was drawn to the violation
of a second cardinal rule : the inverse-variance 'weights' in the expression
for the least-squares residual should be kept *fixed* as if they were part
of the observed data. Since the method of least-squares is a special case of
the maximum-likelihood method when errors are normally distributed with fixed
(co)variances, it is clear that the problem of properly estimating the
lack-of-isomorphism parameters demanded a fully-fledged maximum-likelihood
treatment rather than least-squares.

Perusal of the literature shows that two-dimensional statistical 'phasing'
(probability distribution on the phase and on the modulus of the native
structure factor) had been considered as early as 1970 [14],
leading to the
first mention of likelihood in this context by Einsein [15].
The first mention
of parameter estimation by maximum-likelihood, in a very limited context, is
found in Green [16]. Maximum-likelihood (ML) refinement
for heavy-atom
parameters was then advocated by Bricogne
[17],[18],[19],
Read [20], and an approximation to it was implemented by
Otwinowski [21] in the program MLPHARE.
This program is only a partial
implementation of ML refinement - best described as 'phase-integrated
least-squares' - in the sense that (i) it integrates the exponential of the
least-squares residual and its partial derivatives only over the phase of the
native structure factor (not over its modulus) ; and that (ii) the lack of
isomorphism is still *re-estimated* at the end of each refinement cycle
rather than being *refined*, and may often converge to non-optimal values.
Nevertheless, this approach has been shown in numerous cases to yield better
results than earlier refinements, drawing attention to the potential of
maximum-likelihood methods.

The maximum-likelihood formalism outlined in Bricogne [22] for the MIR and SIR cases forms the basis of the present work. We will describe here its extension to probability distributions incorporating anomalous diffraction effects as well as measurement error and non-isomorphism. Integrating these distributions in the whole complex plane leads to likelihood functions that can be used for heavy-atom detection and refinement, and for producing phase probability distributions. We will also describe the current implementation of this formalism in a computer program, named SHARP (for Statistical Heavy-Atom Refinement and Phasing) [23].

More specifically, a least-squares (LS) model is always formulated as a
prescription for turning given values of model parameters into 'calculated'
(error-free) values to be compared with the observables. Error estimates are
obtained *a posteriori*, by examining the residual discrepancy between the
'calculated' and the 'observed' quantities. By contrast, a likelihood-based
model casts its predictions directly in the form of probability distributions
for the observables, the quantities called 'calculated' in the LS formalism
usually appearing as parameters in these distributions.

For a centric reflexion, the probablity distribution becomes one-dimensional, but the theory is essentially similar.

This probability distribution is then transformed into a likelihood distribution for that reflexion, via the simple rule (in the absence of prior phase information) :

( {g} , **F**^{P}_{*}(**h**) )
= p( **F**^{P}_{*}(**h**) | {g} )

Note that this equation is valid at each *trial point*
**F**^{P}_{*}(**h**) in the Harker plane. In order to have a
likelihood function that is independent of assumptions on the native complex
structure factor, we must now integrate the likelihood function over all
possible values of **F**^{P}_{*}(**h**) :

( {g} ) =
_{}
( {g} , **F**^{P}_{*}(**h**) )
d^{2}**F**^{P}_{*})

In the case of a centric reflexion, the integration is one-dimensional only, along the axis defined by the centric phase.

Future developments will incorporate a parametrisation of the anisotropy of anomalous scattering [24],[25] and will allow a refinement of the corresponding parameters from unmerged data carrying suitable goniometric information for each measurement.

k_{j}(**h**) = K^{sc}_{j} exp[-1/4
B(^{sc}_{j})(d*)^{2})
exp[-(_{}
b^{p,q}_{j}**h**^{p}**h**^{q})]

This error can be broken down in three main components :

* The measurement error, that is part of the crystallographic data and not refined.

* The physical lack-of-isomorphism error.

In the absence of structural evidence for 'localised' lack-of-isomorphism, our assumption will be that of Luzzati [26] that there is a random isotropic positional perturbation, with spatially uniform mean amplitude and normal distribution, over the whole asymmetric unit. Based on this hypothesis, following the work of Read [27] and Dumas [28], we used a one-parameter model for the physical lack-of-isomorphism variance, increasing with resolution.

* The model error.

This error has the same effect on the statistical distribution of the native structure factor as the previous one, but its variance is approximately decreasing with resolution as the mean intensity of remaining heavy atoms. We used a two-parameter model (a constant and a temperature factor) for this error.

A similar parametrisation is used for the error on the anomalous differences. Although there is no physical basis for adopting the same model, it was thought flexible enough as a function of resolution to fit to more diverse functions of resolution.

These maps enable the detection of minor sites, and perform this task in an optimal fashion because they take into account the full unbiased phase information available from the data at the current stage of refinement. They are essentially Fourier syntheses calculated from inverse-variance weighted difference coefficients between the derivative and native data. Their enhanced sensitivity to any departure from the current heavy-atom model (when the data are accurate enough, and to high enough resolution) makes them the instrument of choice to detect more subtle features, such as anisotropy in the heavy-atom temperature factors or structural disorder at certain sites.

In order to offer *ab initio* detection capability, another type of map
will be added to the existing program. Its coefficients will initially involve
second-order derivatives of the log-likelihood
function associated to the null hypothesis defined by "all intensity
differences between data sets are caused by lack of isomorphism". This map will
have the character of a Buerger sum function over a weighted
difference-Patterson function [30]. As major sites are
detected and included in
the substitution model, the log-likelihood function will develop first-order
derivatives giving rise to a difference-Fourier component in the residual map,
while the revised second-order derivatives will keep contributing a component
with the character of a sum function over a residual difference-Patterson.

The whole process of detecting sites and of assessing their significance quantitatively can thus be automated, using the log-likelihood gain referred to the null hypothesis as a scoring criterion for the peak-search. The procedure will stop when the highest remaining peak in the residual maps is essentially at the level of the noise.

Once all heavy atoms have been detected and refined, the remaining features in the 'isomorphous' residual maps, if they are significant, can provide the basis for a systematic study of lack of isomorphism. This could improve the rather crude way in which 'global' and 'local' lack of isomorphism have hitherto been described.

The result is a forms-based interface, written in HTML language and processed by Perl scripts, that exactly mirrors the hierarchy of parameters during the buildup of the parameters file, and that connects automatically to Graphical Helper Applications to facilitate inspection of the output.

In the same way, the documentation can be accessed in a context-sensitive manner by clicking on hyperlinks called 'explanation', scattered at all points of interest in the master file and in secondary details files.

Graphical applications are triggered through a Unix "mailcap" mechanism, that relies on the extension of a file name to determine what program to use for visualising the contents. All statistics relative to the data (histograms of intensity, of isomorphous and anomalous differences) and to the phasing (lack of isomorphism statistics, phasing power and Rcullis) can be visualised that way, and maps can be plotted in programs npo [31] or O [32] by pressing a button in the interface, without having to program specific instructions for these graphical tools.

The starting heavy-atom model consisted in two selenium atoms with isotropic thermal motion. Refinement of this model showed that, consistently with the results of other refinement procedures, the second selenium atom had a high temperature factor (around 60). Once the refinement was completed, the residual maps showed strong anisotropic features for the first selenium site and weaker anisotropy for the second. We consequently updated the heavy-atom model by allowing an anisotropic temperature factor for both seleniums atoms. The resulting residual map showed much less features above the noise level, except for a 10 peak at 1.8 Å distance from the first selenium site. The second update of the heavy-atom model allowed for a third selenium atom with an isotropic temperature factor, that refined to a low occupancy (0.2). The remarkable result was that the added occupancies of site 1 and site 3 were equal to the the occupancy of site 2 within the standard deviation of this parameter. This observation, added to the small distance between site 1 and site 3, shows that this methionine residue has a double conformation.

We then used the density modification program SOLOMON [35] to improve the phases, assuming that it would yield better results when the input phase probability distributions (encoded as Hendrickson-Lattman coefficients) are statistically more accurate. The density modification prodedure for both SHARP and MLPHARE was exactly similar. The results are summarized in Table 1.

The starting heavy-atom model consisted of two mercury sites, for which coordinates, occupancy and temperature factor were determined in a first round of refinement. The residual map plotted at the end of this refinement showed strong anisotropy features for both sites, and a suspicion of a double position for site 1. The anisotropy was refined first, and the subsequent residual map clearly showed that the cysteine residue to which the first mercury was bound had a double conformation. Once this was taken into account in the heavy-atom model in a third round of refinement, the residual map showed no more significant features, thus proving that the refinement was complete. The resulting map, after density modification in SOLOMON, was of high quality (see Table 2).

Interestingly, in this case the anomalous residual map yielded a much clearer
information that the isomorphous redsidual map. This was confirmed by the
phasing power statistics, which showed that, due to significant lack of
isomorphism, the anomalous signal contributed far more to the phasing than the
isomorphous signal. The whole procedure of refinement and phasing was then
started again from the same initial assumptions, but *without using the
native data*. Heavy-atom refinement yielded the same results, and the
residual maps allowed unambiguous detection of both the anisotropic thermal
motion and the double conformation. Phasing of this "Single-Wavelength
Anomalous" dataset, followed by the same solvent-flattening procedure, yielded
an interpretable electron-density map, although of a lesser quality than the
SIRAS map (see Table 2).

The test of using the anomalous scattering of a derivative by itself, in the
second example, is of special interest. It was not useful for the determination
of the structure in that particular case because the isomorphism was relatively
good between the native and mercury derivative crystals. It shows nonetheless
that, in cases of very strong non-isomorphism, a well-substituted derivative
can be used *by itself* to provide phase information, if the anomalous
signal is strong. In such a case of complete bimodality in the phase
distribution of acentric reflexions, the main purpose of the density
modification procedure is to select the right mode. SOLOMON seems to perform
this task for most reflexions thanks to the envelope constraints.

[3] R.G. Hart *Acta Cryst.* **14**, pp.1194-1195, 1961.

[22] G. Bricogne *op. cit,* 1991.

[24] D.H. Templeton & L.K. Templeton *Acta Cryst.* A**38**, 62-67,
1982.

[25] L.K. Templeton & D.H. Templeton *Acta Cryst.* A**44**,
1045-1051, 1988.

[26] V. Luzzati *Acta Cryst.* **5**, 802-810, 1952.

[36] S. Price & K. Nagai, unpublished results.

** Glossary** :

**FOM** is the mean figure of merit in that resolution bin.

**DELTAPHI** is the mean phase difference, *weighted by amplitude and FOM*,
in that resolution bin.

**CORREL** is a reciprocal-space correlation coefficient between complex structure
factors. By Parseval's theorem it is equivalent to a real-space correlation coefficient
in that resolution bin.

Resolution ALL 50.0 5.25 3.73 3.05 2.64 2.36 2.16 2.00 1.87 (Å)

SHARP refinement and phasing, density modification with SOLOMON

FOM 0.90 0.84 0.91 0.91 0.90 0.90 0.90 0.90 0.89 <DELTAPHI> 30.5 39.1 25.3 29.5 32.0 30.6 29.3 30.9 32.2 CORREL 0.80 0.70 0.86 0.81 0.77 0.80 0.82 0.80 0.78

MLPHARE refinement and phasing, density modification with SOLOMON

FOM 0.90 0.84 0.91 0.91 0.90 0.90 0.90 0.90 0.89 <DELTAPHI> 30.5 39.1 25.3 29.5 32.0 30.6 29.3 30.9 32.2 CORREL 0.80 0.70 0.86 0.81 0.77 0.80 0.82 0.80 0.78

Table 1 : Quality of IF3-C MAD phasing, in comparison with the refined model

Resolution ALL 50.0 7.83 5.57 4.56 3.95 3.54 3.23 2.99 2.80 (Å)

SHARP refinement and phasing, density modification with SOLOMON - SIRAS data

FOM 0.90 0.89 0.95 0.96 0.96 0.95 0.92 0.89 0.78 <DELTAPHI> 43.3 38.9 35.9 32.6 36.8 42.6 50.8 57.8 64.6 CORREL 0.66 0.64 0.74 0.78 0.73 0.66 0.55 0.46 0.37

SHARP refinement and phasing, density modification with SOLOMON - SAD data

FOM 0.90 0.86 0.93 0.95 0.94 0.95 0.92 0.88 0.82 <DELTAPHI> 57.0 58.2 50.0 48.2 50.7 55.8 62.9 68.0 72.2 CORREL 0.49 0.45 0.57 0.60 0.56 0.49 0.39 0.32 0.26

Table 2 : Quality of U2 SIRAS phasing and SAD (Single-Wavelength Anomalous
Diffraction) phasing, with SHARP