% File: [CCP4] / ccp4 / manual / refine.tex
% Revision: 1.17, Wed Nov 16 15:46:42 2005 UTC, by mdw; branch: MAIN
% Changes since 1.16: +0 -5 lines
% Ian Tickle says this is not true (bugzilla 451) so remove it.
\chapter{Refinement and validation}
\begin{chapquote}
{\em I} don't believe there's an atom of meaning in it.\\
--- {\em Alice's Adventures in Wonderland}
\end{chapquote}
Refinement has been covered in three Study Weekends
\cite{study-w/e-80,study-w/e-89,study-w/e-96};
see also \cite{tronrud-94}. The principal \ccp4 refinement
program is now \cprog{refmac} which is undergoing active
development. Information in this chapter is liable to become
out-of-date, and the reader is recommended to read the latest
\cprog{refmac} documentation.
\section{Least squares structure
refinement}\index{refinement!least-squares}
\index{refinement!Hendrickson--Konnert}
The \ccp4 suite provides the program \cprog{restrain} for least
squares structure refinement. The old program prolsq has
now been made obsolete by the Maximum Likelihood program \cprog{refmac},
see section~\ref{refinement:maxlikelihood}.
The least-squares program \cprog{restrain} incorporates
a number of differences from \cprog{refmac} (see
section~\ref{refinement:comparison}) and may be useful.
Alternative refinement programs include \prog{TNT}, \prog{CNS} and
\prog{SHELX93}.
\subsection{Comparison of \texttt{restrain} and \texttt{prolsq}}
\label{refinement:comparison}
% from Ian Tickle
Here is some discussion of the differences between the least squares refinement
program \cprog{restrain} and the obsolete prolsq. Most of these points also
apply to a comparison of \cprog{restrain} and \cprog{refmac}.
\cprog{restrain} does constrained anisotropic thermal parameter refinement
using the TLS\index{TLS refinement} (translation/libration/screw-rotation)
model. Unconstrained anisotropic refinement is not feasible without atomic resolution
data (i.e.\ 1\,\AA\ or better), so this is out of the question for all but a
handful of very small proteins. The end result of the TLS analysis should
give some insight into secondary structure or domain motions.
\cprog{refmac5} now does TLS refinement as well.
A major difference of approach is that prolsq uses an FFT for
structure factor and derivative calculations, whereas \cprog{restrain} uses
slow Fourier transforms (FTs). This means that prolsq takes many cycles ($\sim$50) to
converge but each cycle is very fast, whereas \cprog{restrain} takes only a
few cycles ($\sim$5), but each one is much slower. Normally prolsq
has the advantage here.
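For orientation, the ``slow FT'' is taken here to mean direct summation of
the structure factor equation; this reading, and the symbols below, are
assumptions made for illustration rather than a description of the
\cprog{restrain} source code. For the $N$ atoms in the unit cell, with
fractional coordinates $(x_j, y_j, z_j)$, scattering factors $f_j$ and
isotropic $B$ values $B_j$,
\[
F_\mathrm{c}(hkl) = \sum_{j=1}^{N} f_j
  \exp\left(-B_j \frac{\sin^2\theta}{\lambda^2}\right)
  \exp\left[2\pi i (h x_j + k y_j + l z_j)\right].
\]
Evaluating this sum directly costs of the order of $N \times N_\mathrm{refl}$
operations per cycle for $N_\mathrm{refl}$ reflexions. An FFT-based
calculation instead samples the model density on a grid of
$N_\mathrm{grid}$ points and transforms it in a time roughly proportional to
$N_\mathrm{grid}\log N_\mathrm{grid}$, which is why each FFT-based cycle is
so much cheaper.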
\cprog{restrain}'s functionality is much like prolsq's with some small differences:
\begin{itemize}
\item prolsq treats all main-chain peptide residues as though they had
identical geometry; recent data indicate that glycine and proline are
different from the others.
% (anyone with a basic knowledge of chemistry would think that would be
% obvious!).
\cprog{restrain} treats them differently.
\item \cprog{restrain} uses individual distance constraint weights based on
the estimates of the standard deviations of \cite{engh/huber-91} (these
values are all in the dictionary together with the ideal distances).
prolsq uses blanket values for the weights, because its dictionary
doesn't contain the s.d.s.
% What precisely is the effect of this though, I'm not sure.
\item \cprog{restrain} has a coupled occupancy refinement option for
disordered sidechains.%; I don't think prolsq does this.
\item \cprog{restrain} has a full-matrix option for estimating individual
positional standard deviations. However it requires a \emph{lot} of memory,
and at present needs to be compiled with a re-parameterised include file.
% Sometime I'll incorporate your dynamic memory routines to make this more
% user-friendly.
\item
%I think
prolsq has the option of applying non-bonded intermolecular
repulsion restraints (i.e.\ between symmetry-related molecules) as well as
the intramolecular ones. At present \cprog{restrain} only applies the
intramolecular repulsions.
\item %I think
There is a subtle difference in the way the planar groups
(peptide groups plus PHE, ARG etc.\ sidechains) are treated. % I believe
prolsq restrains to the plane calculated from the coordinates before
each refinement cycle, whereas \cprog{restrain} restrains to the current
best plane; this should allow the planes more flexibility of movement.
\end{itemize}
\section{Maximum Likelihood refinement}\label{refinement:maxlikelihood}
The \idx{maximum likelihood} approach to model refinement has been
implemented in the \ccp4 program \cprog{refmac}. The \cprog{refmac}
program is used for the restrained or unrestrained refinement or
idealisation of a macromolecular structure. It adjusts the
model parameters to minimise a Maximum Likelihood residual. There
are options to use different minimisation methods. \cprog{refmac}
produces an MTZ output file containing \cprog{sigmaa}-style
coefficients suitable for the calculation of $mF_\mr{o}-DF_\mr{c}$ and
$2mF_\mr{o}-DF_\mr{c}$ maps using \cprog{fft}.
The latest version of REFMAC is significantly different from earlier
versions and is known as ``\cprog{refmac5}''. The key functionalities
of \cprog{refmac5} are:
\begin{itemize}
\item
Restraints are calculated within the main program. A large
dictionary of standard geometries is included. Restraints can also
be created for novel ligands; see in particular the Monomer Library
Sketcher in \cprog{ccp4i} (which is an interface to the
\cprog{libcheck} program).
\item
\idx{TLS refinement} can be used. This is particularly useful when there
is significant anisotropy, but the resolution does not warrant
refinement of individual anisotropic displacement parameters
(U values).
\item
A \idx{bulk solvent correction} is calculated within the program, see
the SOLVENT keyword.
\item
For atomic resolution data, full anisotropic refinement can be performed
with anisotropic displacement parameters being refined for some or all
atoms.
\item
If good experimental phases are available then they can be included in
the maximum likelihood target. The accuracy of experimental phases, as
described by the Figure of Merit or the Hendrickson--Lattman coefficients,
is often overestimated, and a blurring function is provided to
compensate for this.
\item
Rigid-body refinement can be performed, and may be useful in the early
stages of refinement. One or more rigid-body domains can be defined via
the RIGIDBODY keyword.
\end{itemize}
\subsection{TLS refinement in \texttt{refmac5}}
\label{refinement:TLS}
TLS refinement in \cprog{refmac5} results in:
\begin{itemize}
\item
TLS parameters for each defined TLS group, held in the {\tt TLSOUT}
file and in the header of the {\tt XYZOUT} file.
\item
Residual B factors output in the ATOM lines of the {\tt XYZOUT} file.
These B factors do not include any contribution from the TLS parameters.
\end{itemize}
The {\tt XYZOUT} and {\tt TLSOUT} files can be passed to the program
\cprog{tlsanl}, which will analyse the TLS tensors and also derive
individual anisotropic displacement parameters from the TLS parameters.
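As a sketch of what this derivation involves (the notation below follows the
usual TLS formalism and is an assumption here, not taken from the
\cprog{tlsanl} documentation), the anisotropic displacement tensor of an atom
at position $\mathbf{r} = (x, y, z)$ relative to the origin of its TLS group
can be written
\[
U^\mathrm{TLS} = T + A L A^\mathrm{T} + A S + S^\mathrm{T} A^\mathrm{T},
\qquad
A = \left(\begin{array}{ccc}
 0 &  z & -y \\
-z &  0 &  x \\
 y & -x &  0
\end{array}\right),
\]
where $T$, $L$ and $S$ are the translation, libration and screw-rotation
tensors of the group. $T$ and $L$ are symmetric (6 parameters each) and $S$
contributes 8 independent parameters because its trace is indeterminate,
giving 20 parameters per group compared with 6 per atom for unconstrained
anisotropic refinement. The equivalent isotropic contribution is
$B^\mathrm{TLS} = 8\pi^2\,\mathrm{tr}(U^\mathrm{TLS})/3$, which would be
added to the residual $B$ factor from the {\tt XYZOUT} file to obtain a
total $B$ value.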
\section{Automated model building}\label{refinement:arpwarp}
Victor Lamzin's Automated Refinement Procedure program (renamed
\prog{arp\_warp} due to a clash with a Unix command) can be alternated
with a refinement program such as \cprog{refmac5} to automatically
build or rebuild parts of a model. \prog{arp\_warp} updates the model
by identifying and removing poorly defined atoms and adding new
atoms. Rejection of atoms is carried out on the basis of the density
interpolated at the atomic centre, the deviation of the density
shape from sphericity and some distance criteria. Addition of atoms
is performed on the basis of difference density coupled with
distance constraints.
CCP4 distributes an older version of \prog{arp\_warp} (version 5.0)
which has been renamed as \cprog{arp\_waters}, and which should only
be used for adding waters while cycling with \cprog{refmac5}.
\section{Difference map generation}
On completing a round of refinement, various types of \idx{difference map}
can be generated with the program \cprog{fft} for comparison with the
current model (e.g.\ using the graphics program \prog{O}). \cprog{refmac} produces
weighted map coefficients suitable for
$mF_\mr{o}-DF_\mr{c}$ and $2mF_\mr{o}-DF_\mr{c}$ maps: these coefficients
reduce model bias and are recommended over the unweighted
$F_\mr{o}-F_\mr{c}$ and $2F_\mr{o}-F_\mr{c}$ maps. To obtain weighted
coefficients from the output of \cprog{restrain}, the program
\cprog{sigmaa} can be used.
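In outline, and using symbols that are assumed here rather than defined in
this chapter ($m$ for the figure of merit of the model phase and $D$ for the
scale factor that allows for errors in the model, both estimated in the
\cprog{sigmaa} manner), the two weighted maps are Fourier syntheses with
coefficients
\[
(2m|F_\mathrm{o}| - D|F_\mathrm{c}|)\exp(i\varphi_\mathrm{c})
\quad\mbox{and}\quad
(m|F_\mathrm{o}| - D|F_\mathrm{c}|)\exp(i\varphi_\mathrm{c}),
\]
where $\varphi_\mathrm{c}$ is the phase calculated from the current model.
These expressions apply to acentric reflexions; centric reflexions are
usually treated slightly differently. Setting $m = D = 1$ recovers the
unweighted $2F_\mathrm{o}-F_\mathrm{c}$ and $F_\mathrm{o}-F_\mathrm{c}$
coefficients, which shows that the weighting acts by down-weighting poorly
phased reflexions and scaling down the model contribution.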
\begin{figure}[H]
\begin{maxipage}
\mbox{\psfig{file=diff-map.eps,width=\textwidth}}
\caption{Steps in generating an electron density difference map.}
\label{fig:diff-map}
\end{maxipage}
\inviscprog{restrain} \inviscprog{fft} \inviscprog{sigmaa}
\inviscprog{extend} \inviscprog{mapmask} \inviscprog{refmac}
\inviscprog{xdlmapman} \invisprog{O}
\end{figure}
\section{Why is protein refinement difficult?}
Small-molecule crystallographers manage analysis and refinement with very few problems.
Macromolecular crystals, by contrast, present several particular problems in refinement.
\begin{itemize}
\item For macromolecular crystals, the unit cell is big, and there are a very
large number of X-ray data to collect, all of which have low signal-to-noise
ratio. It is therefore not usually possible to collect data to atomic
resolution as is normal for small molecule structures. The data available
often suffer from both systematic and random errors. These are due to the
crystal size, problems of mounting, absorption and crystal decay.
\item Protein crystals have an additional problem. There is usually a high
solvent content, and the crystal forces are weak. Some parts of the chain
may not be crystalline at all, and others may have high thermal motion. This
means that not all the unit cell can be properly parameterised. This is true
for almost all proteins, not just those which diffract to lower resolution.
This problem particularly reduces the intensity of the high resolution data.
In addition it leads to severe effects of radiation damage.
\item These two problems mean that experimental data extend to limited
resolution, typically to a maximum limit in the range 3--2\,\AA.
\item This means that the ratio of observations to parameters to be fitted is
too low for conventional least-squares minimisation to converge.
\end{itemize}
\section{Free $R$ factor}\index{free R factor@free $R$ factor}
A \introduce{free $R$ factor} may be calculated by excluding a
randomly-chosen fraction of reflexions from the refinement
\cite{brunger-92}---a special case of the technique of \idx{cross-validation}
\cite{brunger-95}. Since these reflexions play no part in the refinement,
the agreement between their $F_P$ and $F_c$
is independent of the refinement procedure. \cprog{freerflag} may be
used to add a column of tags to an MTZ file to label this set of
reflexions. This is also included in the \cprog{uniqueify} script,
which should be run on a dataset as soon as possible (see
\cprog{unique} documentation). Note the \ccp4 convention
for this differs from \prog{X-PLOR}---see the \cprog{freerflag} documentation.
\cprog{f2mtz} and \cprog{mtz2various} may be used respectively to import
and export \prog{X-PLOR} and \prog{SHELX} datasets with a free $R$ flag
to \ccp4 taking into account the different conventions.
Since the deviation in $R_\mathrm{free}$
is roughly proportional to its value divided by the square root of the number
of reflections, a test set of about 1000 reflections should be acceptable.
Both $R_\mathrm{cryst}$ and $R_\mathrm{free}$ are global measures which cannot
detect local errors. If atoms are placed in correct positions the $R$ factor
will decrease even if they are chemically inappropriate. NCS will reduce the
value of $R_\mathrm{free}$ and different types of NCS will have different
effects. Any pseudo lattice where the NCS does not increase the reciprocal
space sampling cannot easily be utilised for refinement.
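To make the quantities discussed in this section concrete, the following
uses the conventional definition of the $R$ factor; the scale factor $k$,
the test-set size $n$ and the numerical values are illustrative assumptions.
With the sum taken over the working set for $R_\mathrm{cryst}$ and over the
test set for $R_\mathrm{free}$,
\[
R = \frac{\sum_{hkl} \left|\, |F_\mathrm{o}| - k|F_\mathrm{c}| \,\right|}
         {\sum_{hkl} |F_\mathrm{o}|}.
\]
If the constant of proportionality in the estimate of the deviation is taken
to be about one, then
\[
\sigma(R_\mathrm{free}) \approx \frac{R_\mathrm{free}}{\sqrt{n}},
\]
so that for $R_\mathrm{free} = 0.25$ and $n = 1000$ test reflexions
$\sigma(R_\mathrm{free}) \approx 0.25/31.6 \approx 0.008$. Quadrupling the
test set to 4000 reflexions would only halve this figure.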
\section{Validation, gross and overall errors}
\subsection{Validation}
A lot of validation requires common sense. For example a good structure will
not have most of its torsion angles in strange parts of the \idx{Ramachandran
plot}. Other obvious, but very useful, checks include:
\begin{itemize}
\item Are there unacceptable symmetry contacts between adjacent molecules?
\item Are the $B$ factors sensible, e.g.\ higher at the surface than in the
core, not wildly divergent between adjacent atoms?
\item Does the chemistry make sense? e.g.: do the H-bondable groups actually
make H-bonds? Are there charged groups buried in hydrophobic environments?
\item Do the maps show the expected features, e.g.: do omit maps reveal the
missing atoms; do difference maps show substrate atoms?
\item Is there a suspicious divergence from NCS\@?
\end{itemize}
Most programs flag many of the above.
\subsection{Errors}
\begin{description}
\item[Serious mistracing is rare but can happen.] The existing tools
($R_\mathrm{free}$, stereochemical checks as applied in \cprog{procheck} and
\prog{what-if}) easily detect such errors if applied sensibly. Actually the
\idx{Ramachandran plot} alone is a powerful tool for cross validation to
identify such gross errors since the dihedral angles are not usually used as
restraints in refinement programs.
\item[Local errors] such as loops out of register can easily be overlooked or
ignored. They can be identified by the real space $R$ factor and $B$ values
as a function of residue number.
\item[Overall imprecision] and refinement not taken to convergence is
difficult to detect but happens---there are many examples of structures
refined first against one data set to a certain resolution, then re-refined
against a different higher resolution data set. The final structure gives a
lower $R$ factor against the original data than the one refined against that
data. This may be due to poor weighting of prior and experimental
information.
\end{description}
\subsection{Bad practice}
\subsubsection{Not using all the available data}
\begin{itemize}
\item Do not use a low resolution cut-off. Many structures are still
reported as being refined with data in the resolution range 5--2\,\AA, yet the
data within the 5\,\AA\ shell contain a wealth of important information on
your structure.
\item Make sure you have not lost all the strong (often low resolution) terms
through detector saturation---especially important with image plates at
synchrotrons. Make a second data collection pass, or even a third, to avoid
this. The \emph{big} terms dominate all steps in your structure analysis.
\item If you can possibly avoid it do not leave a large wedge of data
uncollected. Offset your crystal by up to $15^\circ$ to avoid a blind
region. Make sure you cover the appropriate rotation range, and start at an
appropriate orientation.
\end{itemize}
\subsubsection{Attempting refinements when the observation to parameter ratio
is too low}
\begin{itemize}
\item At about 2.8\,\AA\ for a protein crystal with about 50\% solvent, the
number of observations is equal to the number of positional ($xyz$) atomic
parameters. Even at this resolution the least-squares minimum is no longer
well defined. Unless there is non-crystallographic symmetry (NCS) or extremely
high solvent content, it is therefore \emph{foolish} to ``refine'' against
data sets at resolutions worse than 2.8\,\AA\@. If there is NCS this limit can be
relaxed with care, always ensuring the number of parameters is less than the
number of observations: this absolutely requires the NCS to be imposed. Caveat: if
your NCS is close to pseudo-crystallographic symmetry (e.g.\ $P2_12_12$ but
pseudo $I222$, or $P6_5$ but pseudo $P6_522$), then it is less powerful and you will
have special problems.
\item Do not try to refine individual isotropic atomic $B$ values unless you
have data extending to about 2.5\,\AA\ or better.
\item The significance of introducing extra parameters should be cross
validated using $R_\mathrm{free}$.
\end{itemize}
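The resolution limits quoted above can be rationalised with a rough count.
The numbers used below (a volume of about 34\,\AA$^3$ of unit cell per
non-hydrogen protein atom at 50\% solvent, from roughly
2.4--2.5\,\AA$^3$ per dalton and about 14 daltons per atom) are assumed
typical values, not taken from any particular structure. By Friedel's law,
the number of unique reflexions to resolution $d$ from an asymmetric unit of
volume $V$ is approximately half the number of reciprocal lattice points in
a sphere of radius $1/d$:
\[
N_\mathrm{obs} \approx \frac{1}{2}\cdot\frac{4\pi}{3}\cdot\frac{V}{d^3}
               = \frac{2\pi V}{3 d^3}.
\]
With $V \approx 34\,N_\mathrm{atom}$\,\AA$^3$ and $p$ parameters per atom,
\[
\frac{N_\mathrm{obs}}{N_\mathrm{par}}
  \approx \frac{2\pi \times 34}{3\,p\,d^3}
  \approx \frac{71}{p\,d^3},
\]
with $d$ in \AA. For $p = 3$ ($xyz$ only) the ratio reaches one at
$d \approx 2.9$\,\AA, and for $p = 4$ ($xyz$ plus an isotropic $B$) at
$d \approx 2.6$\,\AA, in reasonable agreement with the limits quoted above.
At 2\,\AA\ with $p = 4$ the ratio is still only about 2.2.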
% Local Variables:
% mode: latex
% TeX-master: "manual"
% End: