% Revision: 1.17, Wed Nov 16 15:46:42 2005 UTC, by mdw
% Log: Ian Tickle says this is not true (bugzilla 451) so remove it.

\chapter{Refinement and validation}

\begin{chapquote}
  {\em I} don't believe there's an atom of meaning in it.\\
  --- {\em Alice's Adventures in Wonderland}
\end{chapquote}

Refinement has been covered in three Study Weekends
\cite{study-w/e-80,study-w/e-89,study-w/e-96};
see also \cite{tronrud-94}.  The principal \ccp4 refinement
program is now \cprog{refmac}, which is under active
development. Information in this chapter is therefore liable to become
out of date, and readers should consult the latest
\cprog{refmac} documentation.

\section{Least squares structure
  refinement}\index{refinement!least-squares}
  \index{refinement!Hendrickson--Konnert}

The \ccp4 suite provides the program \cprog{restrain} for
least-squares structure refinement. The old program prolsq has
been made obsolete by the Maximum Likelihood program \cprog{refmac}
(see section~\ref{refinement:maxlikelihood}).
The least-squares program \cprog{restrain} differs from \cprog{refmac}
in a number of ways (see section~\ref{refinement:comparison}) and may
still be useful in some cases.

Alternative refinement programs include \prog{TNT}, \prog{CNS} and
\prog{SHELX93}. 

\subsection{Comparison of \texttt{restrain} and \texttt{prolsq}}
\label{refinement:comparison}

% from Ian Tickle
Here is some discussion of the differences between the least squares refinement
program \cprog{restrain} and the obsolete prolsq. Most of these points also
apply to a comparison of \cprog{restrain} and \cprog{refmac}.

\cprog{restrain} does constrained anisotropic thermal parameter refinement
using the TLS\index{TLS refinement} (translation/libration/screw-rotation) 
model. Unconstrained anisotropic refinement is not feasible without atomic resolution
data (i.e.\ 1\,\AA\ or better), so this is out of the question for all but a
handful of very small proteins.  The end result of the TLS analysis should
give some insight into secondary structure or domain motions.
\cprog{refmac5} now does TLS refinement as well.

A major difference of approach is that prolsq uses an FFT for
structure factor and derivative calculations, whereas \cprog{restrain} uses
slow FTs.  This means that prolsq takes many cycles ($\sim$50) to
converge but each cycle is very fast, whereas \cprog{restrain} takes only a
few cycles ($\sim$5), but each one is much slower.  Normally prolsq
has the advantage in overall running time.

\cprog{restrain}'s functionality is much like prolsq's, with some small differences:
\begin{itemize}
\item prolsq treats all main-chain peptide residues as though they had
  identical geometry; recent data indicate that glycine and proline are
  different from the others.
  % (anyone with a basic knowledge of chemistry would think that would be
  % obvious!).
  \cprog{restrain} treats them differently.

\item \cprog{restrain} uses individual distance restraint weights based on
  the estimates of the standard deviations of \cite{engh/huber-91} (these
  values are all in the dictionary together with the ideal distances).
  prolsq uses blanket values for the weights, because its dictionary
  doesn't contain the s.d.s.
%  What precisely is the effect of this though, I'm not sure.

\item \cprog{restrain} has a coupled occupancy refinement option for
  disordered sidechains.%; I don't think prolsq does this.

\item \cprog{restrain} has a full-matrix option for estimating individual
  positional standard deviations.  However it requires a \emph{lot} of memory,
  and at present needs to be compiled with a re-parameterised include file.
% Sometime I'll incorporate your dynamic memory routines to make this more
% user-friendly.

\item
  %I think
  prolsq has the option of applying non-bonded intermolecular
  repulsion restraints (i.e.\ between symmetry-related molecules) as well as
  the intramolecular ones.  At present \cprog{restrain} only applies the
  intramolecular repulsions.

\item %I think
  There is a subtle difference in the way the planar groups 
  (peptide groups plus PHE, ARG etc.\ sidechains) are treated.  % I believe
  prolsq restrains to the plane calculated from the coordinates before
  each refinement cycle, whereas \cprog{restrain} restrains to the current
  best plane; this should allow the planes more flexibility of movement.
\end{itemize}
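
The effect of individual versus blanket weights can be illustrated
with a generic restrained least-squares target.  This is only a sketch
in assumed notation (the symbols $w_\mathbf{h}$, $d_j$ and $\sigma_j$
are introduced here for illustration), not the exact function
minimised by either program:
\[
  M = \sum_{\mathbf{h}} w_\mathbf{h}
      \left( |F_\mathrm{o}(\mathbf{h})| - |F_\mathrm{c}(\mathbf{h})| \right)^2
    + \sum_{j} \frac{1}{\sigma_j^2}
      \left( d_j^\mathrm{ideal} - d_j^\mathrm{model} \right)^2 .
\]
With an individual $\sigma_j$ for each distance, a bond whose ideal
length is well determined is held more tightly than one that is poorly
determined; a blanket value corresponds to setting all the $\sigma_j$
within a class of restraints equal.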

\section{Maximum Likelihood refinement}\label{refinement:maxlikelihood}

The \idx{maximum likelihood} approach to model refinement has been
implemented in the \ccp4 program \cprog{refmac}.  The \cprog{refmac}
program is used for the restrained or unrestrained refinement or
idealisation of a macromolecular structure.  It adjusts the
model parameters to minimise a Maximum Likelihood residual. There
are options to use different minimisation methods.  \cprog{refmac}
produces an MTZ output file containing \cprog{sigmaa}-style
coefficients suitable for the calculation of $mF_\mr{o}-DF_\mr{c}$ and
$2mF_\mr{o}-DF_\mr{c}$ maps using \cprog{fft}.

The latest version of REFMAC is significantly different to earlier
versions and is known as ``\cprog{refmac5}''. The key functionalities
of \cprog{refmac5} are:
\begin{itemize}
\item
  Restraints are calculated within the main program. A large 
  dictionary of standard geometries is included. Restraints can also
  be created for novel ligands; see in particular the Monomer Library
  Sketcher in \cprog{ccp4i} (which is an interface to the
  \cprog{libcheck} program).
\item
  \idx{TLS refinement} can be used. This is particularly useful when there
  is significant anisotropy, but the resolution does not warrant
  refinement of individual anisotropic displacement parameters
  (U values).
\item
  A \idx{bulk solvent correction} is calculated within the program, see
  the SOLVENT keyword.
\item
  For atomic resolution data, full anisotropic refinement can be performed
  with anisotropic displacement parameters being refined for some or all
  atoms.
\item
  If good experimental phases are available then they can be included in
  the maximum likelihood target. The accuracy of experimental phases, as
  described by the Figure of Merit or the Hendrickson--Lattman coefficients,
  is often overestimated, and a blurring function is provided to
  compensate for this.
\item
  Rigid-body refinement can be performed, and may be useful in the early
  stages of refinement. One or more rigid-body domains can be defined via 
  the RIGIDBODY keyword.
\end{itemize}

\subsection{TLS refinement in \texttt{refmac5}}
\label{refinement:TLS}

TLS refinement in \cprog{refmac5} results in:
\begin{itemize}
\item
  TLS parameters for each defined TLS group, held in the {\tt TLSOUT}
  file and in the header of the {\tt XYZOUT} file.
\item
  Residual B factors output in the ATOM lines of the {\tt XYZOUT} file.
  These B factors do not include any contribution from the TLS parameters.
\end{itemize}
The {\tt XYZOUT} and {\tt TLSOUT} files can be passed to the program
\cprog{tlsanl}, which will analyse the TLS tensors and also derive
individual anisotropic displacement parameters from the TLS parameters.
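
For orientation, the standard relation between the TLS tensors of a
rigid group and the anisotropic displacement tensor $U$ of an atom in
that group can be sketched as follows.  The convention assumed here
(atomic position $(x,y,z)$ measured from the TLS origin in orthogonal
axes, with $A$ as defined below) is one commonly used form; the
\cprog{tlsanl} documentation should be consulted for the conventions
actually used by the program.
\[
  U = T + A L A^{T} + A S + S^{T} A^{T}, \qquad
  A = \left( \begin{array}{ccc}
        0  &  z & -y \\
       -z  &  0 &  x \\
        y  & -x &  0
      \end{array} \right) .
\]
The equivalent isotropic value is
$B_\mathrm{TLS} = 8\pi^2\,\mathrm{tr}(U)/3$, and the total displacement
of an atom is the sum of this TLS contribution and the residual
$B$ factor held in the ATOM lines.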

\section{Automated model building}\label{refinement:arpwarp}

Victor Lamzin's Automated Refinement Procedure program (renamed
\prog{arp\_warp} due to a clash with a Unix command) can be alternated
with a refinement program such as \cprog{refmac5} to automatically
build or rebuild parts of a model. \prog{arp\_warp} updates the model
by identifying and removing poorly defined atoms and adding new
atoms. Rejection of atoms is carried out on the basis of the density
interpolated at the atomic centre, the deviation of the density
shape from sphericity and some distance criteria. Addition of atoms
is performed on the basis of difference density coupled with
distance constraints.

CCP4 distributes an older version of \prog{arp\_warp} (version 5.0)
which has been renamed as \cprog{arp\_waters}, and which should only
be used for adding waters while cycling with \cprog{refmac5}.

\section{Difference map generation}

On completing a round of refinement, various types of \idx{difference map}
can be generated with the program \cprog{fft} for comparison with the 
current model (e.g.\ using the graphics program \prog{O}). \cprog{refmac} produces
weighted map coefficients suitable for 
$mF_\mr{o}-DF_\mr{c}$ and $2mF_\mr{o}-DF_\mr{c}$ maps: these coefficients
reduce model bias and are recommended over the unweighted
$F_\mr{o}-F_\mr{c}$ and $2F_\mr{o}-F_\mr{c}$ maps. To obtain weighted
coefficients from the output of \cprog{restrain}, the program
\cprog{sigmaa} can be used.
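
In the usual notation (assumed here: $m$ is the figure of merit of the
model phase, $D$ the \cprog{sigmaa}-style scale factor applied to
$F_\mathrm{c}$, and $\varphi_\mathrm{c}$ the calculated phase), the
weighted coefficients for acentric reflections take the form
\[
  \left( 2m|F_\mathrm{o}| - D|F_\mathrm{c}| \right) e^{i\varphi_\mathrm{c}}
  \qquad \mbox{and} \qquad
  \left( m|F_\mathrm{o}| - D|F_\mathrm{c}| \right) e^{i\varphi_\mathrm{c}} .
\]
Setting $m = D = 1$ recovers the unweighted
$2F_\mathrm{o}-F_\mathrm{c}$ and $F_\mathrm{o}-F_\mathrm{c}$
coefficients.  This is a sketch of the general form only; centric
reflections are treated differently.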

\begin{figure}[H]
  \begin{maxipage}
  \mbox{\psfig{file=diff-map.eps,width=\textwidth}}
  \caption{Steps in generating an electron density difference map.}
  \label{fig:diff-map}
  \end{maxipage}
  \inviscprog{restrain}  \inviscprog{fft}     \inviscprog{sigmaa}
  \inviscprog{extend}    \inviscprog{mapmask} \inviscprog{refmac}
  \inviscprog{xdlmapman} \invisprog{O} 
\end{figure}

\section{Why is protein refinement difficult?}

Small-molecule crystallographers manage analysis and refinement with very few
problems.  Macromolecular crystals, by contrast, present several particular
problems in refinement.

\begin{itemize}
\item For macromolecular crystals, the unit cell is big, and there are a very
  large number of X-ray data to collect, all of which have low signal-to-noise
  ratio.  It is therefore not usually possible to collect data to atomic
  resolution as is normal for small molecule structures. The data available
  often suffer from both systematic and random errors. These are due to the
  crystal size, problems of mounting, absorption and crystal decay.
\item Protein crystals have an additional problem. There is usually a high
  solvent content, and the crystal forces are weak. Some parts of the chain
  may not be crystalline at all, and others may have high thermal motion. This
  means that not all the unit cell can be properly parameterised. This is true
  for almost all proteins, not just those which diffract to lower resolution.
  This problem particularly reduces the intensity of the high resolution data.
  In addition it leads to severe effects of radiation damage.
\item These two problems mean that experimental data extend to limited
  resolution, typically to a maximum limit in the range 3--2\,\AA.
\item This means that the ratio of observations to parameters to be fitted is
  too low for conventional least-squares minimisation to converge.
\end{itemize}

\section{Free $R$ factor}\index{free R factor@free $R$ factor}

A \introduce{free $R$ factor} may be calculated by excluding a
randomly-chosen fraction of reflections from the refinement
\cite{brunger-92}---a special case of the technique of \idx{cross-validation}
\cite{brunger-95}.  Because these reflections play no part in the
refinement, the agreement between their $F_P$ and $F_c$
is independent of the refinement procedure.  \cprog{freerflag} may be
used to add a column of tags to an MTZ file to label this set of
reflections. This is also included in the \cprog{uniqueify} script,
which should be run on a dataset as soon as possible (see
\cprog{unique} documentation). Note the \ccp4 convention
for this differs from \prog{X-PLOR}---see the \cprog{freerflag} documentation.
\cprog{f2mtz} and \cprog{mtz2various} may be used respectively to import 
and export \prog{X-PLOR} and \prog{SHELX} datasets with a free $R$ flag 
to \ccp4 taking into account the different conventions.  
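
For reference, the $R$ factor in its conventional form is
\[
  R = \frac{\sum_{\mathbf{h}}
        \bigl|\, |F_P(\mathbf{h})| - k\,|F_c(\mathbf{h})| \,\bigr|}
           {\sum_{\mathbf{h}} |F_P(\mathbf{h})|} ,
\]
where $k$ is a scale factor (a symbol introduced here for
illustration).  Summing over the reflections used in refinement gives
$R_\mathrm{cryst}$; summing over the excluded test set gives
$R_\mathrm{free}$.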

Since the standard deviation of $R_\mathrm{free}$
is roughly proportional to its value divided by the square root of the number
of test reflections, a test set of about 1000 reflections should be acceptable.
Both $R_\mathrm{cryst}$ and $R_\mathrm{free}$ are global measures which cannot
detect local errors. If atoms are placed in correct positions the $R$ factor
will decrease even if they are chemically inappropriate.  NCS will reduce the
value of $R_\mathrm{free}$ and different types of NCS will have different
effects. Any pseudo lattice where the NCS does not increase the reciprocal
space sampling cannot easily be utilised for refinement.
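
As a rough worked example of the dependence of the uncertainty in
$R_\mathrm{free}$ on the size of the test set, take the constant of
proportionality to be one and $R_\mathrm{free} = 0.25$ (both
assumptions made here purely for illustration).  For a test set of
$N = 1000$ reflections,
\[
  \sigma(R_\mathrm{free}) \approx \frac{R_\mathrm{free}}{\sqrt{N}}
    = \frac{0.25}{\sqrt{1000}} \approx 0.008 ,
\]
whereas a test set of only 100 reflections would give about $0.025$,
which is comparable to the differences in $R_\mathrm{free}$ often used
to judge between refinement protocols.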

\section{Validation, gross and overall errors}
\subsection{Validation}

A lot of validation requires common sense.  For example a good structure will
not have most of its torsion angles in strange parts of the \idx{Ramachandran
  plot}.  Other obvious, but very useful, checks include:
\begin{itemize}
\item Are there unacceptable symmetry contacts between adjacent molecules?
\item Are the $B$ factors sensible, e.g.\ higher at the surface than in the
  core, not wildly divergent between adjacent atoms?
\item Does the chemistry make sense? e.g.: do the H-bondable groups actually
  make H-bonds? Are there charged groups buried in hydrophobic environments?
\item Do the maps show the expected features, e.g.: do omit maps reveal the
  missing atoms; do difference maps show substrate atoms?
\item Is there a suspicious divergence from NCS\@?
\end{itemize}
Most programs flag many of the above.

\subsection{Errors}

\begin{description}
\item[Serious mistracing is rare but can happen.]  The existing tools
  ($R_\mathrm{free}$, stereochemical checks as applied in \cprog{procheck} and
  \prog{what-if}) easily detect such errors if applied sensibly.  Actually the
  \idx{Ramachandran plot} alone is a powerful tool for cross validation to
  identify such gross errors since the dihedral angles are not usually used as
  restraints in refinement programs.
\item[Local errors] such as loops out of register can easily be overlooked or
  ignored.  They can be identified by the real space $R$ factor and $B$ values
  as a function of residue number.
\item[Overall imprecision] and refinement not taken to convergence is
  difficult to detect but happens---there are many examples of structures
  refined first against one data set to a certain resolution, then re-refined
  against a different higher resolution data set.  The final structure gives a
  lower $R$ factor against the original data than the one refined against that
  data.  This may be due to poor weighting of prior and experimental
  information.
\end{description}

\subsection{Bad practice}

\subsubsection{Not using all the available data}
\begin{itemize}
\item Do not use a low resolution cut-off, e.g.\ many structures are still
  reported as being refined with data in the resolution range 5--2\,\AA.  The
  data within the 5\,\AA\ shell contain a wealth of important information on
  your structure.
\item Make sure you have not lost all the strong (often low resolution) terms
  through detector saturation---especially important with image plates at
  synchrotrons.  Make a second data collection pass, or even a third, to avoid
  this. The \emph{big} terms dominate all steps in your structure analysis.
\item If you can possibly avoid it do not leave a large wedge of data
  uncollected. Offset your crystal by up to $15^\circ$ to avoid a blind
  region.  Make sure you cover the appropriate rotation range, and start at an
  appropriate orientation.
\end{itemize}

\subsubsection{Attempting refinements when the observation to parameter ratio
  is too low}
\begin{itemize}
\item At about 2.8\,\AA\ for a protein crystal with about 50\% solvent, the
  number of observations is equal to the number of positional ($xyz$) atomic
  parameters.  Even at this resolution the least-squares minimum is no longer
  well defined.  Unless there is non-crystallographic symmetry (or extremely
  high solvent content) it is therefore \emph{foolish} to ``refine'' against
  data sets at resolutions worse than 2.8\,\AA\@. If there is NCS this limit can be
  relaxed with care, always ensuring the number of parameters is less than the
  number of observations: this absolutely requires the NCS to be imposed.  Caveat: if
  your NCS is close to pseudo-crystallographic symmetry (e.g.\ $P2_12_12$ but
  pseudo $I222$, or $P6_5$ but pseudo $P6_522$), then it is less powerful and you will
  have special problems.
\item Do not try to refine individual isotropic atomic $B$ values until you
  have data to about 2.5\,\AA\ or better.
\item The significance of introducing extra parameters should be cross
  validated using $R_\mathrm{free}$.
\end{itemize}
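
The figure of about 2.8\,\AA\ can be rationalised by a rough count.
The sketch below assumes a primitive lattice with Friedel pairs
merged, the Matthews relation between solvent fraction and $V_M$
(solvent fraction $\approx 1 - 1.23/V_M$), and a mean mass of about
14\,Da per non-hydrogen atom; these numerical values are assumptions
introduced here for illustration.  For an asymmetric unit of volume
$V_a$ containing $N_\mathrm{atom}$ atoms, with data to a resolution
limit $d$,
\[
  N_\mathrm{obs} \approx \frac{2\pi}{3}\,\frac{V_a}{d^3}, \qquad
  V_a \approx 14\,V_M\,N_\mathrm{atom}
      \approx 34\,N_\mathrm{atom}\ \mbox{\AA}^3
  \quad \mbox{for } V_M \approx 2.46\ \mbox{\AA}^3/\mathrm{Da}
  \mbox{ (50\% solvent)},
\]
so that the ratio of observations to positional parameters is
\[
  \frac{N_\mathrm{obs}}{3N_\mathrm{atom}}
    \approx \frac{2\pi \times 34}{9\,d^3}
    \approx \frac{24}{d^3}
\]
with $d$ in \AA.  The ratio is about 1.1 at $d = 2.8$\,\AA\ and rises
to about 3 at $d = 2.0$\,\AA; adding one isotropic $B$ value per atom
reduces it by a factor of $3/4$.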

% Local Variables: 
% mode: latex
% TeX-master: "manual"
% End: 
