
Report from the Joint CCP4/EBI Software Developers and Data Harvesting Workshop

Kim Henrick and Eleanor Dodson

This is a report on the Protein Crystallographic Software Developers Workshop for the use of Standard Library Routines with CCP4 and EBI Data Harvesting, which was held from 16th to 19th September 1998 at the European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK.

The list of participants can be found at http:/ This is a slightly edited version of the report originally produced by Kim and Eleanor.


The workshop was held to discuss Data Harvesting of the detail required to characterise macromolecular structure determination and description in a consistent manner (see the background notes in a previous newsletter article). The introduction to the CCP4 subroutine libraries and data format files was presented as an example of standards that allow the free flow of data between the major structure determination packages, including CNS, O, and CCP4, as well as for deposition. CCP4 also presented its current ideas for using mmCIF as a working coordinate file format during a structure determination.

The workshop outlined current deposition practice, the mandatory and voluntary requirements for deposition, and the use of mmCIF tokens to define this information, and included a demonstration of a future deposition tool. Discussions were also held on realistic and achievable methods for implementing data harvesting in particular software packages.


The discussions are broken down into seven topics:

  1. Harvesting
  2. Deposition Tools
  3. CCP4 Library Routines
  4. mmCIF Coordinate Files
  5. External Reference Files (e.g. refinement dictionaries)
  6. Validation

Each of the discussions on the various topics is summarised below in a separate section.

1. Harvesting

Kim Henrick gave a general introduction to the harvest concept (see some of the slides at ). There was total acceptance that programs should produce the required information in a recognisable style, and it was agreed that every deposition should have associated in-house tags that identify the Project_Name and Data_Set_Name. The Project_Name is the working equivalent of what will become a PDB IDcode (or in mmCIF terms the ), and the Data_Set_Name is the particular data set (either X-ray diffraction structure factors or NMR experimentally determined data) that is associated with the deposited refined coordinates ( in mmCIF). Each run of a particular program would then produce a snapshot harvest file containing the relevant information known to that program as mmCIF tag and value pairs. In addition each file requires a date stamp, _audit.creation_date, using the CIF definition, CIF-datetime.

The final set of these files would then be deposited and the central archive sites would merge and process this information automatically.
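
As a rough illustration of the snapshot idea described above, the following Python sketch writes a set of tag and value pairs as a single CIF data block and stamps it with _audit.creation_date. The function names and the two placeholder tags used for Project_Name and Data_Set_Name are assumptions made for this example only. They are not the CCP4 or CNS implementation, and the real mmCIF items should be taken from the dictionary.

```python
from datetime import date


def cif_value(value):
    """Quote a value for a CIF tag/value pair when it contains whitespace.

    Sketch only: values that themselves contain quote characters, or that
    start with reserved characters, are not handled here.
    """
    text = str(value)
    if text == "":
        return "?"  # CIF marker for an unknown value
    if any(ch.isspace() for ch in text):
        return "'" + text + "'"
    return text


def harvest_snapshot(block_name, items, created=None):
    """Return one CIF data block holding the given tag/value pairs.

    `items` maps data names (for example '_cell.length_a') to values.
    The date stamp is written in yyyy-mm-dd form.
    """
    created = created or date.today()
    lines = ["data_" + block_name,
             "_audit.creation_date " + created.isoformat()]
    for tag, value in items.items():
        lines.append(tag + " " + cif_value(value))
    return "\n".join(lines) + "\n"


# Hypothetical placeholder tags: the report does not name the mmCIF items
# that carry Project_Name and Data_Set_Name, so these are NOT dictionary items.
snapshot = harvest_snapshot(
    "example",
    {
        "_example.project_name": "my project",
        "_example.data_set_name": "native1",
        "_cell.length_a": 78.1,  # made-up value
    },
    created=date(1998, 9, 16),
)
print(snapshot)
```

Each program run would produce one such block, so merging at the archive site reduces to collecting the blocks that share the same project and data set identifiers.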

It was acknowledged that all software packages need to encourage users to consistently input the same in-house identifiers to each program by a common set of key words for Project_Name and Data_Set_Name. Data harvesting throughout the course of a structure determination requires a minimal discipline from the researcher.

Programmers are required to follow mmCIF definitions for the items selected as important for deposition. Paula Fitzgerald presented the plan for extending the mmCIF Dictionary with new data items; the plan is set out at a URL (with a mirror for Europe) that outlines the CIF extensions.

The CCP4 prototype system uses an automatic routine to place harvest files from each program run into a fixed directory structure and file naming convention based on:
where $HARVEST is an environment variable and function is the particular run number or run path using a particular software package.

This file naming and placement proposal was discussed and thought to be difficult to maintain because of the nature of the problems encountered when solving a macromolecular structure. In particular, some steps may be carried out at different sites, and at each stage program steps are often re-run and the stages repeated in an iterative cycle. Data sets are commonly discarded as experience is gained and better crystals that give better data are grown. Therefore the above two program instructions are only required for the final run of each program. However, it is not always known at the time of a particular run that it will be the final one, and in the case of a preliminary publication the results are not definitive.

The data harvesting concept is, however, flexible enough to allow each program developer not only to select the data items generated by their own software but also to implement a method that is practical for that program's use.

The CNS developers, Paul Adams and Ralf Grosse-Kunstleve, with John Westbrook, have a procedure to generate deposition information when a user has finished refinement. The CNS method will request the Project_Name and Data_Set_Name when this macro is run and will include sufficient information for the deposition centres to determine the exact chemical nature of all compounds represented in the coordinate set. The CNS mmCIF pdbsubmission macro will be extended to save coordinates, reflections, topology, parameters, and MTF information as a deposit.mmcif file. Heavy atom details will be added to the macro at a later date.

Bart Hazes presented some ideas on how to use harvested data for the purposes of the experimentalists rather than just for the deposition centres. This entails making harvesting attractive to the user by using the harvest file information as a source of in-house archive information. As the harvest files contain the experimental parameters and results of each stage of a structure determination, this information can be used in other ways. A GUI was suggested to read harvest files and to prepare experimental data for later reference, visualisation and additions of non-automatic deposition data items.

Randy Read suggested adding a history of parentage to harvest files to manage versions, in which additional data items would track the stages and versions that lead to each file. Once an experiment was completed, either the in-house site or the deposition centre could then trace the history of files, which offers a mechanism to ``weed'' out, or to select, only those pieces of information that the user considers to be the final versions.
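
The parentage idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not define the data items that would carry the history, so the mapping from each file to its parent, the function names and the file names below are all hypothetical.

```python
def trace_parentage(parents, final_file):
    """Return the chain of files leading to `final_file`, oldest first.

    `parents` maps each harvest file name to the name of the file it was
    derived from, or None for the first file of a chain.
    """
    chain = []
    seen = set()
    current = final_file
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parentage records at " + current)
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    chain.reverse()
    return chain


def discarded_files(parents, final_file):
    """Files that do not lie on the path to the final version."""
    kept = set(trace_parentage(parents, final_file))
    return sorted(name for name in parents if name not in kept)


# Hypothetical file names, for illustration only.
parents = {
    "scale_run1": None,
    "scale_run2": None,
    "phase_run1": "scale_run1",
    "phase_run2": "scale_run2",
    "refine_run1": "phase_run2",
}
print(trace_parentage(parents, "refine_run1"))
print(discarded_files(parents, "refine_run1"))
```

Starting from the file the user declares to be final, everything on the chain is kept and everything else can be weeded out, which matches the selection step described above.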

Herb Bernstein gave a brief outline of the proposal to embed binary information in a CIF. This has a harvest application, in that images collected at an X-ray source can carry Project_Name and Data_Set_Name tags which, along with other diffrn category data items, can be passed into the first step of structure determination and carried along automatically via file headers throughout the whole structure solution process.

(Details of CIF-CBF, the CBF definitions and the CBF software are available on the web, with a mirror for Europe.)

Herb Bernstein also announced that there is now a version of RASMOL that works directly on mmCIF input files. This is encouraging news that more tools are becoming available to process mmCIF files.

Follow-up Note:

Raimond Ravelli has proposed that ESRF beam line software can be modified to generate a first mmCIF harvest file containing a set of _diffrn and _exptl_crystal data items, i.e.


Gerard Kleywegt has proposed harvesting for O/OOPS, to generate files containing:-


Clemens Vonrhein and John Irwin are modifying the SHARP program to harvest heavy atom sites and phasing information.

2. Deposition Tools

Helen Berman presented an overview of the NDB's WWW deposition tool, mmCIFIT, and the philosophy behind its function. A mechanism for the capture of accurate information was demonstrated. The information content of the NDB was divided into primary data collected from the depositor; derived information calculated by the NDB; and qualitative information, shared between annotator and depositor. Some of the problems inherent in producing a smooth running deposition system were discussed: the handling of multiple file formats, strict definition of the meaning of items, and maintaining a high throughput together with data uniformity. John Westbrook introduced the entire integrated data processing system, whose syntax and semantics are defined by mmCIF and in which dictionaries play the key role in driving the software. The current validation features were also presented.

They divide the task into:-

 Pre-deposition checks
  a) Format and nomenclature
  b) validation checks
   Geometry:              Procheck
   Structure factors:     SFCHECK

Complete mmCIF files will be generated after processing is complete. Currently structure factors are expected in mmCIF format only. The system allows for pre-deposition checks before a deposition starts. The possibility of validation reports being made available to referees before an entry is released was raised, but is not yet allowed.

John Westbrook demonstrated the mmCIFIT deposition tool and a discussion followed concerning mandatory data items. The NDB viewed the sub-tokens for the PDB records COMPND and SOURCE as overly complex and in need of simplification. The NDB also intends to simplify the possible HEADER record classes for macromolecular entries. There was a discussion on the JRNL record being mandatory. There was insufficient time to complete the discussion of mandatory items, as the discussion of input via a web tool of the PDB equivalents to REMARK 200 and REMARK 280 revealed how complicated modern crystallographic data collection can be. Capturing all the details that fully define a diffraction experiment, which may involve several crystals and several diffraction experiments, and relating them to the mmCIF identifiers that describe crystallisation, diffraction, data reduction and the final structure factor refln sets, was shown to be very complex.

Eldon Ulrich then presented the BMRB deposition tools and schema for handling NMR experimental data. The BMRB system is a STAR based system that is currently adapted to work with the PDB's AutoDep2.1 procedure. The STAR NMR format was devised to record the multiple sets of samples and experiments that are undertaken by NMR to determine a macromolecular structure, and it revealed how inadequate the current PDB REMARK 210 is for realistically recording details of these procedures. The ZOO project, a Desktop Experiment Management Environment that can be customized and placed on the desks of many scientists to manage their experimental studies, was outlined. This is an in-house database system that can be used to manage all aspects of the determination of an NMR macromolecular structure. ZOO is coupled with the DEVise system, an Environment for Data Exploration and Visualization that allows users to easily develop, browse, and share visual presentations of large tabular datasets from several sources. With these, the BMRB is working towards an intuitive set of querying and visualization primitives that can be combined into visual presentations that integrate data from a wide range of application domains.

Peter Keller presented a short overview of the EBI's development of a deposition tool (part of this presentation is available on the web). This will be based on the experience gained by the PDB in processing real data and will follow the outline of the current AutoDep2.1 process. The internal organisation will use a commercial database management system (ORACLE). The schema, processing flow, administration, version control and web interface will be held in the database schema SQL. The system will also generate web pages unique to individual depositions as required to request missing mandatory information.

3. CCP4 Library Routines

These sessions were addressed to the programmers present at the workshop. Phil Evans described the philosophy behind the CCP4 suite: CCP4 is a set of programs written by individuals, linked by a set of common formats. He devoted some time to describing the reflection data file, with its concept of labelled and typed columns. A detailed description of the libraries is on the web (see CCP4 software documentation), together with library documentation. An extended set of templates for software developers is also available on the web (see CCP4 template fortran files).

Eleanor Dodson gave an overview of the CCP4 symmetry subroutine library, Phil Evans introduced the maplib routines and Martyn Winn discussed in detail the reflection format file and software routines for the MTZ library. Liz Potterton gave a presentation on a new CCP4 GUI (see a previous newsletter article).

Finally Alun Ashton gave an introduction to how to compile and run CCP4 style programs and introduced the keyparse routines.

4. mmCIF Coordinate Files

CCP4 devoted a session to the detailed discussions of mmCIF questions as related to coordinate files for a working format during the stages of structure determination (see (CCP4 CIF notes) and (CCP4 CIF coordinates)).

CCP4 plans to begin using a subset of mmCIF definitions as an atom coordinate format. The advantages of this are:

The mmCIF atom format is much more appropriate for describing the model.
It is more flexible, and easily extensible.
It moves towards the IUCr standard.

The everyday use of this format raises several problems that require restrictions to be put on the mmCIF categories and data items to make them usable within existing programs without major rewriting.

Martyn Winn outlined CCP4 practice/theory.

Within CCP4 programs the atom format (currently PDB) is used to transfer information which is not always strictly atom coordinates. BONES skeletonisation, atomic vectors, and peaks taken from electron density are examples of this.

Martyn is preparing a set of library subroutines to replace RWBROOK, which will read and write the CIF files. He uses Peter Keller's libccif toolbox for the low level parsing (see LIBCCIF).

Initially these will read and write the following categories:

 entity            describes contents of molecule
 entity_poly_seq   gives sequence of polymers within molecule
 cell              cell dimensions
 atom_sites        conversion from fractional to orthogonal coordinates
 symmetry          symmetry information (space group etc.)
 symmetry_equiv    symmetry operators
 struct_asym       content of asymmetric unit (e.g. 2 molecules + 1 Zn)
 struct_conn       any special connectivities (e.g. disulphides,
                   covalently bound sugars etc.)
 atom_site         description of atomic position: name, plus xyz

A set of mmCIF coordinate handling routines has been written and tested. These routines transfer header information from input to output files in a similar way to the MTZLIB procedures. A further set of routines gets and puts coordinates as individual atom_site records, ``row'' by ``row''.

The CCP4 format will impose some restrictions on standard mmCIF loop_ structures. The atom_site records will be formatted. Each atom_site should occupy a single row, so that simple unix and awk utilities can still work. There will be a standard order within the loops. Many character data values will be restricted to char(8) or less, rather than the mmCIF 80 character limit.

Martyn identified various problems, including:

(1) Syntax of atom selection
(2) Speed - much slower than ASCII reads
(3) FOO.label_seq_id is only legitimate for atoms within a polymer. All others would have to be ``numbered'' by FOO.auth_seq_id. [Note: Paula Fitzgerald pointed out that this means HOHs and non-polymer ligands have no ``residue'' number, and felt that the mmCIF committee had not initially intended this to be the case.]
(4) Use of esd (estimated standard deviation) instead of su (standard uncertainty)

Gerard Kleywegt commented on the many practical problems of moving away from PDB atom format to mmCIF atom format. The program O, for example, would have great difficulty coping with extended chain_id fields.

Many programs manipulate PDB files, and these programs are easy to write without the inconvenience of having to link to a subroutine library that handles the reading of mmCIF files. This objection is, however, not serious, as many research sites load third-party software and, for example, link perl packages into their perl setup. If a simple, easy-to-use mmCIF coordinate subroutine library were readily available, writing a simple program would not be difficult and would only require an extra -lcif link at compile time. Gerard's point that PDB format atom lines are predictable (i.e. one knows where to look for a value) and unix friendly for use in shell scripts is a real obstacle to shifting to mmCIF format for in-house structure solution.

mmCIF tools and/or mmCIF format restrictions are required to gain acceptance. These tools should allow equivalents to grep and sort that do not inconvenience users.

Secondly, mmCIF allows ``special'' characters that are also unix unfriendly. The libccif routines do allow for formatting and aligning columns, even when a value is represented as unknown or not applicable with the ``.'' and ``?'' characters. These values may be confusing to unix commands.

It was generally agreed that a coordinate mmCIF file should have as the first column _atom_site.group_PDB, i.e. the word "ATOM", and that all columns for an atom should be placed on a single line, extending the current mmCIF line limit from 80 characters to unlimited size. Peter Keller's libccif routines do allow for formatting and alignment of columns. This would make the files more unix friendly.
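
To show why the one-line-per-atom restriction matters, here is a minimal Python sketch of a grep-like selection that relies only on that layout. It assumes that every atom_site row sits on a single line, starts with the word ATOM and contains no quoted values with embedded blanks. The column order, the function names and the rows themselves are made up for the example and are not a CCP4 convention.

```python
def atom_rows(lines):
    """Yield the whitespace-separated fields of each one-line ATOM record."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "ATOM":
            yield fields


def select(lines, column, wanted):
    """grep-like selection on one column (0-based index)."""
    return [f for f in atom_rows(lines) if f[column] == wanted]


def number_or_none(field):
    """Map the CIF markers '.' and '?' to None, anything else to float."""
    return None if field in (".", "?") else float(field)


# Made-up rows; the column order is an assumption for this example:
# group_PDB, id, label_atom_id, label_comp_id, label_asym_id, label_seq_id,
# Cartn_x, Cartn_y, Cartn_z, B_iso_or_equiv
example = [
    "loop_",
    "_atom_site.group_PDB",
    "ATOM 1 N  ALA A 1 10.000 12.500 3.250 15.0",
    "ATOM 2 CA ALA A 1 11.200 12.900 4.000 ?",
    "ATOM 3 O  HOH B . 20.000 5.000 7.500 30.0",
]
ca_atoms = select(example, 2, "CA")
print(ca_atoms)
print([number_or_none(f[9]) for f in atom_rows(example)])
```

Without the single-line rule a row could wrap across lines and this kind of line-oriented tool would silently miss fields, which is the concern raised in the discussion. The ``.'' and ``?'' markers still need explicit handling, as the last helper shows.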

The community at large needs to be aware that if one package alone shifts to using mmCIF for coordinates, a lossless mmCIF_to_PDB conversion program will be required as one alternates between, for example, refinement and model building.

One unanimous request from the software developers presented to the deposition sites was to re-instate the ``prime'' character in atom names that is currently replaced with an ``*'', especially for nucleic acid and sugar atom names.

Liz Potterton gave a presentation on the problems of, and a possible solution for, the interactive or command file commands that are required to select atoms, residues and their ranges as input to working crystallographic programs.

5. External Reference Files (e.g. Refinement Dictionaries)

It is desirable to have a standard format for the ERFs which describe the expected geometric properties of common macromolecular ``monomers'', i.e. the amino-acids, nucleotides, or ligands contained within a structure.

Helen Berman and John Westbrook described the development of standard files for nucleic acids. Their method was outlined as:

 Select structures from the CSD
 Get average values for valence geometry
 Get an average coordinate set
 Save history of source CSD files that contribute to the model
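
The averaging and history steps in the outline above can be sketched as follows. This is an assumed, simplified reading of the procedure and not the NDB's code: the function name is hypothetical, only bond lengths are treated, and the numbers and source labels are made up for illustration.

```python
from statistics import mean, stdev


def average_geometry(observations):
    """Average each bond length over the contributing structures.

    `observations` maps a bond name to a list of (source_id, length) pairs.
    Returns, per bond, the mean, the sample standard deviation and the
    list of sources, so the history of contributing entries is kept.
    """
    summary = {}
    for bond, values in observations.items():
        lengths = [length for _, length in values]
        summary[bond] = {
            "mean": mean(lengths),
            "sd": stdev(lengths) if len(lengths) > 1 else None,
            "sources": [source for source, _ in values],
        }
    return summary


# Made-up numbers and source labels, for illustration only.
observations = {
    "C1'-C2'": [("entry_a", 1.52), ("entry_b", 1.54), ("entry_c", 1.53)],
}
result = average_geometry(observations)
print(result)
```

Keying the observations on sugar conformation as well as bond name would keep the different conformations separate rather than averaging across them.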

One highlight found for the nucleic acid models is that there is a real difference in the distance and angle values found for sugars in different conformations.

A La Mode is an NDB environment for building models of ligand and monomer molecular components. It is based on queries to the CSD, and allows simple analysis of the multiple models (A La Mode is available on the web, with a mirror for Europe).

Kim Henrick presented a brief overview (some of the slides are available) of the new CCP4/REFMAC refinement data item additions to the mmCIF dictionary. The additions (see "CCP4 proposed extensions") involve both additions to existing categories in the chem_comp_group and the chem_link_group, and new categories. The new categories, ccp4_lib_group and ccp4_chem_mod_group, describe respectively atom type properties (equivalent to CNS files) and functional mmCIF data items that follow existing CNS and CHARMM chemical patches to MODIFY, DELETE or ADD to existing chem_comp entries.

The CCP4/REFMAC proposal is also to transfer chemical information instructions from command files used in refinement procedures that generate the molecular description restraints to the header of the mmCIF coordinate file. Additions to the entity_group and struct_group categories allow for a complete description of all linkages and modifications to be held in the self defining mmCIF header.

Alexei Vagin has tools, SMILES2DICT and MAKEDICT, to create ERFs from coordinates, or SMILES strings. Additional tools use the ERFs, plus coordinates to generate lists of restraints for refinement.

Choices have to be made as to whether all possible chemical species described by a current three letter code are held in the reference file, or whether they are generated via a chem_mod instruction set. For example, for alanine (ALA), how are D-ALA, L-ALA, N-terminal D-ALA and C-terminal D-ALA to be described, both when they are components of a polymer chain and when they are the free amino acid complexed to a protein?

The CCP4 proposal is to hold some of this information in an extended _entity_poly_seq loop structure that will also cope with branched chain polymers:

Proposal for _entity_poly_seq

ASP 1 AA . n/a NH3
PRO 3 AA CIS   2 .

Linkages between different entities are listed in the standard category _struct_conn; however, linkages common to all instances of a particular entity would be listed in the new definitions:

Proposal for _entity_link


SS AA CYS  7 CYS  96

Instances of other modifications to non-polymer constituents would be held in the structure:

Proposal for _struct_asym_mod


DEL-O1      263  MAN  .       Bb

6. Validation

A session was held on validation, on the extent to which the deposition centre carries out validation tests, and on to whom the results are made available. Tom Taylor gave an overview of the contributing groups that are part of the European validation network, CRITQUAL (see the EC comprehensive validation package). These groups have devised a series of geometrical and structure factor tests that may be used to give confidence levels for how well a crystal structure fits the observed density and how well its geometry agrees with targets. The Uppsala group has selected 25 structures that have structure factors deposited and has undertaken to refine (using CNS) and model build (using O) this set of structures (see Validation using Structure Factors). A careful record of a manual examination of the density maps is then made, detailing for example possible disorder, side chain re-placement, differences in deposited water structure and possible sequence mismatches. The other members of CRITQUAL will then be asked to use their methods to give overall and per-residue quality indicators, and all the results will then be correlated to see how well the geometric quality indicators pinpoint problem regions. The group plans to publish its conclusions by the end of this year, at which stage the deposition centres may apply the recommendations. The CRITQUAL group also includes an NMR component that will recommend a test method using NOEs and restraints.

There followed a detailed discussion on what can be validated, and how any validation result or confidence level can be used and misused. Eleanor Dodson expressed the view that validation is part of structure solution and refinement: as new knowledge is established, it can initially be used to identify errors, but in the long run it will quite properly be used during the process itself. This was met with the comment that, as more geometrical tests between found and target values are used and felt to be required, the refinement programs will simply apply restraints to match them, and one is then in the position of aiming for what could be a less likely correct structure. This is particularly true for X-ray structure determinations at resolutions of about 1.3 Å. Zbyszek Otwinowski pointed out that for refinement of structures at this resolution one may choose either to weight the geometry input highly and meet the target requirements, or to weight the structure factor input highly and then tend to get the more likely correct structure at the expense of target geometry.

There was universal agreement that deposition of experimental data is important and should be obligatory. There was a heated discussion over how to detect errors, and on the basic problem of what constitutes experimental data and how biased data derived from experimental data, including intensities, might be.

It was felt that upon deposition gross errors (e.g. in Ramachandran plots, or poor electron density fit) should be summarised and presented to the depositor for annotation. It was recognised that the end users of the macromolecular database can have a multitude of views on the data and that reliability indicators are required across all entries, per residue and per atom.

There were conflicting views on what to use as indicators of individual parameter reliability. Phil Evans and Zbyszek Otwinowski favoured B-values, while Garib Murshudov suggested that the B-value is an estimate of the displacement parameter and not of reliability, and that B-values are not reliable anyway. There is a correlation between B and the SU of atomic parameters, but it is not 1:1. The use of real space correlation as a reliability indicator was questioned, as it depends on the map calculated and is in principle not an unbiased estimator.

There was some agreement that refinement programs should give better statistics and use refinement algorithms that can provide standard uncertainties (SUs).

The only conclusion that the deposition centres can therefore draw is to collect, store and derive everything, and to give users of a search and retrieval system choices that allow individuals to select an appropriate reliability indicator. Meaningful selection criteria can be readily built into a relational database system, and each user's preferred, or coached, viewpoint can be accommodated.

The deposition site can accept and store input quality indexes such as _struct_mon_prot.RSR_all. Deposited structure factors that contain the data items used to generate the phases for the laboratory's final map can be used to check and annotate density fit. Finally, a globally consistent database quality index can be derived using deposited amplitudes and a standard tool such as SFCHECK.

Geometrical checks can be carried out in a similar manner, against both entry-specific and global standards.

Unfortunately the workshop did not discuss NMR validation to the same extent. There was disagreement over whether NMR and X-ray structures should have a common ``quality index'' applied to all entries. Janet Thornton suggested that this was essential.


A presentation (some of the slides are available) was given on the EBI's view of how a macromolecular database should be organised in a relational database and how this structure could be populated. Apart from coordinates, validation data items, and experimental tables, the EBI database is primarily a hierarchy of tables based on consideration of an ENTRY as consisting of:

one or more  MacroMolecular Assemblies
    each of which contains one or more  Intact Biological Macromolecules
        each of which contains one or more  Chains
            each of which contains one or more  Residues
                each of which contains one or more  Atoms

At each level, properties such as BoundMolecules, Features, and derived values are held in associated tables. In many cases the first three levels are actually identical. The first level represents the molecule(s) found by the experiment and can be generated automatically and presented to the depositor for checking (see PQS automatic file server).
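
The five-level hierarchy can be pictured with a small Python sketch. This is only an in-memory illustration of the nesting described above, with hypothetical class and field names. It is not the EBI relational schema, in which each level would be a table and the associated properties would be held in further tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    name: str


@dataclass
class Residue:
    name: str
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Macromolecule:
    description: str
    chains: List[Chain] = field(default_factory=list)


@dataclass
class Assembly:
    description: str
    macromolecules: List[Macromolecule] = field(default_factory=list)


@dataclass
class Entry:
    entry_id: str
    assemblies: List[Assembly] = field(default_factory=list)

    def atom_count(self):
        """Walk the whole hierarchy and count atoms."""
        return sum(
            len(residue.atoms)
            for assembly in self.assemblies
            for molecule in assembly.macromolecules
            for chain in molecule.chains
            for residue in chain.residues
        )


# A one-residue toy entry, for illustration only.
ala = Residue("ALA", [Atom("N"), Atom("CA"), Atom("C"), Atom("O"), Atom("CB")])
entry = Entry(
    "example",
    [Assembly("monomer", [Macromolecule("protein", [Chain("A", [ala])])])],
)
print(entry.atom_count())
```

For a simple monomeric protein the assembly, macromolecule and chain levels each hold a single member, which is the case noted above where the first three levels are effectively identical.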

The second level is represented in the current PDB format guide by the SOURCE and COMPND sub-token MOL_ID. In the mmCIF dictionary there is currently no direct mapping for the PDB's MOL_ID. Problems with MOL_ID in the PDB arise as it is not used consistently. MOL_ID should be strictly defined and used as a classification of the overall found structure into, where possible, discrete polymer sub-component molecules. The sub-component molecules are biological sub-divisions rather than structure sub-divisions. The simplest example is a FAB/protein antigen complex that is the actual molecular assembly studied and presented in the resulting coordinates. However the complex in a biological sense consists of 2 MOL_ID's, the FAB chains and the protein antigen. The EBI's view is that further subdivision of the FAB molecule into the Heavy and Light chains is a structural sub-division and not a biological sub-division.

The current mmCIF definitions for and _struct_biol_gen allow for author choice in the number of and the content of's. This does not exactly match the EBI's view that sub-biological structures are not the same as dividing a structure such as a homo-tetramer into sets of chains to derive or compare structural components such as different chain-chain interfaces.

The following proposal summarizes some of the ideas discussed at the workshop for expressing higher level chemical and biological features of structures using the mmCIF ENTITY and _struct_biol categories. John Westbrook has now formalised a proposed mmCIF match to the PDB MOL_ID.

John has pointed out a further benefit of the addition of the parent_biol_id, in that sub-biological entities related to the full biological assembly can be enumerated. These sub-biological entities appear to be the closest reasonable match to the spirit of the PDB MOL_ID. In most cases, MOL_ID's are associated with realizable biological building blocks rather than individual mmCIF entities.

These are:
_entity.msd_parent_entity_id where the value of _entity.msd_parent_entity_id identifies the parent entity for cases in which an entity is assembled from a collection of entities. Complex entities are assembled using the connectivity information in category entity_link.

_struct_biol.msd_parent_biol_id where the value of _struct_biol.msd_parent_biol_id identifies the parent structure for cases in which a biological assembly is generated from a collection of biological subunits.

Proposed Example

PP     POLYMER   MAN .  'Protein strand'
DNA    POLYMER   NAT .  'DNA strand'
LIG    THING     SYN .  '15 2,3-dihydroxy-1,4-dithiobutane'
SOL    SOLVENT   .   .  'water solvent structure'

C    DNA   'DNA 20-mer first chain of duplex'
D    DNA   'DNA 20-mer second chain of duplex'
A    PP    'Protein transcription factor first chain of homo-dimer'
B    PP    'Protein transcription factor second chain of homo-dimer'

 PPDD1  'Protein DNA/Complex'  .
 DD1    'DNA duplex'           PPDD1
 PP1    'Protein dimer'        PPDD1

PPDD1    A   1_555
PPDD1    B   1_555
PPDD1    C   1_555
PPDD1    D   1_555
PP1      A   1_555
PP1      B   1_555
DD1      C   1_555
DD1      D   1_555
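
As a sketch of how the parent identifier lets sub-biological units be enumerated, the Python below uses the rows of the proposed example. The loop headers are not shown in the example, so reading the third block as (biological unit id, details, parent id) and the fourth as (biological unit id, chain, symmetry operator) is an assumption, as are the function names.

```python
# Rows copied from the proposed example: (biol id, details, parent biol id).
struct_biol = [
    ("PPDD1", "Protein DNA/Complex", "."),
    ("DD1", "DNA duplex", "PPDD1"),
    ("PP1", "Protein dimer", "PPDD1"),
]

# Rows copied from the proposed example: (biol id, chain, symmetry operator).
struct_biol_gen = [
    ("PPDD1", "A", "1_555"), ("PPDD1", "B", "1_555"),
    ("PPDD1", "C", "1_555"), ("PPDD1", "D", "1_555"),
    ("PP1", "A", "1_555"), ("PP1", "B", "1_555"),
    ("DD1", "C", "1_555"), ("DD1", "D", "1_555"),
]


def sub_assemblies(biol_rows, parent_id):
    """Ids of the biological units whose parent is `parent_id`."""
    return [biol_id for biol_id, _, parent in biol_rows if parent == parent_id]


def chains_of(gen_rows, biol_id):
    """Chains that generate the given biological unit."""
    return [chain for unit, chain, _ in gen_rows if unit == biol_id]


def top_level(biol_rows):
    """Units with no parent; '.' is the CIF marker for 'not applicable'."""
    return [biol_id for biol_id, _, parent in biol_rows if parent == "."]


for unit in sub_assemblies(struct_biol, "PPDD1"):
    print(unit, chains_of(struct_biol_gen, unit))
```

Here the DNA duplex and the protein dimer come out as the two children of the full complex, which is the kind of grouping that the PDB MOL_ID is meant to express.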
