WA4 4AD, U.K.
The EBI plan to be in a position to accept harvest files in Autumn 1999. Meanwhile, changes are being made to CCP4, MOSFLM and other common programs to produce harvest files. In this article, I will describe the relevant changes to CCP4.
This has been implemented in CCP4 by adding information on Project and Dataset names to the header of the MTZ file. In a merged MTZ file, datasets are held as one or more data columns. In addition to the label and type attributes, each column now has an extra attribute specifying to which dataset it belongs. A list of all datasets included in the file, with the corresponding Project and Dataset names, is held separately in the MTZ header.
The code changes necessary to manipulate this information were included in CCP4 release 3.5. Ideally, dataset information should be added to the MTZ file at the beginning, e.g. in MOSFLM, but this information can be added at any time, most conveniently with the program CAD. Once the information is in the MTZ file, it can be checked by running mtzdmp which shows all the MTZ header information (go on, try it!), including the list of datasets:
    * Number of Datasets = 4

    * Dataset ID, protein name, dataset name:

            1 TOXD NATIVE
            2 TOXD DERIV_AU
            3 TOXD DERIV_MM
            4 TOXD DERIV_I

and the datasets which each column corresponds to:
    * Column Labels :

    H K L FTOXD3 SIGFTOXD3 ANAU20 SIGANAU20 FAU20 SIGFAU20 FMM11 SIGFMM11 FI100 SIGFI100 FreeR_flag

    * Column Types :

    H H H F Q D Q F Q F Q F Q I

    * Associated datasets :

    1 1 1 1 1 2 2 2 2 3 3 4 4 1
In CCP4, columns to be used are selected from the MTZ file by the LABIN keyword; for example, the command
    LABIN FP=FMM11 SIGFP=SIGFMM11

tells the program to use the 10th and 11th columns. In addition, the program now also knows that these columns are from the 3rd dataset, with Project Name TOXD and Dataset Name DERIV_MM.
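The lookup just described can be sketched in a few lines of Python. This is only an illustration and not CCP4's own code: the lists are copied from the mtzdmp output above, and `resolve_labin` is a hypothetical helper name.

```python
# Header information copied from the mtzdmp listing in this article.
DATASETS = {1: ("TOXD", "NATIVE"), 2: ("TOXD", "DERIV_AU"),
            3: ("TOXD", "DERIV_MM"), 4: ("TOXD", "DERIV_I")}
LABELS = ("H K L FTOXD3 SIGFTOXD3 ANAU20 SIGANAU20 FAU20 SIGFAU20 "
          "FMM11 SIGFMM11 FI100 SIGFI100 FreeR_flag").split()
COLUMN_DATASET = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 1]


def resolve_labin(line):
    """Map each program label in a LABIN line to (position, project, dataset).

    Hypothetical helper for illustration; column positions are 1-based.
    """
    result = {}
    for assignment in line.split()[1:]:      # skip the LABIN keyword itself
        program_label, file_label = assignment.split("=")
        index = LABELS.index(file_label)     # raises ValueError if absent
        project, dataset = DATASETS[COLUMN_DATASET[index]]
        result[program_label] = (index + 1, project, dataset)
    return result


print(resolve_labin("LABIN FP=FMM11 SIGFP=SIGFMM11"))
```

Both assignments resolve to the same dataset, which is what allows a program to label its harvest output with the correct Project and Dataset names.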
Unmerged or multi-record MTZ files are treated slightly differently. In this case, a particular column may correspond to several datasets, distinguished by different batch numbers. Datasets are therefore attached to batches rather than columns, and a pointer to the relevant dataset is held in the batch header.
As an aside, classifying MTZ columns according to dataset has other uses. Previously, columns were assumed to be independent entities, but this is clearly not so: F(+) and F(-) columns belong together, as do F and sigmaF columns. Some programs now use dataset information to check for such dependencies; for example, the program REINDEX may need to swap F(+) and F(-) columns and therefore needs to identify which F(+) column goes with which F(-) column.
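A dependency check of this kind might look like the following Python sketch. It is not taken from REINDEX: the column labels, the "(+)"/"(-)" suffix convention used to tell the two members apart, and the function name `pair_anomalous` are all assumptions made for illustration.

```python
from collections import defaultdict


def pair_anomalous(columns):
    """Pair F(+) and F(-) columns that belong to the same dataset.

    `columns` is a list of (label, dataset_id) tuples. Telling the two
    members apart by a "(+)" or "(-)" suffix is an assumption of this sketch.
    """
    plus, minus = defaultdict(list), defaultdict(list)
    for label, dataset_id in columns:
        if label.endswith("(+)"):
            plus[dataset_id].append(label)
        elif label.endswith("(-)"):
            minus[dataset_id].append(label)

    pairs = {}
    for dataset_id in plus:
        # Only an unambiguous one-to-one match within a dataset is accepted.
        if len(plus[dataset_id]) == 1 and len(minus[dataset_id]) == 1:
            pairs[dataset_id] = (plus[dataset_id][0], minus[dataset_id][0])
    return pairs


# Hypothetical columns, deliberately listed out of order.
columns = [("F_AU(+)", 2), ("F_MM(+)", 3), ("F_MM(-)", 3), ("F_AU(-)", 2)]
print(pair_anomalous(columns))
```

Without the dataset identifier, the only way to match the columns would be to guess from their labels or from their order in the file.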
The environment variable $HARVESTHOME defaults to the user's home directory, but could be changed, for example, to a group project directory.
At the end of a project, the entire contents of the directory $HARVESTHOME/DepositFiles/<projectname> can be sent to the deposition centre for processing. Note that, because of the file-naming scheme, only the last run of a particular program with a particular dataset will be preserved, and it is the user's responsibility to ensure that this is the authoritative version. The USECWD keyword can be used to send deposit files from speculative runs to the local directory rather than the official project directory. This keyword can also be used when the program is being run on a machine without access to the directory $HARVESTHOME, in which case the user must transfer the deposition file afterwards.
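The rules for where a deposit file ends up can be summarised in a short Python sketch. The function `deposit_directory` and the example path `/data/group` are hypothetical; the sketch mirrors the behaviour described in the text and does not reproduce the CCP4 library code.

```python
import os
from pathlib import Path


def deposit_directory(project, use_cwd=False, environ=None):
    """Return the directory a deposit file would be written to.

    Hypothetical helper: it follows the rules in the text, not CCP4's code.
    """
    environ = os.environ if environ is None else environ
    if use_cwd:
        # USECWD: keep speculative runs out of the official project directory.
        return Path.cwd()
    # $HARVESTHOME falls back to the user's home directory when unset.
    harvest_home = Path(environ.get("HARVESTHOME") or Path.home())
    return harvest_home / "DepositFiles" / project


# A group project directory (hypothetical path) instead of the home directory.
print(deposit_directory("TOXD", environ={"HARVESTHOME": "/data/group"}))
```

Passing the environment as an argument keeps the sketch easy to try out without changing the real $HARVESTHOME setting.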
In summary, the extra keywords associated with harvesting that will be included in most programs are:
    data_TOXD[NATIVE]

    _entry.id                        TOXD
    _diffrn.id                       NATIVE
    _audit.creation_date             1999-07-08T11:19:51+01:00
    _software.classification         phasing
    _software.contact_author         'Z.Otwinowski or E.Dodson'
    _software.contact_author_email   'firstname.lastname@example.org, email@example.com'
    _software.description            'maximum likelihood heavy atom refinement & phase calculation'
    _software.name                   mlphare
    _software.version                CCP4_3.5

This is followed by details such as the cell dimensions and symmetry information, and then by a summary of the results, for example the figures of merit for the phases obtained:
    loop_
    _phasing_MIR_shell.d_res_high
    _phasing_MIR_shell.d_res_low
    _phasing_MIR_shell.reflns
    _phasing_MIR_shell.fom
    _phasing_MIR_shell.reflns_centric
    _phasing_MIR_shell.fom_centric
    _phasing_MIR_shell.reflns_acentric
    _phasing_MIR_shell.fom_acentric
      9.56  15.00    61  0.484    41  0.553    20  0.343
      7.01   9.56    80  0.315    36  0.423    44  0.227
      5.54   7.01   120  0.351    45  0.502    75  0.261
      4.58   5.54   186  0.338    61  0.506   125  0.256
      3.90   4.58   255  0.327    68  0.484   187  0.270
      3.40   3.90   345  0.276    86  0.417   259  0.230
      3.01   3.40   430  0.271    90  0.446   340  0.225
      2.70   3.01   536  0.287   108  0.454   428  0.245

The deposit files should be easily readable, but they should not be altered - they represent an authentic record of the structure solution process.
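Since the files are plain text, they can be inspected without editing them. As an illustration, the Python sketch below reads the loop shown above from a string and confirms that the centric and acentric counts add up to the total in each resolution shell. The parser copes only with this simple case of unquoted, whitespace-separated values and is no substitute for a proper mmCIF reader; `parse_simple_loop` is a hypothetical name.

```python
LOOP = """\
loop_
_phasing_MIR_shell.d_res_high
_phasing_MIR_shell.d_res_low
_phasing_MIR_shell.reflns
_phasing_MIR_shell.fom
_phasing_MIR_shell.reflns_centric
_phasing_MIR_shell.fom_centric
_phasing_MIR_shell.reflns_acentric
_phasing_MIR_shell.fom_acentric
  9.56  15.00    61  0.484    41  0.553    20  0.343
  7.01   9.56    80  0.315    36  0.423    44  0.227
  5.54   7.01   120  0.351    45  0.502    75  0.261
  4.58   5.54   186  0.338    61  0.506   125  0.256
  3.90   4.58   255  0.327    68  0.484   187  0.270
  3.40   3.90   345  0.276    86  0.417   259  0.230
  3.01   3.40   430  0.271    90  0.446   340  0.225
  2.70   3.01   536  0.287   108  0.454   428  0.245
"""


def parse_simple_loop(text):
    """Parse one loop_ block with unquoted, whitespace-separated values."""
    tags, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            tags.append(line.split(".", 1)[1])   # drop the category prefix
        else:
            values.extend(float(v) for v in line.split())
    width = len(tags)
    return [dict(zip(tags, values[i:i + width]))
            for i in range(0, len(values), width)]


shells = parse_simple_loop(LOOP)
for shell in shells:
    # Centric and acentric reflections should account for every reflection.
    assert shell["reflns_centric"] + shell["reflns_acentric"] == shell["reflns"]
print(len(shells), "shells,", int(sum(s["reflns"] for s in shells)), "reflections")
```

The sketch only reads the data; anything derived from a deposit file in this way should be kept separate from the file itself.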