Thoughts on use of XML

Martyn Winn
For convenience, I am going to call the putative CCP4 XML "XtalML", though this may consist of several independent DTDs/schemas.

Where could we use XML?

Coordinate files

A new binary format is being implemented for coordinate data, with the ability to dump to PDB and mmCIF. The content of the binary format will be based on the mmCIF dictionary. Perhaps we can also dump to an XML file. This will be trivial if the DTD is based on the mmCIF dictionary, but will require complicated mapping otherwise.
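
As a minimal sketch of why a dictionary-based DTD would make the dump trivial: if element names are taken directly from mmCIF category and item names, each row of a category becomes one element and no hand-written mapping table is needed. The element layout, the "Category" wrapper name and the function name below are assumptions for illustration, not a proposed XtalML design; the item names are those of the mmCIF atom_site category, and the coordinates are invented.

```python
import xml.etree.ElementTree as ET

def category_to_xml(category, rows):
    """Dump rows of one mmCIF category as XML, one element per row.

    Element and child names are copied straight from the mmCIF names
    (hypothetical layout), so no per-item mapping is required.
    """
    parent = ET.Element(category + "Category")
    for row in rows:
        elem = ET.SubElement(parent, category)
        for item, value in row.items():
            ET.SubElement(elem, item).text = str(value)
    return parent

# Two atoms described with mmCIF atom_site item names (invented values).
atoms = [
    {"id": 1, "label_atom_id": "N", "label_comp_id": "MET",
     "Cartn_x": 27.340, "Cartn_y": 24.430, "Cartn_z": 2.614},
    {"id": 2, "label_atom_id": "CA", "label_comp_id": "MET",
     "Cartn_x": 26.266, "Cartn_y": 25.413, "Cartn_z": 2.842},
]

xml_text = ET.tostring(category_to_xml("atom_site", atoms), encoding="unicode")
print(xml_text)
```

A DTD that is not based on the dictionary would instead need an explicit translation from each mmCIF item to its XML counterpart.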

MTZ and map files

I think there is no question of replacing MTZ and map files with an XML format. It might be feasible to replace just the header sections with an XML-style section.

However, since users do not normally view MTZ and map files directly, there is little point in this. Rather, they use the mtzlib routine LHPRT to print a text layout of the header (into the log file or via mtzdump). This routine could be adapted to produce header information in XML.
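
To illustrate what an XML version of the printed header might look like, here is a rough sketch in Python rather than in the library's own language. The tag names (mtz_header, cell, column), the dictionary layout and the example values are all assumptions made for illustration; they do not reflect what LHPRT actually prints.

```python
import xml.etree.ElementTree as ET

def header_to_xml(header):
    """Render a header dictionary as an XML string (hypothetical tag names)."""
    root = ET.Element("mtz_header")
    ET.SubElement(root, "title").text = header["title"]
    # Cell dimensions are stored as attributes of a single <cell> element.
    cell = ET.SubElement(root, "cell")
    names = ("a", "b", "c", "alpha", "beta", "gamma")
    for name, value in zip(names, header["cell"]):
        cell.set(name, f"{value:.3f}")
    # One <column> element per column label and column type.
    for label, ctype in header["columns"]:
        ET.SubElement(root, "column", label=label, type=ctype)
    return ET.tostring(root, encoding="unicode")

# Invented example header.
header = {
    "title": "Example data set",
    "cell": (78.1, 78.1, 37.2, 90.0, 90.0, 90.0),
    "columns": [("H", "H"), ("K", "H"), ("L", "H"), ("FP", "F"), ("SIGFP", "Q")],
}
header_xml = header_to_xml(header)
print(header_xml)
```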

Log files

Automation relies on being able to track the progress of structure solution, and hence on the computer being able to read the log file. Therefore, there would be big advantages in replacing or supplementing the ad-hoc text log file with an XML document.
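
A small sketch of what this could buy us: if a program wrote its per-cycle statistics as XML, a driving script could follow progress with a generic parser instead of pattern-matching free text. The tag names (log, cycle, r_free) and the numbers are invented for the example and are not an agreed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML log, as a refinement program might write it.
LOG = """<log program="refinement">
  <cycle number="1"><r_factor>0.312</r_factor><r_free>0.345</r_free></cycle>
  <cycle number="2"><r_factor>0.274</r_factor><r_free>0.318</r_free></cycle>
  <cycle number="3"><r_factor>0.251</r_factor><r_free>0.309</r_free></cycle>
</log>"""

def r_free_by_cycle(log_text):
    """Return a list of (cycle number, free R factor) pairs from the log."""
    root = ET.fromstring(log_text)
    return [(int(c.get("number")), float(c.findtext("r_free")))
            for c in root.iter("cycle")]

def is_improving(values):
    """True if the last recorded value is lower than the first one."""
    return values[-1][1] < values[0][1]

history = r_free_by_cycle(LOG)
print(history, is_improving(history))
```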

There are some parallels here with data harvesting. However, data harvesting is designed to be read only by the deposition centre and not by the user's software. Any changes would need to be agreed by EBI.

Documentation

This is currently in HTML. Upgrading to XML implies looking at the documentation as data rather than a simple document, and would involve substantial re-writing. I am not sure that there is a good reason to do this.

ccp4i

Alun suggests that the ccp4i database (file CCP4_DATABASE/database.def and related .def files) be converted to XML, and also that program input be in XML. The latter only makes sense if it is machine-written, e.g. from ccp4i.

XML vs. mmCIF

There are obvious similarities between XML and mmCIF. Both aim at self-describing data. The XML DTD plays the role of the mmCIF dictionary. What are the differences?

Advantages of XML

  1. Can be written/read by generic tools.

Advantages of mmCIF

  1. Already established in crystallography.
  2. Includes data typing.
  3. Dictionary includes semantics (_item_description and _item_examples).

XML schemas (as opposed to DTDs) address some of these differences, notably data typing.

The DTD for XtalML

We can't do anything without having a DTD or schema. This should be well thought out, as it is likely to get locked in. Much of it could be based on the existing mmCIF dictionary; is there an easy way to convert one to the other? (I have a note which says that Peter Murray-Rust's JUMBO can do this.) But we are likely to want to add extra tags, e.g. specific to log file processing. We should also check existing DTDs for relevance. See the list at www.xml.org under "science", which includes CML, MathML and a few biological ones.

Implementing data files

There is the question of whether we embed XML in other files, or whether we use XML-only files. While it would be nice to have true XML files, putting up such a barrier may in practice stop us doing anything! At the least we would need an "everything else" tag, such as the <pre> tag used in HTML log files. Alternatively, we could put XML and non-XML content in different files.
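
For the embedded option, a reader only needs to pick out the XML fragments and ignore everything else. The sketch below assumes each fragment is wrapped in a hypothetical <xtalml> element that is never nested and that holds numeric <result> values; both the wrapper name and the result names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A mixed file: free-format text with embedded XML fragments.
MIXED_LOG = """\
 Some free-format program output
 that a parser should skip.
<xtalml><result name="resolution" units="A">1.8</result></xtalml>
 More free-format output.
<xtalml><result name="completeness" units="%">98.5</result></xtalml>
"""

# Non-greedy match, so that each embedded fragment is parsed separately.
FRAGMENT = re.compile(r"<xtalml>.*?</xtalml>", re.DOTALL)

def embedded_results(text):
    """Collect name -> value for every <result> found in embedded fragments."""
    results = {}
    for match in FRAGMENT.finditer(text):
        fragment = ET.fromstring(match.group(0))
        for result in fragment.iter("result"):
            results[result.get("name")] = float(result.text)
    return results

results = embedded_results(MIXED_LOG)
print(results)
```

The price of this approach is that the file as a whole is not an XML document, so generic XML tools cannot be pointed at it directly.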

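There is also the problem of keeping a continuously updated file (e.g. a log file) valid XML: until the closing tag of the root element has been written, the file is not a well-formed document.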
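
One possible way to live with this, sketched here under assumptions, is to read the growing file with an incremental parser that reports each element as soon as it is closed, without waiting for the end of the document. The example uses XMLPullParser from Python's standard xml.etree.ElementTree module; the tag names and the chunk boundaries are invented.

```python
import xml.etree.ElementTree as ET

def completed_cycles(chunks):
    """Yield the 'number' attribute of each <cycle> as soon as it is closed."""
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "cycle":
                yield elem.get("number")

# The log is still being written: the root <log> element is never closed,
# and the third cycle has only just been opened.
partial_log = [
    '<log><cycle number="1"><r_free>0.345</r_free></cycle>',
    '<cycle number="2"><r_free>0.3',
    '18</r_free></cycle><cycle number="3">',
]
seen = list(completed_cycles(partial_log))
print(seen)
```

This helps a program that reads the log while it grows, but it does not make the partial file acceptable to tools that insist on a complete document.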

XSL

We need style! XSL can also do processing, such as selecting the parts of a file relevant to a particular context (cf. summary logfiles).

What do we do now?

We have been thinking about this for some time, see Liz's document. We can't make anything public yet (waiting on Netscape), but it is probably time to do something internally:
  1. Decide on area of interest (first section above).
  2. Write draft DTD/schema.
  3. Decide on best way to read and write data files.

m.d.winn@ccp4.ac.uk
Last modified: Mon Sep 18 10:17:48 BST 2000