CCP4 HAPPy ♦ Documentation: design/database.html

Links in this document:
|example program flow||Graphviz|

HAPPy database

The current database API (in HappyDatabase.py) used by HAPPy is purely a program flow history and tracking database. A basic program flow implementation is complete and has been in use for over 6 months. The data storage and tracking is only partially complete at present.

We anticipate that it will be replaced by or become a wrapper to the next generation CCP4i database, which is being developed by Wanjuan (Wendy) Yang and Peter Briggs with extensive discussion of HAPPy requirements.

The emphasis in developing the database elements of HAPPy has therefore been to implement what is necessary for HAPPy to develop, trying out new ideas which are informing some of the CCP4i development. This has been done bearing in mind that a lot more work in this area is being done by Wendy and Pete, so care has been taken not to unnecessarily duplicate effort or do more than is immediately useful to HAPPy.

DataStore

Currently partly functional is a database for storing the information held in core HAPPy classes automatically in XML files. This is currently very basic, and reading the information back in is not yet implemented. The idea is simply to base each core class on a datastore class which automatically assigns a unique key to each instance of the class and serialises every instance of that class to XML.

The "core" classes are those which contain key data which may be needed again later for analysis and possibly for restarting HAPPy from points later in the run. At present the core classes are those listed below along with their DataStoreTable table names (see below). HappyMolXtal.Crystal should probably be made a core class too.

  • HappyExperimentalData.ExperimentalDataHandler (experimental_data)
  • HappyDatabase._Node (history_node)
  • HappyDatabase._NodeLink (history_node_link)
  • HappyMolXtal.Limits (limits)
  • HappyProject.PhasingAttempt (phasing_attempt)
  • HappyPhasing.ExpPhasedDataInfo (phasing_result)
  • HappyPhasing.Substructure (substructure)
  • HappyMolXtal.Target (target)

The key elements of this approach are as follows.

  • There is a singleton HappyDatabase.DataStore which is an instance of the HappyDatabase._DataStoreDB class. This object controls the datastore.
  • Each core class uses HappyDatabase.DataStoreTable as its base class, so-called because each derived class is basically a database table, with each instance being a record in that table.
    • Each core class __init__ method calls the DataStoreTable.__init__ method, passing it a table name for the table corresponding to the class. This name is used when serialising the class to XML and in generating a unique reference for instances of the class.
      • The DataStoreTable.__init__ method obtains a numerical id from DataStore which is unique for that class within the HAPPy project.
    • Each core class has an xml method (which uses the HappyXML.XML class) to serialise all the required information in that class to XML.
      • Where a core class contains instances of another core class the XML link for this reference should be generated using the add_xml_ref_to_ method of the object.
  • When the DataStore.close method is called all instances of DataStoreTable objects are written to disc using their xml methods. This means that only the final state of these objects is saved, so if it is necessary to save versions at different stages in the pipeline, multiple instances of the objects should be created using the copy method of the core class.
  • Care should be taken when new copies of core classes are required. The DataStoreTable class provides a default copy method which creates a deepcopy of an object, gives it a new id from DataStore and registers it with the DataStore. A copy made this way contains completely independent copies of all data and objects in the original; the objects within the copy are no longer bound to those in the source object, so changes to objects in the source will not appear in the corresponding objects in the copy. In many cases this is not appropriate. In particular, if the core class instance (X) contains an instance of a mutable core class (Y), X.copy will return X', a copy of X containing Y', a copy of Y. Y' will have the same id as Y but will not be bound to it, so if Y is then changed, the change will not be seen in Y', and it is no longer clear which object is really Y. It is therefore essential that if a core class contains instances of another core class, or of a mutable object which must be unique, a custom copy method is written to ensure correct binding of objects in the copy. This copy method must also make sure the copy gets a new id and is registered, which is straightforward. See HappyPhasing.Substructure and HappyProject.PhasingAttempt for examples of core classes with custom copy methods.
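
The pattern described in this list can be sketched roughly as follows. This is a minimal illustration and not the code in HappyDatabase.py: the names _DataStoreDB, DataStore, DataStoreTable and copy come from the text above, while new_id, register, records, the attribute names and the two toy core classes (ToyTarget, ToyAttempt) are assumptions made for the example.

```python
import copy


class _DataStoreDB:
    """Toy stand-in for the datastore: hands out ids and tracks records."""

    def __init__(self):
        self._next_id = {}   # table name -> last id handed out (assumed layout)
        self.records = []    # every registered DataStoreTable instance

    def new_id(self, table):
        # ids are unique per table (i.e. per core class) within the project
        self._next_id[table] = self._next_id.get(table, 0) + 1
        return self._next_id[table]

    def register(self, obj):
        self.records.append(obj)


DataStore = _DataStoreDB()  # module-level singleton


class DataStoreTable:
    def __init__(self, table):
        self.table = table
        self.id = DataStore.new_id(table)
        DataStore.register(self)

    def copy(self):
        # Default copy: fully independent deep copy with a new id.
        new = copy.deepcopy(self)
        new.id = DataStore.new_id(self.table)
        DataStore.register(new)
        return new


class ToyTarget(DataStoreTable):       # hypothetical mutable core class (Y)
    def __init__(self, name):
        DataStoreTable.__init__(self, "target")
        self.name = name


class ToyAttempt(DataStoreTable):      # hypothetical core class (X) holding a Y
    def __init__(self, target):
        DataStoreTable.__init__(self, "phasing_attempt")
        self.target = target

    def copy(self):
        # Custom copy: shallow copy, so the contained core object stays shared,
        # then give the copy its own id and register it.
        new = copy.copy(self)
        new.id = DataStore.new_id(self.table)
        DataStore.register(new)
        return new


y = ToyTarget("lysozyme")
x = ToyAttempt(y)

default_copy = DataStoreTable.copy(x)  # deep copy: contains Y', not Y
custom_copy = x.copy()                 # custom copy: still bound to Y

y.name = "renamed"                     # seen by custom_copy, not by default_copy
```

In this sketch the default copy ends up holding an unregistered Y' that shares Y's id, which is the ambiguity the text warns about; the custom copy avoids it by keeping the reference to the original Y.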

Possible improvements

There is currently no implementation to restore classes from the XML serialisations. Custom methods will have to be written for each core class since the XML does not contain all attributes of the classes; this is intentional, since information which should be internal to HAPPy should be stored elsewhere if it cannot be recreated. To correctly restore classes which contain other core classes, two passes might be necessary: one to re-instantiate all core objects without their bindings to other core objects, then one to rebind the links between them once all instances exist.
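
A two-pass restore of the kind suggested here might look something like the sketch below. The XML layout (element names, the id and target_ref attributes) and the toy classes are invented for illustration, since the real HAPPy serialisation format is not reproduced in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialisation: element and attribute names are invented here.
XML = """
<datastore>
  <phasing_attempt id="1" target_ref="1"/>
  <target id="1" name="lysozyme"/>
</datastore>
"""


class ToyTarget:
    def __init__(self, id, name):
        self.id, self.name = id, name


class ToyAttempt:
    def __init__(self, id):
        self.id = id
        self.target = None   # bound in the second pass


def restore(xml_text):
    root = ET.fromstring(xml_text.strip())
    objects = {}   # (table, id) -> instance
    pending = []   # (object, attribute name, (table, id)) links to rebind

    # Pass 1: instantiate every object without its links to other objects.
    for elem in root:
        key = (elem.tag, int(elem.get("id")))
        if elem.tag == "target":
            objects[key] = ToyTarget(key[1], elem.get("name"))
        elif elem.tag == "phasing_attempt":
            obj = ToyAttempt(key[1])
            objects[key] = obj
            ref = ("target", int(elem.get("target_ref")))
            pending.append((obj, "target", ref))

    # Pass 2: all instances now exist, so references can be resolved.
    for obj, attr, ref in pending:
        setattr(obj, attr, objects[ref])
    return objects


objects = restore(XML)
```

The phasing_attempt element deliberately appears before the target it refers to, which is the situation a single pass could not handle.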

This DataStore is probably temporary; it is a very crude and very partial implementation of an object oriented database (or would be if the above issues were cleanly handled). A more generic/standard approach should be employed later, probably using the next generation CCP4i database.

Some basic methods for listing/returning all objects in a particular table or specific ids would make it possible to further implement a more database like use of core data in HAPPy, making the possible transition to the CCP4i database easier later on.

History and program flow

The flow of the code is considered as a series of actions, or nodes, where the code makes a decision or processes some data. These nodes are connected by node_links, each of which links one node to one next node. Through several node_links a node can have multiple connections. The node_links are internal to the implementation; connections are easily made through the API, described below. The network of connected nodes then represents the flow of execution of the code.

At present the data flow is not handled.

A figure showing the example program flow.

Nodes

Nodes in the database are HappyDatabase._Node objects and so are derived from the DataStoreTable class. They have the following attributes.
  • A unique numerical id (inherited from DataStoreTable).
  • The time at which the node was created, e.g. before running some associated CCP4 program.
  • The time at which the node completed, e.g. after completion of the associated CCP4 program.
  • Text describing the node, e.g. "Create master MTZ".
  • A list of the ids of the nodes immediately preceding the node.
  • A list of notes, one for each previous node, which can be used to describe something about the reason for this route through the code. For example if the previous node was a fork into two separate paths, each one for a different space group, the link to the current node might have a note indicating which space group it corresponds to.
  • A list of the ids of the nodes immediately following the node.
  • A node type string, indicating the type of node, e.g. the successful end of a path (END) or a fork into multiple paths (FORK). At present these types are just used to distinguish node types at the visualization stage.
  • A status for the node (which can change during the life of the node), currently one of SUCCESS, FAILED, KILLED or RUNNING.
Note that the lists of previous nodes, notes and next nodes are implicit properties of the node, and are in fact implemented internally through a list of node_links in the database. Each node_link is a HappyDatabase._NodeLink object, derived from DataStoreTable, which has a previous node, a next node and a note. This makes it clearer how the note describes the connection from one node to the next, effectively a small section of a particular route through the code.
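
The way the per-node lists fall out of the link records can be illustrated with a small sketch. The class and method names below (NodeLink, LinkTable, connect and so on) are hypothetical stand-ins, not the HappyDatabase._NodeLink implementation, and the space group names are only example notes.

```python
class NodeLink:
    """One record: a single connection from one node to one next node."""

    def __init__(self, prev_id, next_id, note=""):
        self.prev_id, self.next_id, self.note = prev_id, next_id, note


class LinkTable:
    def __init__(self):
        self.links = []

    def connect(self, prev_id, next_id, note=""):
        self.links.append(NodeLink(prev_id, next_id, note))

    # The three "attributes" of a node are just queries over the links.
    def prev_nodes(self, node_id):
        return [l.prev_id for l in self.links if l.next_id == node_id]

    def notes(self, node_id):
        return [l.note for l in self.links if l.next_id == node_id]

    def next_nodes(self, node_id):
        return [l.next_id for l in self.links if l.prev_id == node_id]


table = LinkTable()
table.connect(1, 2, "space group P41")   # node 1 forks into nodes 2 and 3
table.connect(1, 3, "space group P43")
table.connect(2, 4)
table.connect(3, 4)                      # both paths rejoin at node 4
```

Because notes and previous nodes are read from the same links in the same order, there is always exactly one note per previous node.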

Possible improvements

  • Core classes (e.g. PhasingAttempt) should be directly associated with each node, through NodeDataLink objects. These should contain the node id, a reference to the core object, and an attribute explaining the nature of that link, e.g. whether the object was created in the node, modified or just looked at. This will enable tracking of dataflow.

  • Add more specific node types, e.g. CCP4 program nodes with hklin etc. Different node types may have different/additional attributes, e.g. a CCP4 node with command file, parameters and output logs. Some of these may overlap with input/output data.

  • Store the data on disc or in a database for later retrieval. We hope eventually to use the BioXHIT database for this.

  • Grouping/nesting of nodes. This would mean complicated tasks doing one thing could be grouped together and expanded/shrunk as appropriate. A whole HAPPy run may be a single group of nodes in a complete structural solution process. Workflows/dataflows/data could be linked between different database projects, e.g. as Graeme suggested with processing the observations from MAD project as separate projects which are also linked into the overall MAD structural solution project.

Implementation

The history database API is in HappyDatabase.py.

A Graphviz dot file representing a flow diagram of the full history of the HAPPy run is written in the project root. If the GraphVizFormat option in the .happy file is set then a postscript and/or svg version of the diagram is also produced (this assumes that GraphViz is installed with the dot program in the system path). See the example dothappy file and the example program flow figure.
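
As an illustration of what writing such a dot file involves, the sketch below turns a set of nodes and links into Graphviz dot text. The function name, the input layout and the label format are assumptions for the example; the actual contents of the file HAPPy writes are not specified here.

```python
def history_to_dot(nodes, links):
    """Build dot text from nodes {id: (text, status)} and
    links [(prev_id, next_id, note)]. Labels are not escaped, so this
    assumes they contain no double quotes."""
    lines = ["digraph history {"]
    for node_id, (text, status) in sorted(nodes.items()):
        # "\\n" puts a literal \n in the dot file, which dot renders
        # as a line break inside the label
        lines.append('  n%d [label="%s\\n%s"];' % (node_id, text, status))
    for prev_id, next_id, note in links:
        lines.append('  n%d -> n%d [label="%s"];' % (prev_id, next_id, note))
    lines.append("}")
    return "\n".join(lines)


dot_text = history_to_dot(
    {1: ("Create master MTZ", "SUCCESS"), 2: ("Find sites", "RUNNING")},
    [(1, 2, "P41")],
)

# Written to a file (here called history.dot purely as an example), this can
# be rendered with:  dot -Tsvg history.dot -o history.svg
```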

HappyDatabase API

  • Any module which wishes to access the history database class should include the following import line.
    from HappyDatabase import History
  • The History.create(project_name) is called once in the initial Project object initialization to initialize the History for the run.

  • To add a new node to the database use node=History.new_node(text, prev_nodes=prev_nodes, notes=notes, type=type).

    • text is the description of the node.
    • prev_nodes is a list of the node objects immediately preceding this node. Omission of this parameter is only acceptable for the first node in a run (as created in the __init__ method of the initial Project).
    • notes is a list of notes, one for each element in prev_nodes. Each note is a string (which can be an empty string) which describes something about the path between the previous node and this one, e.g. that it was a fork for a particular space group. notes must either be unset, or be a list of strings with the same number of elements as prev_nodes.
    • type is the node type (as a string). It can be one of NORMAL, START, SELECT, FORK, COPY or END. It defaults to NORMAL if input paths were specified, START if not. Normal nodes are those which have the same single path as their input and output. FORK and COPY type nodes both split one path into multiple paths, but while the data associated with all output paths of a COPY node are identical, the output paths of a FORK node have different data. This distinction may become irrelevant once nodes have the individual input and output data specified.

    node is a HappyDatabase.Node object corresponding to this node.

    The node is assigned the status RUNNING, which can only be changed with node.finish() (see below).

    A new node cannot be created until all its prev_nodes no longer have the RUNNING status.

  • To complete a node whose Node object is node, use node.finish(status=status, newtype=newtype).

    • status is the end status of the node, which can be one of SUCCESS, FAILED or KILLED. The default is SUCCESS.
    • newtype can be used to change the type of node, e.g. if a process failed, leading to a NORMAL node becoming an END node.
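
The calling sequence can be tried out with the sketch below. Because HappyDatabase itself is not reproduced here, it uses a small stand-in class that mimics only the rules stated above (default type, RUNNING status, one note per previous node, no new node while a previous node is RUNNING). The method and parameter names follow the text; the stand-in classes, the exception types and the treatment of unset notes as empty strings are assumptions.

```python
class Node:
    def __init__(self, text, prev_nodes, notes, type):
        self.text, self.prev_nodes, self.notes = text, prev_nodes, notes
        self.type = type
        self.status = "RUNNING"          # every new node starts as RUNNING

    def finish(self, status="SUCCESS", newtype=None):
        if status not in ("SUCCESS", "FAILED", "KILLED"):
            raise ValueError("bad status: %s" % status)
        self.status = status
        if newtype is not None:
            self.type = newtype


class HistoryStandIn:
    """Stand-in for HappyDatabase.History, for illustration only."""

    def new_node(self, text, prev_nodes=None, notes=None, type=None):
        prev_nodes = prev_nodes or []
        if notes is None:
            notes = [""] * len(prev_nodes)   # assumed default for unset notes
        if len(notes) != len(prev_nodes):
            raise ValueError("need one note per previous node")
        if any(p.status == "RUNNING" for p in prev_nodes):
            raise RuntimeError("a previous node is still RUNNING")
        if type is None:
            type = "NORMAL" if prev_nodes else "START"
        return Node(text, prev_nodes, notes, type)


History = HistoryStandIn()

start = History.new_node("Start run")            # no prev_nodes: START
start.finish()                                   # default status SUCCESS

mtz = History.new_node("Create master MTZ", prev_nodes=[start], notes=[""])
try:
    History.new_node("Too early", prev_nodes=[mtz])   # mtz is still RUNNING
    blocked = False
except RuntimeError:
    blocked = True

mtz.finish(status="FAILED", newtype="END")       # failed path ends here
```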

Usage within HAPPy

Creation and updating of database nodes within HAPPy is done through functions in the HappyProject module and its PhasingAttempt class. These methods handle the internal book-keeping required to correctly use the database, in particular remembering the previous node along a given path, any notes associated with a path, and the forking and combining of multiple pathways. Paths or sections of paths through the execution of a HAPPy run map directly onto HappyProject.PhasingAttempt objects.

  • To create a new node along the path whose project object is PhasingAttempt, use node=PhasingAttempt.new_node(text, nodetype=nodetype, name=name).

    • The parameters are the same as for node=History.new_node(), and prev_nodes is automatically set.
    • name is a name for the node. This is included in the name of the node directory and is available simply to make navigating the node directory structure easier. See below for information about node directories.

    node is a HappyDatabase.Node object, which should therefore be completed using node.finish() as described above.

  • Each PhasingAttempt can have a note string attached to it using PhasingAttempt.set_note(note). This note string will then be used as the note for any new node along the path associated with that PhasingAttempt.

  • The PhasingAttempt.fork* methods create the new nodes in the database and set appropriate notes for the new PhasingAttempt objects.

  • A node which selects new PhasingAttempt objects from an input list of PhasingAttempt objects, projects, should be created using node=HappyPhasingAttempt.create_select_node(projects, text). This node must then be completed with node.finish().

Node directories

When a node is created through the PhasingAttempt new_node or fork* methods, a directory specifically for that node is created in the nodes subdirectory of the project run directory. This is where all files created within that node should be stored.

The PhasingAttempt.new_filename_base method returns a new unique filename stub within the node directory for the current active node for the PhasingAttempt. Repeatedly calling it while the same node is active returns a new unique stub each time within the same node directory. Basing the names of all filenames for new files on stubs from this function will ensure all files generated in a node are put in the correct node directory. PhasingAttempt.new_filename_base will raise an exception if called when there is no active node for that phasing attempt. During the main run of the pipeline (i.e. when HAPPy is actually doing crystallographic stuff) everything meaningful should be done in a node (to enable tracking), so (IMHO) discouraging storing the files outside the node is a good thing.
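
The behaviour described for new_filename_base can be sketched with a stand-in class. Only the method names new_node and new_filename_base and the nodes subdirectory come from the text; the class itself, finish_node, the directory naming scheme, the counter-based stubs and the exception type are assumptions for illustration.

```python
import os
import tempfile


class AttemptStandIn:
    """Stand-in for the node directory handling of a PhasingAttempt."""

    def __init__(self, run_dir):
        self.nodes_dir = os.path.join(run_dir, "nodes")
        self.active_dir = None    # directory of the currently active node
        self._counter = 0

    def new_node(self, node_id, name):
        # One directory per node; the name only makes navigation easier.
        self.active_dir = os.path.join(self.nodes_dir,
                                       "%d_%s" % (node_id, name))
        os.makedirs(self.active_dir)
        self._counter = 0

    def finish_node(self):
        self.active_dir = None

    def new_filename_base(self):
        if self.active_dir is None:
            raise RuntimeError("no active node for this phasing attempt")
        self._counter += 1        # a fresh stub on every call
        return os.path.join(self.active_dir, "file%03d" % self._counter)


run_dir = tempfile.mkdtemp()
attempt = AttemptStandIn(run_dir)
attempt.new_node(1, "master_mtz")
stub_a = attempt.new_filename_base()
stub_b = attempt.new_filename_base()
attempt.finish_node()
try:
    attempt.new_filename_base()   # no active node any more
    raised = False
except RuntimeError:
    raised = True
```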
