CCP4 HAPPy ♦ Documentation: design/input.html

Links in this document:
|XML project file||input XML DTD schema||TRUNCATE information|

Data required to start HAPPy

Starting HAPPy

HAPPy will be started from the command line by specifying an XML project file. This project file specifies everything required for the HAPPy project to run.

The project file will ultimately be generated by a CCP4i interface to HAPPy, or by a previous stage in an automated pipeline.

For text users this adds a burden of writing XML instead of plain text, but the clarity of the XML format is important.

TRUNCATE information

Some valuable information should be extracted from TRUNCATE runs.

Such information should include:

  • Wilson plots?
  • Twinning information (NB. we assume data was not twinned or has been detwinned).
  • Anisotropy information.

This information will be obtained through the files pointed to by <truncate_info> which should be XML output from TRUNCATE.

HAPPy project file format

We have now produced an initial HAPPy XML input format, schema and Python parser. The schema is currently specified in a DTD file (see the input XML DTD schema). It may later be re-implemented in XML Schema.

File locations must be specified either with absolute paths, or relative to the directory from which HAPPy is started. The project directory can be specified with the dir attribute to <project>, which can be overridden from the command line when running HAPPy. If neither is specified the project directory defaults to the directory containing the input file.
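
The precedence rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names resolve_project_dir and resolve_data_file are hypothetical and are not part of HAPPy, and the caller is assumed to have already read the dir attribute and any command-line override.

```python
from pathlib import Path


def resolve_project_dir(input_file, dir_attr=None, cli_dir=None):
    """Pick the project directory (hypothetical helper, not HAPPy's own code).

    Precedence: command-line value, then the dir attribute of <project>,
    then the directory containing the input file.
    """
    if cli_dir:
        return Path(cli_dir)
    if dir_attr:
        return Path(dir_attr)
    return Path(input_file).resolve().parent


def resolve_data_file(file_attr, start_dir):
    """Absolute paths are kept; relative ones are taken from the start directory."""
    path = Path(file_attr)
    return path if path.is_absolute() else Path(start_dir) / path


# The command line wins over the dir attribute.
print(resolve_project_dir("project.xml", dir_attr="from_xml", cli_dir="from_cli"))
# A relative data file is resolved against the directory HAPPy was started from.
print(resolve_data_file("gere_peak.mtz", "/work/gere"))
```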

  • The root element is <project>. The <project> element has required attributes name (the project name), phasing_mode (MAD, SAD, SIRAS, MIRAS or MIR), date_created (in YYYY-MM-DD format), and created_by (a string containing information on the source of the input file).

    <project> contains:

    • One <target> element providing basic crystallographic information. <target> contains:

      • An optional <consider_space_group_list> element containing at least one <space_group> element. Each <space_group> contains the number, old, Hall, Hermann-Mauguin or short symbol of a space group to be considered in the solution process. If no space groups are specified, the space group will be read from the input reflection data files and adopted, provided that all files specify the same single space group. If neither source of space group information is valid, HAPPy will exit.
      • A <monomers> element which contains one or more <monomer>s which make up the target.

        A <monomer> has a mandatory name attribute, giving the name of the monomer. It has an optional attribute, type, which is either protein or nucleic_acid and specifies the type of molecule, i.e. protein or DNA/RNA (the default is protein).

        • The number of copies of the monomer in the asymmetric unit can be specified by providing a <number_in_asu> element.

        Each <monomer> provides information on its content through one of the two following options.

        • A <sequence> element. This contains either the protein or DNA/RNA sequence string or is empty and has attributes file and format pointing to a file containing the sequence in the given format. At present format can only be FASTA (and defaults to FASTA). The sequence whose name is specified by the <monomer> name attribute will be used.
        • Or the number of residues/bases in the monomer is given through <number_of_residues> (for protein) or <number_of_bases> (DNA/RNA) and optionally the molecular weight through <weight>.
    • An optional <native> element which describes observations of the native crystal (if present). A <native> is a crystal in the MTZ Project/Crystal/Dataset hierarchy. NB this is NOT the 0.5 * (Fplus+Fminus) of an anomalous dataset. That should be marked up as the F column in a <columns> element of type DANO (see below), which should be in a <dataset> which could itself be in a <native> (e.g. sulphur SAD) or <derivative>. It has one attribute, name, giving the name to be associated with the dataset. <native> contains:

      • At least one <dataset> element specifying all the observations of the native crystal, e.g. the non-anomalous observations of the native crystal in SIR or the peak observations of the native in a sulphur SAD.
    • At least one <derivative> element. Each <derivative> describes the observations of one derivative crystal. Like the <native> element, a <derivative> is also a crystal in the MTZ Project/Crystal/Dataset hierarchy. It has two attributes: name, the name by which the derivative will be known in the project (this must be unique within the project), and atom, the heavy atom name. <derivative> contains:

      • An optional <nsites> element (written as <estimated_nsites> in the example input files below) specifying the estimated number of scatterer sites in the asymmetric unit.
      • An optional <min_separation> element specifying the minimum separation (in Angstroms) to be allowed between scatterers. If unspecified an appropriate default is used.
      • An optional <allow_special_positions> element which if present states that scatterers may be located at special positions.
      • At least one <dataset> element specifying all the observations of this derivative, e.g. the peak and inflection wavelength observations in a MAD experiment.
    • An optional <freer> element with attributes file and label, specifying the MTZ file and column label for the FreeR flag to use. If none is specified then HAPPy will generate its own FreeR flag.

  • The <dataset> elements describe individual observations of a crystal and are equivalent to a dataset in the MTZ Project/Crystal/Dataset hierarchy. There are three attributes: type and file, which are required, and twinning, which is optional. The type of observation, type, can be non-anom (for non-anomalous observations e.g. SIR), or peak, infl, lrem or hrem for the components of anomalous scattering experiments. SAD observations should be of type "peak". The second attribute, file, specifies the name of the file containing the reflection data for this observation. The final attribute, twinning, can be none, twinned, or detwinned, specifying whether the data provided is known to be not twinned, twinned, or successfully detwinned respectively. If the twinning attribute is not present then no knowledge about twinning is assumed.

    Each <dataset> contains:

    • An optional <truncate_info> element which has a file attribute specifying the location of a file containing relevant information from the TRUNCATE logs (format of this file TBD, may be the TRUNCATE log itself for parsing by HAPPy). See TRUNCATE information.
    • An optional <comment> element containing comments on this observation.
    • An optional <wavelength> element specifying the wavelength of the observations. If an inconsistent wavelength is found in the reflection file then HAPPy will exit. If no <wavelength> is specified it will be taken from the reflection file if needed.
    • An optional <scatterer_info> element containing the scattering coefficients, f' and f'', given in elements <f_primed> and <f_double_primed>.
    • At least one <columns> element.
      • The <columns> element specifies the columns in the reflection file corresponding to a particular type of reflection data. The columns type attribute may be F, ANO or DANO.

        A <columns> element of type F specifies a non-anomalous F and its error, and requires attributes F and SIGF giving the appropriate column labels in the reflection file.

        A <columns> element of type ANO specifies anomalous scattering data, and requires the F+ and F- data and their errors, whose column labels are given by the attributes Fplus, SIGFplus, Fminus and SIGFminus.

        A <columns> element of type DANO specifies anomalous average F and difference data, identified with the attributes F, SIGF, DANO and SIGDANO.
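
As an illustration of the three <columns> types described above, the following Python sketch checks that a <columns> element carries the attributes its type requires. The table and the function name missing_column_attrs are hypothetical; this is not the HAPPy parser, and the DTD remains the authoritative definition.

```python
import xml.etree.ElementTree as ET

# Attributes required for each columns type, as listed in the text above.
REQUIRED_COLUMN_ATTRS = {
    "F": ("F", "SIGF"),
    "ANO": ("Fplus", "SIGFplus", "Fminus", "SIGFminus"),
    "DANO": ("F", "SIGF", "DANO", "SIGDANO"),
}


def missing_column_attrs(columns):
    """Return the attribute names a <columns> element lacks for its type."""
    col_type = columns.get("type")
    if col_type not in REQUIRED_COLUMN_ATTRS:
        raise ValueError(f"unknown columns type: {col_type!r}")
    return [
        name for name in REQUIRED_COLUMN_ATTRS[col_type]
        if columns.get(name) is None
    ]


good = ET.fromstring('<columns type="F" F="FP1gxu" SIGF="SIGFP1gxu"/>')
bad = ET.fromstring('<columns type="DANO" F="F_peak" SIGF="SIGF_peak"/>')
print(missing_column_attrs(good))  # []
print(missing_column_attrs(bad))   # ['DANO', 'SIGDANO']
```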

Example input files

NB the first release of HAPPy will only phase in SAD mode.

SAD example
<?xml version='1.0'?> 
<!DOCTYPE project SYSTEM "http://www.ccp4.ac.uk/HAPPy/xml/schemas/project.dtd"> 
 
<project name="gere_sad" phasing_mode="SAD" date_created="2005-06-21" created_by="Dan Rolfe's fingers"> 
   
  <target> 
    <monomers> 
      <monomer name="gere"> 
        <number_of_residues>444</number_of_residues> 
      </monomer> 
    </monomers> 
  </target> 
   
  <derivative name="Se" atom="Se"> 
    
    <known_sites file="gere_sites.pdb" /> 
  
    <estimated_nsites>12</estimated_nsites> 
     
    <dataset type="peak" file="gere_peak.mtz"> 
      <truncate_info file="dummy_peak.log"/> 
      <comment>This is a comment.</comment> 
      <wavelength>0.975</wavelength> 
      <scatterer_info> 
        <f_primed>-4</f_primed> 
        <f_double_primed>4</f_double_primed> 
      </scatterer_info> 
      <columns type="ANO" Fplus="F_peak(+)" SIGFplus="SIGF_peak(+)" 
        Fminus="F_peak(-)" SIGFminus="SIGF_peak(-)"/> 
      <columns type="DANO" F="F_peak" SIGF="SIGF_peak" 
        DANO="DANO_peak" SIGDANO="SIGDANO_peak"/> 
    </dataset> 
        
  </derivative> 
   
  <!--<known_phased_data file="phased_elsewhere.mtz" F="Flabel" SIGF="SIGFlabel" PHIC="PHIClabel" FOM="FOMlabel" />--> 
  <!--<known_model file="the_model.pdb" />--> 
 
</project> 
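
A project file like the one above can be given a quick sanity check with Python's standard xml.etree.ElementTree module before HAPPy is run. The sketch below only checks the required <project> attributes described earlier; the function name check_project_root is hypothetical, and the check does not replace validation against the DTD.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

PHASING_MODES = {"MAD", "SAD", "SIRAS", "MIRAS", "MIR"}
REQUIRED_PROJECT_ATTRS = ("name", "phasing_mode", "date_created", "created_by")


def check_project_root(root):
    """Return a list of problems found on the <project> root element."""
    problems = []
    if root.tag != "project":
        problems.append(f"root element is <{root.tag}>, expected <project>")
    for attr in REQUIRED_PROJECT_ATTRS:
        if root.get(attr) is None:
            problems.append(f"missing attribute: {attr}")
    mode = root.get("phasing_mode")
    if mode is not None and mode not in PHASING_MODES:
        problems.append(f"unknown phasing_mode: {mode}")
    date = root.get("date_created")
    if date is not None:
        try:
            # Must parse as a calendar date written year-month-day.
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems.append(f"date_created is not YYYY-MM-DD: {date}")
    return problems


root = ET.fromstring(
    '<project name="gere_sad" phasing_mode="SAD" '
    'date_created="2005-06-21" created_by="example"/>'
)
print(check_project_root(root))  # []
```

For a file on disk, ET.parse(filename).getroot() gives the same kind of root element.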

SIRAS example
<?xml version='1.0'?> 
<!DOCTYPE project SYSTEM "http://www.ccp4.ac.uk/HAPPy/xml/schemas/project.dtd"> 
 
<project name="1GXU" phasing_mode="SIRAS" date_created="2005-06-16" created_by="Dan Rolfe's fingers"> 
   
  <target> 
 
    <consider_space_group_list> 
      <space_group>H 3 2</space_group> 
    </consider_space_group_list> 
 
    <monomers> 
      <monomer name="1gxu"> 
        <sequence> 
          MAKNTSCGVQ LRIRGKVQGV GFRPFVWQLA QQLNLHGDVC 
          NDGDGVEVRL REDPETFLVQ LYQHCPPLAR IDSVEREPFI 
          WSALPTEFTI R 
        </sequence> 
      </monomer> 
    </monomers> 
 
  </target> 
   
  <native name="1gxu"> 
     
    <dataset type="non-anom" file="hypF-1gxu-1gxt-HG_scaleit1_eleanor.mtz"> 
      <truncate_info file="dummy_nat.log"/> 
      <columns type="F" F="FP1gxu" SIGF="SIGFP1gxu"/> 
    </dataset> 
     
  </native> 
   
  <derivative name="Hg" atom="Hg"> 
     
    <estimated_nsites>1</estimated_nsites> 
     
    <dataset type="non-anom" file="hypF-1gxu-1gxt-HG_scaleit1_eleanor.mtz"> 
      <truncate_info file="dummy_hg.log"/> 
      <columns type="ANO" Fplus="F_Hg(+)" SIGFplus="SIGF_Hg(+)" Fminus="F_Hg(-)" SIGFminus="SIGF_Hg(-)"/> 
    </dataset> 
     
  </derivative> 
   
  <freer file="hypF-1gxu-1gxt-HG_scaleit1_eleanor.mtz" label="FREE"/> 
   
</project> 
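
The SIRAS example lists a space group explicitly. The fallback rule described for <consider_space_group_list> can be sketched as follows, assuming the space group of each reflection file has already been read by some other means. The function name choose_space_groups and the file_space_groups argument are hypothetical, and comparing symbols as plain strings is a simplification, since the same space group can be written in several ways.

```python
import xml.etree.ElementTree as ET


def choose_space_groups(target, file_space_groups):
    """Return the space groups to consider for a <target> element.

    file_space_groups holds the space group read from each reflection
    file (None where a file gives none); reading them is outside this sketch.
    """
    listed = [
        (sg.text or "").strip()
        for sg in target.findall("consider_space_group_list/space_group")
    ]
    listed = [name for name in listed if name]
    if listed:
        return listed
    found = set(file_space_groups)
    if len(found) == 1 and None not in found:
        return list(found)
    # At this point the text above says HAPPy will exit.
    raise ValueError("no valid space group information")


target = ET.fromstring(
    "<target><consider_space_group_list>"
    "<space_group>H 3 2</space_group>"
    "</consider_space_group_list></target>"
)
print(choose_space_groups(target, ["P 1"]))  # ['H 3 2']
```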

MAD example
<?xml version='1.0'?> 
<!DOCTYPE project SYSTEM "http://www.ccp4.ac.uk/HAPPy/xml/schemas/project.dtd"> 
 
<project name="gere" phasing_mode="MAD" date_created="2005-06-21" created_by="Dan Rolfe's fingers"> 
   
  <target> 
    <monomers> 
      <monomer name="gere"> 
        <number_of_residues>444</number_of_residues> 
      </monomer> 
    </monomers> 
  </target> 
   
  <derivative name="Se" atom="Se"> 
     
    <estimated_nsites>12</estimated_nsites> 
     
    <dataset type="peak" file="gere_peak.mtz" twinning="none"> 
      <truncate_info file="dummy_peak.log"/> 
      <comment>This is a comment.</comment> 
      <wavelength>0.975</wavelength> 
      <scatterer_info> 
        <f_primed>-4</f_primed> 
        <f_double_primed>4</f_double_primed> 
      </scatterer_info> 
      <columns type="DANO" F="F_peak" SIGF="SIGF_peak" DANO="DANO_peak" SIGDANO="SIGDANO_peak"/> 
    </dataset> 
     
    <dataset type="infl" file="gere_infl.mtz" twinning="none"> 
      <truncate_info file="dummy_infl.log"/> 
      <comment>This is also a comment.</comment> 
      <wavelength>0.980</wavelength> 
      <scatterer_info> 
        <f_primed>-7</f_primed> 
        <f_double_primed>3</f_double_primed> 
      </scatterer_info> 
      <columns type="DANO" F="F_infl" SIGF="SIGF_infl" DANO="DANO_infl" SIGDANO="SIGDANO_infl"/> 
    </dataset> 
     
  </derivative> 
   
</project> 
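
Each <dataset> in the MAD example gives a <wavelength>. The rule described for <wavelength> (use the XML value, fall back to the reflection file, stop on a mismatch) might look like the sketch below. The function name check_wavelength, the file_wavelength argument and the tolerance value are assumptions made for illustration; the source does not say how closely the two values must agree.

```python
import xml.etree.ElementTree as ET


def check_wavelength(dataset, file_wavelength=None, tolerance=1e-3):
    """Return the wavelength to use for a <dataset>, or raise on a mismatch.

    file_wavelength is the value read from the reflection file, if any.
    tolerance (in Angstroms) is an assumed value, not one taken from HAPPy.
    """
    text = dataset.findtext("wavelength")
    xml_wavelength = float(text) if text and text.strip() else None
    if xml_wavelength is None:
        # No <wavelength> given: fall back to the reflection file value.
        return file_wavelength
    if (file_wavelength is not None
            and abs(xml_wavelength - file_wavelength) > tolerance):
        raise ValueError(
            f"wavelength {xml_wavelength} in XML disagrees with "
            f"{file_wavelength} in the reflection file"
        )
    return xml_wavelength


peak = ET.fromstring(
    '<dataset type="peak" file="gere_peak.mtz">'
    "<wavelength>0.975</wavelength></dataset>"
)
print(check_wavelength(peak, file_wavelength=0.975))  # 0.975
```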
