GenomeHistory 2.0 Documentation

Revision History:

9-29-05:  Added a feature (ol=ON) which allows GenomeHistory to print the lengths of duplicate pairs found and the number of aligned residues for the pair.

10-1-13:  Version 2.0 of GenomeHistory. New features include: 12-1-17:  When searching a list of genes, it is now possible to ignore hits to other genes on that list (il=ON).

Downloading GenomeHistory 2.0

The current release of GenomeHistory is avaliable from our website


    See the INSTALL file included with GenomeHistory for installation instructions.

Citing GenomeHistory:

Conant, G. C. and Wagner, A. (2002) GenomeHistory: A software tool and its application to fully sequenced genomes, Nucleic Acids Research, 30(15): 3378-3386 [Reprint]

Running GenomeHistory:

 GenomeHistory can be invoked either with command-line options or through the use of a command file.  An example command file is included.  The command-line options can be obtained by typing % with no options.  Possible options are:

Long Option


Short Option



Protein sequences pi =<protein sequences file> orf_trans.fasta
ORF sequences ni =<nucleotide sequences file> orf_coding.fasta
Output file o =<output file> genome_ks.txt
Error file e =<error file>  genome_ks.err
Gene file gf =<Gene name file> NONE
Gene Name Field gn =<Gene name field #> 2
BLAST matrix location bl = /usr/local/wublast/BLOSUM62 NONE
Only Top BLAST match tb =YES/NO NO
BLAST threshold bt =0.001 0.01
Minimum ORF Translation length mo =55 40
Minimum Number Aligned Residues ma =45 40
Percent identity threshold pt =0.70 0.50
Output Length Information ol =ON/OFF OFF
Warnings w =ON/OFF OFF
Genetic code gc =<U/VM/YM/IM/MM/CN/EM> U (Universal)
Ordered Gene List og =YES/NO NO
Ignore List Gene Hits il =ON/OFF OFF
Checkpoint cp =ON/OFF OFF
Alignment method am =<GLOBAL/LOCAL> GLOBAL
Exclude Gaps eg =<YES/NO> YES

Any combinations of options may be used.  When used on the command line, only the short option format is allowed.  The command file interprets the same set of options, but allows either format.  Note that in both cases the equal (=) sign is required.  Options are not case-sensitive (but filenames may be).

Example command-line invocation:

 % pa=my_genome.fasta na=my_genome_nuc.fasta bl=/usr/local/wublast/BLOSUM62

Example command file invocation:

% genome_analysis.txt

Generating the example dataset:

GenomeHistory 2 includes a small example dataset (a portion of the yeast genome) that can be used to test your installation of GenomeHistory.  From the GenomeHistory_2.0 directory, type (% is your prompt)
% test_analysis.txt
This will create two new files, test_ks.txt and test_ks.err.  These files should be identical to the example files provided, example_ks.txt and example_ks.err.  You can compare them using the UNIX "diff"command:
% diff test_ks.txt example_ks.txt
% diff test_ks.err example_ks.err

In theory, "diff" should not print anything. However, if you are using a different version of clustalw (not 1.82), there may be (mostly minor) differences.

Notes and Tips:

Locating nucleotide sequences corresponding to protein sequences:

One of the chief difficulties with the sort of genome analysis performed by GenomeHistory is matching nucleotide sequences to protein sequences.  By default, GenomeHistory 2 does this using keywords in the FASTA header files (see the Gene Name Field option). Unfortuntately, some genomes do not provide identifiers that allow one to one matching between protein sequences and their corresponding nucleotide sequences.   This problem can be overcome with a second matching method if the genome's ORFs and protein sequences are listed in a one to one order (meaning protein sequence number 1 is the translation of nucleotide sequence number 1 etc.).    This second method  is invoked with the Ordered gene list = yes option and simply uses the order that genes are listed in the two files to identify nucleotide sequences.  Microbial genomes seem particularly suited for use with this option, as alternative splicing and gene identification are less of a problem for these genomes.

Using the Genename field to locate nucleotide sequences:

If you are experiencing numerous errors where GenomeHistory is not able to find nucleotide sequences for genes, you may want to experiment with the "Gene Name Field =" option.  Sequences in the nucleotide files may have slightly different names than those in the protein sequence file, but the unquie identifiers should match.   For instance, the yeast genome stores unique names in field #2, while Arabidopsis uses field 3.  Note that this option will only be effective if there is a unquie identifier common to both protein and nucleotide sequences (for instance the YXX### format yeast gene identifiers).  If you cannot find such an identifier, you may want to experiment with the Ordered gene list = yes option described above.


The running time of GenomeHistory_2 is in large part dictated by the number of BLAST hits retained for each gene.  In general, the lower (smaller E-value) the BLAST threshold is to, the fewer gene pairs will be analyzed and the faster the analysis will be.  The trade-off, of course, is the potential to miss significant hits.  We have found values around E < 0.00001 to be quite conservative for the yeast genome.

BLAST in non-default locations:

BLAST expects to be installed at /usr/ncbi/blast.  If this is not where you have installed BLAST, you will need to use the "BLAST matrix location" option to locate your installation.  For instance bl=/usr/local/wublast/BLOSUM62

Genetic Codes:

The Genetic code option takes on the following possible values:

Using the "Only Top BLAST match=YES" option:

When this option is used, Ks and Ka is calculated for each gene and its top BLAST hit, no matter what the E-value of that hit.  Thus, the "BLAST threshold" option is meaningless in this context.

Repeated analysis of gene pairs:

Gene pairs could potentially represented (and analyzed) in two ways (Gene 1, Gene 2) and (Gene 2, Gene 1).  Under normal circumstances, GenomeHistory will only analyze one of these two combinations (the first one that occurs).  However if the option "Only Top BLAST match = YES" is set, or if the option "Gene file = <filename>" is used, both comparisions may be made, since these two options may be used in analyses where it is not desirable to exclude reciprocal comparisions.

Identical Sequences:

Each sequence pair for which Ks/Ka are calculated is first checked to be sure that the two sequences are not identical.  If they are, the word IDENTICAL is entered in the output rather than values.

Saturation in Ks values:

If the divergence time between two gene pairs is sufficiently long, there may no longer be sufficient information in the two sequences to estimate Ks.  Such sequences are said to be saturated.  Unfortunately, numerical algorithms for estimating Ks under likelihood cannot directly determine that sequences are saturated.  Instead we have implemented a simple test:  After the likelihood maximization is complete, the Ks value for the pair is multiplied by 10 and the likelihood is recalculated.  If this likelihood is as high or higher than the original likelihood, that indicates that the sequence pair has saturated Ks, and that Ks value in the output is marked with an *.  Of course, this test does not address the problem of the large variances associated with high, but unsaturated, Ks values.  Researchers will want to remember this issue when selecting duplicate pairs for analyses.  A forthcoming manscript will discuss this issue in more detail (Hahn, Conant and Wagner, Journal of Molecular Evolution, in press)


 When the "Checkpoint" option is ON, GenomeHistory will look for a file called .GenomeHistory.chkpnt.(argfile) (where argfile is the command file) or .GenomeHistory.chkpnt.cmdline (if invoking from the command line).  If this file is found, GenomeHistory will restart from the point listed in this file.  This option is useful for stopping jobs and restarting them at a later time.    For checkpointing to work properly, you must specify the same output and error filenames on the restart as were originally used.
Note that the checkpoint file is created whether or not Checkpointing is turned on.

File names:

 You can use any combination of absolute and relative paths when specifying filenames to GenomeHistory.

Length information option:

 This option will cause GenomeHistory to print the length of each pair of duplicate genes found and the number of non-gap aligned residues for the pair. The option is useful if you wish to filter for partial duplicates.

Warnings option:

By default, any gene pair that do not meet the three analysis criteria after pairwise alignment are discarded without further warnings.  By specifying "Warnings=on", you can output WARNING entries in the error file that indicate why a particular pair was discarded.  This option, however, has the potential to make the error file very large.

Base Frequencies:

By default, GenomeHistory uses the observed base frequencies at each codon position in the calculation of Ks and Ka (in other words, the nucleotide composition at each codon position is averaged between the two sequences).  In some cases, it may be preferable to use the over-all average base composition (OBSERVED) which does not break down the composition by codon position.  It is also possible to use equal frequencies for all bases (EQUAL).


By default, GenomeHistory removes all gap positions in alignments before calculating Ks and Ka.   This is done because gap characters can tend to downwardly bias estimates of sequence diveragance (Yang, Z., 2000, Mol. Bio. Evol. 51:423-432).  Use eg=NO to turn off this feature.

Alignment Matrices:

By default, both BLAST and gh_module use the BLOSUM62 amino acid similarity matrix for sequence alignments.  The BLAST matrix can be changed with the "BLAST matrix location= " option, provided you have another alignment matrix in the appropriate format.  To use a different matrix in gh_module, contact the authors.

Comments on basic installation process:

To build and use the package, you will need a c and a c++ compiler (preferably the GNu gcc and g++) and
the NCBI implementation of BLAST.  The make command will build liblapack
(a partial lapack library), libf2c (because lapack was originally built in Fortran) and
gh_module (which handles the alignment, % difference and Ka/Ks computation)

You will need to add the GenomeHistory_2.0 directory to your path.  You can then run the program
by typing <filename> where <filename> is a input file with the parameters for the run.
An example input file with descriptions of each command is included at GenomeHistory_2.0/test_analysis.txt

Other platforms:

This package was developed and tested on a Mac OS X (10.5) system with the gcc compilers.  It should work on
other platforms, but has not been tested on them.  Use or edit the file GenomeHistory_2.0/ to
change the name of your c and c++ compiler.

Ignore within gene-list hits when searching

It can be desirable to define a list of genes to search with that is a subset of the total list (Gene File=) but not to allow hits between pairs of genes on this list. For instance, when comparing two genomes to find orthologs, one can define a gene list consisting of Genome #1 and omit hits between pairs of genes in that genome with the with the Ignore List Gene Hits=ON option (added 12/1/17).

Bug History:

7-29-02:  Fixed a bug that caused incorrect nucleotide sequences to be returned in genomes (such as Drosophila) where gene names can be substrings of each other (e.g. CT1187 and CT11871).  Pairs with this problem were still analyzed correctly, but also generated an error message.

10-27-03: Fixed a bug that resulted in some gene pairs where Ks = 0 (reported as Ks <10-4) being reported as having saturated in synonymous substitutions.

01-07-04: Fixed a bug preventing the "Ordered Gene list=YES" option from working properly when a "Gene File" was provided.

10-08-04: Fixed a bug which caused GenomeHistory to fail to create nucleotide alignments for calculating Ks and Ka when compiled with gcc 3.0.

10-17-04: Fixed a bug in the GenomeHistory perl script which prevented the analysis of gene pairs with E values of 0.0 when using some versions of WU-BLAST. Special thanks to Luke Hakes and Ignazio Carbone for help finding these last two bugs.

10-25-07: Fixed a bug which resulted in the nonstandard genetic codes not working properly. Also added the Mycoplasma genetic code to the code options

08-22-13: Fixed a bug which resulted in the "top BLAST hit" option not working with NCBI BLAST. Thanks for Seanna McTaggart for finding this one

03-21-16: Fixed a bug which resulted in identical nucleotide sequences having Ks reported as "NA"

11-5-18: Added a capacity to use the newer blastp program from NCBI rather than the older blastall -p blastp

Included c++ packages:

The actual computational work of GenomeHistory is performed by traditional c++ executables.  In addition to BLAST, this involves the gh_module program:

Structure of the c++ codes (Advanced users):

    The c++ program gh_module uses a common set of classes.  Of these, some of the more important are: