GenomeHistory 2.1 Documentation

Revision History:

9-29-05: Added a feature (ol=ON) which allows GenomeHistory to print the lengths of duplicate pairs found and the number of aligned residues for the pair.

10-1-13: Version 2.0 of GenomeHistory. New features include:

Use of NCBI BLAST and not WashU BLAST
Removal of the dependency on CLUSTALW
Merging of all c++ functions into the single "gh_module" program
GenomeHistory no longer produces temporary files in the main directory
Local alignments are now possible
It is now possible to include gap characters in the analysis

12-1-17: When searching a list of genes, it is now possible to ignore hits to other genes on that list (il=ON).

11-30-24: Version 2.1 of GenomeHistory. Primarily a bug-fix for newer versions of BLAST so that BLAST hits are not missed by GenomeHistory.

Downloading GenomeHistory 2.1

The current release of GenomeHistory is available from our website.

Installation:

See the INSTALL file included with GenomeHistory for installation instructions.

Citing GenomeHistory:

Conant, G. C. and Wagner, A. (2002) GenomeHistory: A software tool and its application to fully sequenced genomes, Nucleic Acids Research, 30(15): 3378-3386

Running GenomeHistory:

GenomeHistory can be invoked either with command-line options or through the use of a command file. An example command file is included. The command-line options can be obtained by typing % GenomeHistory_2.pl with no options. Possible options are:

Long Option	Short Option	Example	Default
Protein sequences	pi	`=<protein sequences file>`	`orf_trans.fasta`
ORF sequences	ni	`=<nucleotide sequences file>`	`orf_coding.fasta`
Output file	o	`=<output file>`	`genome_ks.txt`
Error file	e	`=<error file>`	`genome_ks.err`
Gene file	gf	`=<Gene name file>`	`NONE`
Gene Name Field	gn	`=<Gene name field #>`	`2`
BLAST matrix location	bl	`=/usr/local/wublast/BLOSUM62`	`NONE`
Only Top BLAST match	tb	`=YES/NO`	`NO`
BLAST threshold	bt	`=0.001`	`0.01`
Minimum ORF Translation length	mo	`=55`	`40`
Minimum Number Aligned Residues	ma	`=45`	`40`
Percent identity threshold	pt	`=0.70`	`0.50`
Output Length Information	ol	`=ON/OFF`	`OFF`
Warnings	w	`=ON/OFF`	`OFF`
Genetic code	gc	`=<U/VM/YM/IM/MM/CN/EM>`	`U (Universal)`
Ordered Gene List	og	`=YES/NO`	`NO`
Base Frequencies	bf	`=<CODON/OBSERVED/EQUAL>`	`CODON`
Ignore List Gene Hits	il	`=ON/OFF`	`OFF`
Checkpoint	cp	`=ON/OFF`	`OFF`
Alignment method	am	`=<GLOBAL/LOCAL>`	`GLOBAL`
Exclude Gaps	eg	`=<YES/NO>`	`YES`

Any combination of options may be used. On the command line, only the short option format is allowed. The command file accepts the same set of options in either format. Note that in both cases the equals sign (=) is required. Options are not case-sensitive (but filenames may be).

Example command-line invocation:

% GenomeHistory_2.pl pi=my_genome.fasta ni=my_genome_nuc.fasta bl=/usr/local/wublast/BLOSUM62

Example command file invocation:

% GenomeHistory_2.pl genome_analysis.txt
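
The name=value convention used by both invocation styles can be illustrated with a short Python sketch. This is not the parser used by GenomeHistory_2.pl; the function name parse_options is hypothetical, and the sketch only shows why the equals sign is required and why option names (but not filenames) can be treated as case-insensitive.

```python
def parse_options(args):
    """Split 'name=value' arguments into a dict keyed by lower-cased option name.

    Hypothetical helper for illustration; not part of GenomeHistory.
    """
    options = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep:
            # Without '=' there is no way to tell the name from the value.
            raise ValueError(f"missing '=' in option: {arg!r}")
        # Option names are not case-sensitive; values such as filenames are kept as typed.
        options[name.strip().lower()] = value.strip()
    return options


opts = parse_options(["BT=0.001", "o=My_Output.txt"])
print(opts)
```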

Generating the example dataset:

GenomeHistory 2 includes a small example dataset (a portion of the yeast genome) that can be used to test your installation of GenomeHistory. From the GenomeHistory_2.1 directory, type (% is your prompt):
% GenomeHistory_2.pl test_analysis.txt
This will create two new files, test_ks.txt and test_ks.err. These files should be identical to the example files provided, example_ks.txt and example_ks.err. You can compare them using the UNIX "diff" command:
% diff test_ks.txt example_ks.txt
% diff test_ks.err example_ks.err

In theory, "diff" should not print anything. However, if you are using a different version of clustalw (not 1.82), there may be (mostly minor) differences.

Notes and Tips:

Locating nucleotide sequences corresponding to protein sequences:

One of the chief difficulties with the sort of genome analysis performed by GenomeHistory is matching nucleotide sequences to protein sequences. By default, GenomeHistory 2 does this using keywords in the FASTA headers (see the Gene Name Field option). Unfortunately, some genomes do not provide identifiers that allow one-to-one matching between protein sequences and their corresponding nucleotide sequences. This problem can be overcome with a second matching method if the genome's ORFs and protein sequences are listed in the same order (meaning protein sequence number 1 is the translation of nucleotide sequence number 1, etc.). This second method is invoked with the Ordered gene list = yes option and simply uses the order in which genes are listed in the two files to identify nucleotide sequences. Microbial genomes seem particularly suited for use with this option, as alternative splicing and gene identification are less of a problem for these genomes.
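
The order-based matching can be sketched in Python as follows. This is an illustration of the idea only, not the code GenomeHistory runs; read_fasta and pair_by_order are hypothetical names, and the check that both files hold the same number of records is an assumption about what a careful implementation would do.

```python
def read_fasta(text):
    """Return (header, sequence) tuples in the order they appear in FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pair_by_order(protein_text, nucleotide_text):
    """Pair the n-th protein record with the n-th nucleotide record (hypothetical helper)."""
    proteins = read_fasta(protein_text)
    nucleotides = read_fasta(nucleotide_text)
    if len(proteins) != len(nucleotides):
        # Order-based matching only makes sense when the two lists line up.
        raise ValueError("files must list the same number of sequences")
    return list(zip(proteins, nucleotides))


pairs = pair_by_order(">p1\nMK\n>p2\nMA\n", ">n1\nATGAAA\n>n2\nATG\nGCT\n")
for (prot_name, _), (nuc_name, _) in pairs:
    print(prot_name, "<->", nuc_name)
```

Note that headers play no role in the pairing here, which is why this approach only works when both files were written in the same gene order.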

Using the Gene Name Field to locate nucleotide sequences:

If you are experiencing numerous errors where GenomeHistory is not able to find nucleotide sequences for genes, you may want to experiment with the "Gene Name Field =" option. Sequences in the nucleotide files may have slightly different names than those in the protein sequence file, but the unique identifiers should match. For instance, the yeast genome stores unique names in field #2, while Arabidopsis uses field 3. Note that this option will only be effective if there is a unique identifier common to both protein and nucleotide sequences (for instance the YXX### format yeast gene identifiers). If you cannot find such an identifier, you may want to experiment with the Ordered gene list = yes option described above.

Runtimes:

The running time of GenomeHistory_2 is in large part dictated by the number of BLAST hits retained for each gene. In general, the lower the BLAST threshold (the smaller the E-value), the fewer gene pairs will be analyzed and the faster the analysis will be. The trade-off, of course, is the potential to miss significant hits. We have found values around E < 0.00001 to be quite conservative for the yeast genome.
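
The effect of the threshold on the number of retained pairs can be seen in a small Python sketch. The hit list and the filter_hits function are invented for illustration, and whether GenomeHistory treats a hit exactly at the threshold as retained is an assumption here.

```python
def filter_hits(hits, threshold):
    """Keep (query, subject, e_value) hits whose E-value is at or below the threshold.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    return [hit for hit in hits if hit[2] <= threshold]


# Invented example hits for one query gene.
hits = [
    ("geneA", "geneB", 1e-30),
    ("geneA", "geneC", 5e-4),
    ("geneA", "geneD", 0.008),
]

for threshold in (0.01, 0.001, 1e-5):
    print(threshold, len(filter_hits(hits, threshold)))
```

Each retained hit leads to a pairwise alignment and a likelihood calculation, so tightening the threshold reduces the work roughly in proportion to the number of hits dropped.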

BLAST in non-default locations:

BLAST expects to be installed at /usr/ncbi/blast. If this is not where you have installed BLAST, you will need to use the "BLAST matrix location" option to locate your installation. For instance: bl=/usr/local/wublast/BLOSUM62

Genetic Codes:

The Genetic code option takes on the following possible values:

U: Universal
VM: Vertebrate mitochondrial
YM: Yeast mitochondrial
MM: Mold mitochondrial
IM: Invertebrate mitochondrial
CN: Ciliate nuclear
EM: Echinoderm mitochondrial

Using the "Only Top BLAST match=YES" option:

When this option is used, K_s and K_a are calculated for each gene and its top BLAST hit, no matter what the E-value of that hit. Thus, the "BLAST threshold" option is meaningless in this context.

Repeated analysis of gene pairs:

Gene pairs could potentially be represented (and analyzed) in two ways: (Gene 1, Gene 2) and (Gene 2, Gene 1). Under normal circumstances, GenomeHistory will only analyze one of these two combinations (the first one that occurs). However, if the option "Only Top BLAST match = YES" is set, or if the option "Gene file = <filename>" is used, both comparisons may be made, since these two options may be used in analyses where it is not desirable to exclude reciprocal comparisons.
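
A minimal Python sketch of this behavior is given below. It is not taken from the GenomeHistory source; unique_pairs and its keep_reciprocal flag are hypothetical names standing in for the default mode and for the two options that allow reciprocal comparisons.

```python
def unique_pairs(hits, keep_reciprocal=False):
    """Return hits in order, skipping pairs that were already seen.

    With keep_reciprocal=False, (g1, g2) and (g2, g1) count as the same pair
    and only the first one encountered is kept. Hypothetical helper.
    """
    seen = set()
    kept = []
    for query, subject in hits:
        # An unordered key makes the two orientations of a pair compare equal.
        key = (query, subject) if keep_reciprocal else frozenset((query, subject))
        if key in seen:
            continue
        seen.add(key)
        kept.append((query, subject))
    return kept


hits = [("g1", "g2"), ("g2", "g1"), ("g1", "g3")]
print(unique_pairs(hits))
print(unique_pairs(hits, keep_reciprocal=True))
```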

Identical Sequences:

Each sequence pair for which K_s/K_a are calculated is first checked to be sure that the two sequences are not identical. If they are, the word IDENTICAL is entered in the output rather than values.

Saturation in K_s values:

If the divergence time between the two genes of a pair is sufficiently long, there may no longer be sufficient information in the two sequences to estimate K_s. Such sequences are said to be saturated. Unfortunately, numerical algorithms for estimating K_s under likelihood cannot directly determine that sequences are saturated. Instead we have implemented a simple test: after the likelihood maximization is complete, the K_s value for the pair is multiplied by 10 and the likelihood is recalculated. If this likelihood is as high as or higher than the original likelihood, the sequence pair has saturated K_s, and that K_s value in the output is marked with an *. Of course, this test does not address the problem of the large variances associated with high, but unsaturated, K_s values. Researchers will want to remember this issue when selecting duplicate pairs for analyses. A forthcoming manuscript will discuss this issue in more detail (Hahn, Conant and Wagner, Journal of Molecular Evolution, in press).
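
The logic of the saturation test can be sketched in Python. The two likelihood curves below are toy placeholders, not codon-model likelihoods, and is_saturated is a hypothetical name; only the comparison itself (likelihood at 10 x K_s versus likelihood at K_s) follows the description above.

```python
def is_saturated(log_likelihood, ks_hat, factor=10.0):
    """Return True if the likelihood at factor * ks_hat is at least as high as at ks_hat.

    log_likelihood is any function of K_s; here it stands in for the
    recalculation performed after the likelihood maximization.
    """
    return log_likelihood(factor * ks_hat) >= log_likelihood(ks_hat)


def peaked(ks):
    # Toy curve with a clear maximum at ks = 0.5 (an informative pair).
    return -(ks - 0.5) ** 2


def plateau(ks):
    # Toy curve that stops changing once ks exceeds 3.0 (no information left).
    return min(ks, 3.0)


print(is_saturated(peaked, 0.5))
print(is_saturated(plateau, 4.0))
```

For the peaked curve, moving tenfold away from the estimate lowers the likelihood, so the pair is not flagged. For the plateau curve the likelihood does not drop, which is the signature the test looks for.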

Checkpointing:

When the "Checkpoint" option is ON, GenomeHistory will look for a file called .GenomeHistory.chkpnt.(argfile) (where argfile is the command file) or .GenomeHistory.chkpnt.cmdline (if invoking from the command line). If this file is found, GenomeHistory will restart from the point listed in this file. This option is useful for stopping jobs and restarting them at a later time. For checkpointing to work properly, you must specify the same output and error filenames on the restart as were originally used.
Note that the checkpoint file is created whether or not Checkpointing is turned on.
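
The checkpoint file naming described above can be written out as a short Python sketch. The function names are hypothetical, and the assumption that the checkpoint file sits in the directory given by the directory argument (the working directory by default) is ours; the documentation above does not say where the file is written.

```python
import os


def checkpoint_name(command_file=None):
    """Build the checkpoint file name for a command file or a command-line run."""
    suffix = "cmdline" if command_file is None else command_file
    return ".GenomeHistory.chkpnt." + suffix


def can_restart(checkpoint_on, command_file=None, directory="."):
    """True only if checkpointing is ON and a checkpoint file exists.

    The location of the file (directory) is an assumption of this sketch.
    """
    path = os.path.join(directory, checkpoint_name(command_file))
    return checkpoint_on and os.path.exists(path)


print(checkpoint_name("genome_analysis.txt"))
print(checkpoint_name())
```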

File names:

You can use any combination of absolute and relative paths when specifying filenames to GenomeHistory.

Length information option:

This option will cause GenomeHistory to print the length of each pair of duplicate genes found and the number of non-gap aligned residues for the pair. The option is useful if you wish to filter for partial duplicates.
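
As an illustration of how the length information might be used to filter partial duplicates, consider the Python sketch below. The function names, the choice of the longer sequence as the reference, and the 0.8 cutoff are all assumptions made for the example; they are not GenomeHistory settings, and the layout of the length columns in the output file is not assumed here.

```python
def aligned_fraction(len1, len2, aligned):
    """Fraction of the longer sequence covered by non-gap aligned residues."""
    return aligned / max(len1, len2)


def is_partial(len1, len2, aligned, min_fraction=0.8):
    """Flag a pair as a partial duplicate when coverage falls below min_fraction.

    The 0.8 default is an arbitrary illustrative cutoff.
    """
    return aligned_fraction(len1, len2, aligned) < min_fraction


# Invented examples: (length of gene 1, length of gene 2, aligned residues).
print(is_partial(300, 310, 295))
print(is_partial(300, 900, 280))
```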

Warnings option:

By default, any gene pairs that do not meet the three analysis criteria after pairwise alignment are discarded without further warning. By specifying "Warnings=ON", you can output WARNING entries in the error file that indicate why a particular pair was discarded. This option, however, has the potential to make the error file very large.

Base Frequencies:

By default, GenomeHistory uses the observed base frequencies at each codon position in the calculation of Ks and Ka (in other words, the nucleotide composition at each codon position is averaged between the two sequences). In some cases, it may be preferable to use the overall average base composition (OBSERVED), which does not break down the composition by codon position. It is also possible to use equal frequencies for all bases (EQUAL).
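
The three settings can be compared with a small Python sketch. This is an illustration under stated assumptions, not the gh_module implementation: the function names are hypothetical, the two sequences are assumed to be in frame and gap-free, and pooling the counts of both sequences is used as the averaging step (equivalent to averaging the two sequences' frequencies when they have the same length).

```python
from collections import Counter

BASES = "ACGT"


def frequencies(text):
    """Relative frequency of each base in text, ignoring any other characters."""
    counts = Counter(base for base in text.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases to count")
    return {base: counts[base] / total for base in BASES}


def base_frequencies(seq1, seq2, mode="CODON"):
    """Return one frequency table per codon position (a list of three dicts)."""
    mode = mode.upper()
    if mode == "EQUAL":
        return [{base: 0.25 for base in BASES} for _ in range(3)]
    if mode == "OBSERVED":
        # One overall composition, reused for all three codon positions.
        overall = frequencies(seq1 + seq2)
        return [dict(overall) for _ in range(3)]
    if mode == "CODON":
        # seq[pos::3] picks every third base starting at codon position pos.
        return [frequencies(seq1[pos::3] + seq2[pos::3]) for pos in range(3)]
    raise ValueError(f"unknown mode: {mode}")


for mode in ("CODON", "OBSERVED", "EQUAL"):
    print(mode, base_frequencies("ATGAAA", "ATGCCC", mode)[0])
```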

Gaps:

By default, GenomeHistory removes all gap positions in alignments before calculating K_s and K_a. This is done because gap characters tend to bias estimates of sequence divergence downward (Yang, Z., 2000, Mol. Bio. Evol. 51:423-432). Use eg=NO to turn off this feature.
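
Gap removal on a pairwise codon alignment can be sketched in Python as follows. The function name strip_gap_codons is hypothetical, and dropping whole codons (rather than single columns) is an assumption of this sketch, chosen so that the reading frame is preserved.

```python
def strip_gap_codons(aln1, aln2, gap="-"):
    """Drop every codon column in which either aligned sequence contains a gap."""
    if len(aln1) != len(aln2) or len(aln1) % 3 != 0:
        raise ValueError("alignments must have equal length, a multiple of three")
    out1, out2 = [], []
    for start in range(0, len(aln1), 3):
        codon1 = aln1[start:start + 3]
        codon2 = aln2[start:start + 3]
        if gap in codon1 or gap in codon2:
            continue  # skip the whole codon to keep both sequences in frame
        out1.append(codon1)
        out2.append(codon2)
    return "".join(out1), "".join(out2)


print(strip_gap_codons("ATG---AAA", "ATGCCCAAG"))
```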

Alignment Matrices:

By default, both BLAST and gh_module use the BLOSUM62 amino acid similarity matrix for sequence alignments. The BLAST matrix can be changed with the "BLAST matrix location= " option, provided you have another alignment matrix in the appropriate format. To use a different matrix in gh_module, contact the authors.

Comments on basic installation process:

To build and use the package, you will need a c and a c++ compiler (preferably the GNU gcc and g++) and
the NCBI implementation of BLAST. The make command will build liblapack
(a partial lapack library), libf2c (because lapack was originally built in Fortran) and
gh_module (which handles the alignment, % difference and Ka/Ks computation).

You will need to add the GenomeHistory_2.1 directory to your path. You can then run the program
by typing GenomeHistory_2.pl <filename> where <filename> is an input file with the parameters for the run.
An example input file with descriptions of each command is included at GenomeHistory_2.1/test_analysis.txt

Other platforms:

This package was developed and tested on a Mac OS X (10.5) system with the gcc compilers. It should work on
other platforms, but has not been tested on them. Use configure.pl or edit the file GenomeHistory_2.1/make.inc to
change the name of your c and c++ compiler.

Ignore within gene-list hits when searching:

It can be desirable to define a list of genes to search with that is a subset of the total list (Gene File=) but not to allow hits between pairs of genes on this list. For instance, when comparing two genomes to find orthologs, one can define a gene list consisting of Genome #1 and omit hits between pairs of genes in that genome with the Ignore List Gene Hits=ON option (added 12/1/17).

Bug History:

7-29-02: Fixed a bug that caused incorrect nucleotide sequences to be returned in genomes (such as Drosophila) where gene names can be substrings of each other (e.g. CT1187 and CT11871). Pairs with this problem were still analyzed correctly, but also generated an error message.

10-27-03: Fixed a bug that resulted in some gene pairs where K_s = 0 (reported as K_s <10^-4) being reported as having saturated in synonymous substitutions.

01-07-04: Fixed a bug preventing the "Ordered Gene list=YES" option from working properly when a "Gene File" was provided.

10-08-04: Fixed a bug which caused GenomeHistory to fail to create nucleotide alignments for calculating K_s and K_a when compiled with gcc 3.0.

10-17-04: Fixed a bug in the GenomeHistory perl script which prevented the analysis of gene pairs with E values of 0.0 when using some versions of WU-BLAST. Special thanks to Luke Hakes and Ignazio Carbone for help finding these last two bugs.

10-25-07: Fixed a bug which resulted in the nonstandard genetic codes not working properly. Also added the Mycoplasma genetic code to the code options.

08-22-13: Fixed a bug which resulted in the "top BLAST hit" option not working with NCBI BLAST. Thanks to Seanna McTaggart for finding this one.

03-21-16: Fixed a bug which resulted in identical nucleotide sequences having Ks reported as "NA".

11-5-18: Added a capacity to use the newer blastp program from NCBI rather than the older blastall -p blastp.

12-20-23: Fixed a bug where genes in a gene list that were not in the sequence file generated spurious matches.

Included c++ packages:

The actual computational work of GenomeHistory is performed by traditional c++ executables. In addition to BLAST, this involves the gh_module program:

gh_module: This program performs a protein alignment and calculates its %ID. Then, if that %ID passes the threshold, it computes the nucleotide alignment and Ka and Ks.
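
The threshold check on a protein alignment can be illustrated with the Python sketch below. It is not the gh_module code (which is c++); the function names are hypothetical, and computing identity over non-gap aligned columns is an assumption. The defaults of 40 aligned residues and 0.50 identity are taken from the option table above.

```python
def alignment_stats(aln1, aln2, gap="-"):
    """Return (number of non-gap aligned columns, fraction of those that are identical)."""
    aligned = identical = 0
    for res1, res2 in zip(aln1, aln2):
        if res1 == gap or res2 == gap:
            continue
        aligned += 1
        if res1 == res2:
            identical += 1
    identity = identical / aligned if aligned else 0.0
    return aligned, identity


def passes_thresholds(aln1, aln2, min_aligned=40, min_identity=0.50):
    """Apply the 'ma' and 'pt' style cutoffs to one pairwise protein alignment."""
    aligned, identity = alignment_stats(aln1, aln2)
    return aligned >= min_aligned and identity >= min_identity


print(alignment_stats("MKT-LV", "MKSALV"))
print(passes_thresholds("MKT-LV", "MKSALV", min_aligned=5))
```

Only pairs that pass this stage would go on to the nucleotide alignment and the Ka and Ks calculation, which is the more expensive part of the analysis.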

Structure of the c++ codes (Advanced users):

The c++ program gh_module uses a common set of classes. Of these, some of the more important are:

Exchange: By analogy to a telephone exchange, this class is the central repository of global parameters that all of the other classes need to know about. It mostly consists of public variables and interface functions to those variables. Improper settings in this class can give strange results in programs. For instance, it is important to set the model of evolution in the exchange, as certain Boolean flags must be set correctly for codon-based models to work.
Sequence_dataset: A simple data structure class that holds protein or nucleotide sequences. It is primarily intended for aligned sequences (so that each sequence is of the same length).
Read_Sequence: This is actually the base class for a group of classes designed to read various sequence file formats. Currently supported formats are:

PIR
FASTA
PAUP NEXUS
PHYLIP (Interleaved only)

Like_model: This is the base class for the likelihood calculations. Models of evolution are implemented as derived classes. All models are multiply inherited, with one side of the inheritance determining whether the model is nucleotide based or codon-based and the other side describing the model of nucleotide substitution (for instance, Kimura 2-parameter).
Tree: This class implements a bifurcating tree data structure used for phylogenetic trees. In GenomeHistory, the only tree used is a degenerate two-tip tree.
Genetic_Code: This is the base class that implements different genetic codes. To be compatible with some of my earlier codes, the base class Genetic_code functions equivalently to the derived Univ_genetic_code.