Methods

This section summarizes the methods used to produce alignments and to identify, score, and analyze SNPs in TcSNP.

Data Sources
The reference genome sequence of the CL Brener strain of T. cruzi (TcVI) was obtained from GenBank using the umbrella accession number AAHK00000000 (July, 2005) [1]. The draft genome sequence of the Sylvio X10 strain (TcI) was obtained from GenBank using the umbrella accession number ADWP00000000 [2]. The transcriptome of the Adriana strain (TcI) was kindly provided by Martín Vazquez (INDEAR, Argentina). The draft genomic sequences of the JR cl4 (TcI) and Esmeraldo cl3 (TcII) strains were obtained from the TriTrypDB resource, and were kindly provided by Dr. Gregory Buck (Virginia Commonwealth University) and The Genome Center, Washington University School of Medicine. Other T. cruzi sequences (mRNAs, ESTs) were also obtained from GenBank using custom Entrez queries (May, 2007). Before loading the sequences into the database, we curated T. cruzi strain names to standardize them, because different authors write the same strain name in different ways in GenBank submissions and publications. Some redundancy is still present, and strain names can be curated further. We encourage users of the database to send feedback about this issue.
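One simple way to spot spelling variants of the same strain name is sketched below. This is only an illustration and not the curation procedure used for TcSNP: the functions `strain_key` and `group_strain_names` are hypothetical, and the rule (ignore case, whitespace, and punctuation) is an assumption that will not catch every variant.

```python
import re
from collections import defaultdict


def strain_key(name):
    """Collapse case, whitespace and punctuation so spelling variants share a key."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def group_strain_names(names):
    """Group strain names that reduce to the same key; variants are sorted."""
    groups = defaultdict(set)
    for name in names:
        groups[strain_key(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items()}
```

Groups containing more than one variant are candidates for manual review. Names that differ in substance (for example, an added clone designation) keep separate keys and still need a curator's decision.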
Clustering and aligning sequences
Before clustering, all sequences were masked against a library of vector sequences and T. cruzi repetitive elements, as described previously [3]. Annotated coding sequences from the reference genome and other publicly available sequences were mapped against the genome scaffolds using BLAT. Sequences mapping to the same genomic region were clustered together, and multiple sequence alignments were obtained using phrap. However, because allelic variants in the CL Brener genome were separated during assembly, these initial clusters often placed allelic variants of the same locus in different alignments. To obtain alignments between allelic variants, we merged alignments whose consensus sequences were highly similar (as determined by BLAST analysis). Afterwards, based on user feedback, we also merged, split, and re-analyzed many alignments. This manual curation effort focused mainly on single-copy genes. Users of TcSNP should also be aware that many sequences from the CL Brener genome assembly may represent assembly artifacts. In the database, we have attached notes to this effect to the corresponding alignments to help users interpret the SNP data.
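The step of clustering sequences that map to the same genomic region can be illustrated with a minimal sketch. It assumes that each mapping has already been reduced to a tuple of (sequence ID, scaffold, start, end) with closed coordinates, and that "same region" means overlapping intervals on the same scaffold. Both the input layout and the function name `cluster_by_overlap` are assumptions made for this example, not a description of the actual TcSNP pipeline.

```python
from collections import defaultdict


def cluster_by_overlap(hits):
    """Group sequence IDs whose mapped intervals overlap on the same scaffold.

    hits: iterable of (seq_id, scaffold, start, end) with start <= end.
    Returns a list of clusters, each a list of sequence IDs.
    """
    by_scaffold = defaultdict(list)
    for seq_id, scaffold, start, end in hits:
        by_scaffold[scaffold].append((start, end, seq_id))

    clusters = []
    for scaffold in sorted(by_scaffold):
        current, current_end = [], None
        # Sweep intervals from left to right, extending the open cluster
        # while the next interval starts at or before its right edge.
        for start, end, seq_id in sorted(by_scaffold[scaffold]):
            if current and start > current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(seq_id)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters
```

Each resulting cluster would then be passed to an assembler or aligner to build the multiple sequence alignment. Note that allelic variants placed on different scaffolds end up in different clusters under this scheme, which is why a later merging step based on consensus similarity is needed.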
Candidate SNP identification and analysis
Multiple sequence alignments were scanned to identify polymorphic columns. To calculate the probability that these sites are true polymorphisms rather than sequencing errors, we used the software package PolyBayes, version 5 [4]. PolyBayes uses a Bayesian statistical framework that relies on allele frequency, alignment depth, and base quality values, among other attributes, to calculate a probability score. Because chromatogram trace data are not available for many of the sequences in this release, we devised a scoring strategy that uses arbitrary base quality values, which depend on the sequence origin/type. Bases from the T. cruzi CL Brener genome (~19X shotgun coverage) were assigned a base quality value of 40; those from GenBank records (individual submissions), a value of 30; and those from dbEST (single-pass, unedited), a value of 20. Using this scoring scheme, a single base from an EST that differs from two allelic variants of the CL Brener reference sequence (depth = 3) would have a probability of 0.22 of being a true SNP (see for example SNP 4028216). To analyze the effect of each SNP on the corresponding protein product, we annotated the codon position of the SNP in each reference coding sequence and evaluated the change introduced by the polymorphic base. In addition, for a subset of the alignments (those containing coding sequences of similar length, with indels whose length is a multiple of 3) we calculated dN and dS values [5] using BioPerl's population genetics modules [6].
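The sketch below illustrates three of the steps described above: assigning base qualities by sequence origin, scanning an alignment for polymorphic columns, and evaluating the codon change introduced by a polymorphic base. It does not reproduce PolyBayes and computes no probabilities. The quality values (40, 30, 20) come from the text; the dictionary keys, the function names, the use of the standard genetic code, and the assumption of an in-frame coding sequence with complete codons are choices made for this example.

```python
# Arbitrary base qualities by sequence origin, as stated in the text.
# The dictionary keys are hypothetical labels chosen for this sketch.
BASE_QUALITY = {"genome": 40, "genbank": 30, "est": 20}

# Standard genetic code (an assumption of this sketch), built in TCAG order.
_BASES = "TCAG"
_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AAS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def polymorphic_columns(alignment):
    """Return 0-based indices of columns with more than one distinct base.

    alignment: list of equal-length strings; gaps ('-') and any character
    other than A, C, G, T are ignored when comparing bases.
    """
    columns = []
    for index, column in enumerate(zip(*alignment)):
        observed = {base.upper() for base in column} & set("ACGT")
        if len(observed) > 1:
            columns.append(index)
    return columns


def codon_effect(cds, position, alt_base):
    """Describe the effect of substituting alt_base at a 0-based CDS position.

    Returns (codon_position, ref_aa, alt_aa, label), where codon_position is
    1, 2 or 3 and label is 'synonymous' or 'nonsynonymous'. The CDS is assumed
    to be in frame, starting at the first base of a codon.
    """
    cds = cds.upper()
    offset = position % 3
    start = position - offset
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return offset + 1, ref_aa, alt_aa, label
```

For example, `polymorphic_columns(["ACGT", "ACGT", "ATGT"])` flags column 1, and `codon_effect("ATGGCTTAA", 3, "A")` reports a change at the first codon position from alanine to threonine. In a real pipeline, the per-base qualities from `BASE_QUALITY` would be supplied to the SNP-scoring software along with the alignment.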

[1] El-Sayed, N., Myler, P., Bartholomeu, D. C., et al. (2005) The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science, 309(5733), 409–415.
[2] Franzén, O., Ochaya, S., Sherwood, E., Lewis, M. D., Llewellyn, M. S., Miles, M. A., and Andersson, B. (2011) Shotgun Sequencing Analysis of Trypanosoma cruzi I Sylvio X10/1 and Comparison with T. cruzi VI CL Brener. PLoS Negl Trop Dis, 5(3), e984.
[3] Aguero, F., Verdun, R. E., Frasch, A. C., and Sanchez, D. O. (Dec, 2000) A random sequencing approach for the analysis of the Trypanosoma cruzi genome: general structure, large gene and repetitive DNA families, and gene discovery. Genome Res, 10(12), 1996–2005.
[4] Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., Stitziel, N. O., Hillier, L., Kwok, P. Y., and Gish, W. R. (Dec, 1999) A general approach to single-nucleotide polymorphism discovery. Nat Genet, 23(4), 452–456.
[5] Hartl, D. L. and Clark, A. G. (2007) Principles of population genetics, Sinauer Associates, Inc., Sunderland, MA, USA 4th edition.
[6] Stajich, J. E. and Hahn, M. W. (Jan, 2005) Disentangling the effects of demography and selection in human history. Mol Biol Evol, 22(1), 63–73.