Clustering reduces data redundancy, improves base-call accuracy and transcript length, and can be used to determine gene representation within the library. NemaGene is a collection of all transcript assembly contigs (both Sanger- and 454-based) produced at The Genome Institute.
Most access to the NemaGene database comes from other tools within the Nematode.net site, such as the contig links from NemaPath, which jump directly to the contig details pages that are the terminus of a NemaGene search. The NemaGene cluster search form can also be useful when you have identified a contig, isotig or gene from another Nematode.net resource (e.g. a pan-phylum NemaBLAST result or a cluster of interest from our FTP service) and want more detail on that sequence entity. Another common use is identifying a stage-specific set of isotigs/contigs for a given organism using the 'Stage' search selection.
Sanger EST Clustering
For details on our clustering method for Sanger ESTs, see McCarter JP, Dautova Mitreva M, Martin J, Dante T, Wylie T, Rao U, Pape D, Bowers Y, Theising B, Murphy C, Kloek AP, Chiapelli BJ, Clifton SW, Bird MD and Waterston RH (2003). Analysis and Functional Classification of Transcripts from the nematode Meloidogyne incognita. Genome Biology, 4: R26: 1-19.
454 cDNA Clustering
Clustering of cDNA pyrosequencing reads was done using the Newbler transcriptome assembler, pre-release version 2.5. The assembler uses the overlap-layout-consensus approach to build splicing graphs that can assemble alternatively spliced transcripts (or 'isotigs'). The parameters used were '-cdna -ml 100 -mi 95 -icl 30 -het', for 95% minimum identity over a length of 100 bp, with a minimum contig length of 30 to build isotigs.
NemaGene can be searched by stage, isogroup/cluster name, gene/isotig/contig name or cDNA read name on a per-species basis. Enter your search term in the box labeled Enter search below:. Then select the appropriate settings from the Search Type: and Species Database: drop-down menus. Be aware that if you do not set these menus, your search term will most likely not be found. Click on the Search Clusters button to begin your search. All isotigs/contigs of the requested type for the selected species will be displayed. Searches that return long lists of isotigs/contigs may take a long time to display.
NemaGene entries are annotated with InterPro ids (IPR), Gene Ontology terms (GO), KEGG Orthology identifiers (KO) and putative ChEMBL drug target ids. Associations to IPR and GO ids are made using InterProScan version 4.8 (running on InterPro version 32.0). KO annotations are assigned using a default WU-BLAST v2.0 alignment against the KEGG gene database (release 68.0). Putative ChEMBL drug targets are assigned using a WU-BLAST v2.0 alignment, reporting all hits to the ChEMBL database (release 18) that meet a cutoff of 40% identity over 75% of the length of the query (which is the nematode gene). Additionally, genes may be annotated as putative chokepoint enzymes. These are genes annotated with a KO that maps to a chokepoint enzyme in the KEGG v70 reaction database. A chokepoint enzyme catalyzes a chokepoint reaction, which is defined as a reaction that produces or consumes a unique compound. Genes annotated as chokepoints may prove to be effective drug targets, given that blocking them may lead to over-abundances or shortages of unique substrates.
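The two rule-based annotations above can be illustrated with a short Python sketch. It is not the pipeline used for NemaGene. The function names and input layout are hypothetical, the cutoffs are treated as inclusive (an assumption), and "unique compound" is interpreted as a compound that is consumed by exactly one reaction or produced by exactly one reaction.

```python
from collections import defaultdict


def passes_chembl_cutoff(pct_identity, aligned_query_len, query_len,
                         min_identity=40.0, min_query_cov=0.75):
    """Return True if a hit meets the 40% identity / 75% query-length cutoff.

    Assumes inclusive thresholds; the source text does not say whether
    the comparison is strict.
    """
    if query_len <= 0:
        return False
    coverage = aligned_query_len / query_len
    return pct_identity >= min_identity and coverage >= min_query_cov


def find_chokepoints(reactions):
    """Return the set of reaction ids that are chokepoints.

    `reactions` maps a reaction id to a (substrates, products) pair of sets.
    A reaction is flagged if it is the only consumer or the only producer
    of at least one compound.
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for rxn, (substrates, products) in reactions.items():
        for compound in substrates:
            consumers[compound].add(rxn)
        for compound in products:
            producers[compound].add(rxn)

    chokepoints = set()
    for index in (consumers, producers):
        for rxns in index.values():
            if len(rxns) == 1:
                chokepoints |= rxns
    return chokepoints


# Toy network: R1 and R2 both convert A to B; only R3 consumes B and makes C.
toy = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
}
chokepoint_ids = find_chokepoints(toy)
```

In the toy network only R3 is flagged, because every compound touched by R1 is also touched in the same role by R2.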
Clustering for NemaGene Meloidogyne incognita v 2.0
Clustering was performed by first building 'contigs' of ESTs with identical or nearly identical overlapping sequence and second, by bringing together related contigs to form 'clusters'. Contig member ESTs should all derive from identical transcripts whereas cluster members might derive from the same gene yet represent different transcript splice isoforms or transcripts from multigene families with extremely high sequence identity. The raw traces for submitted ESTs were base-called using Phred and assembled to form contigs using Phrap. Although Phrap is a program intended for genome assembly, it has been applied previously to ESTs with modifications. To determine initial assembly quality, the largest contigs were inspected using the assembly viewer Consed. Misassemblies bringing unrelated ESTs together into giant contigs usually resulted from the alignment of long poly(A) tails. To eliminate these assemblies of otherwise dissimilar ESTs, Phrap parameters (forcelevel 1, minmatch 20 and minscore 100) were adjusted and Phrap was rerun.
Once acceptable assembly parameters were obtained, Phrap was run to generate a first-draft assembly. Contigs with only one member EST (singletons) were removed from consideration until the trimming and cluster building stage. All contigs with more than three member ESTs were screened for misassemblies using Consed tools and newly written scripts. Misassemblies were recognized by: regions of high quality unaligned sequence; multiple runs of poly(A) and/or poly(T) (at least 15 nucleotides with no more than one non-A/T base); internal poly(A) and/or poly(T) runs (> 50 nucleotides from either end of a contig and ≥ 15 nucleotides long with no more than one non-A/T base); internal stretches of low consensus quality (> 30 nucleotides from either end of a contig and ≥ 50 nucleotides where 90% of the nucleotides had a consensus quality below Phred 20). Contigs flagged for possible misassembly were manually edited in Consed, and potentially chimeric ESTs and other suspect ESTs were identified and removed from the pool of traces. Chimerism can result from multiple-insert cloning or mistracking of sequence gel lanes. The project was reassembled with Phrap and screened again as above. All contigs with more than three members were examined again in Consed to eliminate additional misassemblies not resolved by the initial screens. In total, around 450 contigs were examined manually and around 200 were edited. For each contig, a consensus sequence of all EST members was generated. Contigs (now including singleton EST contigs) were then trimmed to high quality and any internal consensus position with a calculated quality value below 12 was changed to an N (unknown base).
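The poly(A)/poly(T) screen and the final quality masking described above can be sketched in Python. This is a minimal illustration, not the scripts written for NemaGene. The function names and the sliding-window approach are assumptions, as is checking poly(A) and poly(T) as separate runs. The positional conditions (distance from the contig ends) are left out for brevity.

```python
def has_poly_run(seq, base, min_len=15, max_other=1):
    """Return True if `seq` contains a window of `min_len` nucleotides
    made of `base` with at most `max_other` other bases."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    # Count non-`base` characters in the first window, then slide.
    other = sum(1 for c in seq[:min_len] if c != base)
    if other <= max_other:
        return True
    for i in range(min_len, len(seq)):
        other += (seq[i] != base) - (seq[i - min_len] != base)
        if other <= max_other:
            return True
    return False


def mask_low_quality(consensus, qualities, min_quality=12):
    """Replace every consensus base whose quality is below `min_quality`
    with N (unknown base)."""
    if len(consensus) != len(qualities):
        raise ValueError("consensus and qualities must have equal length")
    return "".join(
        b if q >= min_quality else "N" for b, q in zip(consensus, qualities)
    )


flagged = has_poly_run("ACGT" + "A" * 15 + "CG", "A")
masked = mask_low_quality("ACGT", [30, 11, 12, 5])
```

A contig would be flagged for manual review when `has_poly_run` is true for either "A" or "T"; the masking step applies only after contigs have been edited and trimmed.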
Following the creation of contigs by Phrap, the contig consensus sequences were compared using WU-BLASTN (G = 2 E = 1 v = 100 F = F) and grouped on the basis of similarity to form clusters of related contigs. Contigs with overlaps of 100 bases or more with nucleotide-nucleotide identities of 93% or more were clustered together. For further analysis, new assemblies based on clusters were not formed; rather, each cluster retained all the consensus sequences of its contig members. NemaGene Meloidogyne incognita v 2.0 represents our second complete attempt at generating clusters for this species and is used as the basis for all subsequent analysis in this manuscript. Scripts have been written to allow the addition of new data while retaining the original contig and cluster naming scheme. Additional NemaGene versions of M. incognita will be built as additional ESTs become available for the species. A comparison of the NemaGene clustering approach to other EST clustering methods will be considered in a separate manuscript. NemaGene Meloidogyne incognita v 2.0 is available for searching at this [link].
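The grouping of contigs into clusters can be sketched as follows, starting from pairwise alignment results that have already been parsed. This is an illustrative sketch only. The tuple layout and function name are hypothetical, and transitive (single-linkage) grouping via union-find is an assumption about how "grouped on the basis of similarity" was implemented.

```python
def cluster_contigs(contigs, hits, min_overlap=100, min_identity=93.0):
    """Group contigs linked by qualifying pairwise hits.

    `hits` is an iterable of (contig_a, contig_b, overlap_len, pct_identity).
    Two contigs join the same cluster when a hit between them has an overlap
    of at least `min_overlap` bases at `min_identity` percent or more.
    Returns a sorted list of sorted member lists.
    """
    parent = {c: c for c in contigs}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, overlap, identity in hits:
        if overlap >= min_overlap and identity >= min_identity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for c in contigs:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())


example_hits = [
    ("c1", "c2", 150, 95.0),  # qualifies
    ("c2", "c3", 100, 93.0),  # qualifies, exactly at both thresholds
    ("c3", "c4", 99, 99.0),   # overlap too short
]
example_clusters = cluster_contigs(["c1", "c2", "c3", "c4"], example_hits)
```

Consistent with the text, the sketch only assigns contigs to clusters; it does not reassemble them, so each cluster keeps the consensus sequences of all its members.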
Index of prefixes
In the NemaGene database, transcript contigs and isotigs are given a prefix that identifies their species of origin, as well as the sequencing platform on which the underlying data were generated. CDS sequences in NemaBLAST also use these prefix codes to indicate species. Here is an index of prefixes:
|isotig||An isotig is meant to be analogous to an individual transcript. Different isotigs from a given isogroup can be inferred to be splice variants. The reported isotigs are the putative transcripts that can be constructed using overlapping reads provided as input to the assembler. Connections between contigs in an isogroup are represented by sequences (reads) that have alignments diverging consistently towards two or more different contigs (see Figure 91) or by a depth spike (188.8.131.52.1). Traversal from the start contig to the end contig or from the end contig to the start contig should yield the same but reverse-complemented isotig sequence. While many reads may contain poly-A tails, these tails are trimmed off prior to assembling the reads. Presently, the assembler ignores the fact that poly-A tails existed, so the orientation of reads in the assembly cannot be determined. Because of this lack of directionality, an isotig may be output as the reverse complement of the biological transcript it represents. Contigs forming an isotig may be thought of as exons. This is not strictly correct, however, since untranslated regions (UTRs) and introns (in the case of primary transcripts) may exist in the reads generated from the sample.
|isogroup||An isogroup is a collection of contigs containing reads that imply connections between them. A discussion of the assembly process (see Section 1.1) explains how breaks can be introduced into the multiple alignments of overlapping reads, leading to branching structures between them. After attempting to resolve the branching structures, the Transcriptome Assembler groups all contigs whose branches could not be resolved into collections called isogroups. Using rules described in the following section, the assembler traverses the various paths through the contigs in an isogroup to produce the set of isotigs that gets reported. All possible paths through the contigs in an isogroup are traversed unless one or more thresholds is reached (see Section
|RNAseq||RNAseq refers to cDNA sequence data generated using next-generation, high-throughput sequencing technologies.
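Because an isotig may be reported as the reverse complement of the biological transcript (see the isotig entry above), it can help to compare isotig sequences in a strand-independent way. The Python sketch below is an illustration only; the function names are hypothetical and are not part of NemaGene or the assembler.

```python
# Translation table for nucleotide complements, including N (unknown base).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def same_isotig(seq_a, seq_b):
    """Return True if two sequences are identical on either strand."""
    a, b = seq_a.upper(), seq_b.upper()
    return a == b or a == reverse_complement(b)


rc = reverse_complement("ATGCN")
```

Applying `reverse_complement` twice returns the original sequence, which mirrors the statement that traversing an isotig in either direction yields the same sequence up to reverse complementation.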