Nematode Net

NemaGene Clustering

Clustering reduces data redundancy, improves base-call accuracy and transcript length,
and can be used to determine gene representation within the library. For details on our
clustering method, see McCarter JP, Dautova Mitreva M, Martin J, Dante T, Wylie T, Rao U,
Pape D, Bowers Y, Theising B, Murphy C, Kloek AP, Chiapelli BJ, Clifton SW, Bird MD and
Waterston RH (2003). Analysis and Functional Classification of Transcripts from the
nematode Meloidogyne incognita. Genome Biology, 4: R26: 1-19.

Clustering for NemaGene Meloidogyne incognita v 2.0

Clustering was performed in two steps: first, 'contigs' were built from ESTs with identical or nearly identical overlapping sequence; second, related contigs were brought together to form 'clusters'. Contig member ESTs should all derive from identical transcripts, whereas cluster members might derive from the same gene yet represent different transcript splice isoforms, or from transcripts of multigene families with extremely high sequence identity. The raw traces for submitted ESTs were base-called using Phred [87] and assembled to form contigs using Phrap (P. Green, personal communication). Although Phrap is a program intended for genome assembly, it has been applied previously to ESTs with modifications [90]. To determine initial assembly quality, the largest contigs were inspected using the assembly viewer Consed [91]. Misassemblies bringing unrelated ESTs together into giant contigs usually resulted from the alignment of long poly(A) tails. To eliminate these assemblies of otherwise dissimilar ESTs, Phrap parameters were adjusted (forcelevel 1, minmatch 20 and minscore 100) and Phrap was rerun.

Once acceptable assembly parameters were obtained, Phrap was run to generate a first-draft assembly. Contigs with only one member EST (singletons) were removed from consideration until the trimming and cluster building stage. All contigs with more than three member ESTs were screened for misassemblies using Consed tools and newly written scripts. Misassemblies were recognized by: regions of high-quality unaligned sequence; multiple runs of poly(A) and/or poly(T) (at least 15 nucleotides with no more than one non-A/T base); internal poly(A) and/or poly(T) runs (> 50 nucleotides from either end of a contig and ≥ 15 nucleotides long with no more than one non-A/T base); and internal stretches of low consensus quality (> 30 nucleotides from either end of a contig and ≥ 50 nucleotides long, where 90% of the nucleotides had a consensus quality below Phred 20). Contigs flagged for possible misassembly were manually edited in Consed, and potentially chimeric ESTs and other suspect ESTs were identified and removed from the pool of traces. Chimerism can result from multiple-insert cloning or mistracking of sequence gel lanes. The project was reassembled with Phrap and screened again as above. All contigs with more than three members were examined again in Consed to eliminate additional misassemblies not resolved by the initial screens. In total, around 450 contigs were examined manually and around 200 were edited. For each contig, a consensus sequence of all EST members was generated. Contigs (now including singleton EST contigs) were then trimmed to high quality, and any internal consensus position with a calculated quality value below 12 was changed to an N (unknown base).
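The poly(A)/poly(T) screen and the final N-masking step can be illustrated with a minimal Python sketch. This is not the screening code used for NemaGene, which is not reproduced here; the function names (poly_runs, internal_runs, mask_low_quality) and the sliding-window interpretation of "no more than one non-A/T base" are assumptions made for illustration. Only the thresholds (15 nucleotides, one mismatching base, 50 nucleotides from either contig end, quality 12) come from the text.

```python
from typing import List, Sequence, Tuple


def poly_runs(seq: str, base: str, min_len: int = 15,
              max_other: int = 1) -> List[Tuple[int, int]]:
    """Return merged half-open (start, end) intervals covered by at least one
    window of `min_len` bases holding at most `max_other` bases != `base`.
    Assumption: a sliding window approximates the run definition, so an
    interval may include one flanking non-matching base at either edge."""
    seq = seq.upper()
    n = len(seq)
    runs: List[Tuple[int, int]] = []
    if n < min_len:
        return runs
    # Count of non-matching bases in the first window.
    other = sum(1 for c in seq[:min_len] if c != base)
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window one base to the right.
            other -= (seq[start - 1] != base)
            other += (seq[start + min_len - 1] != base)
        if other <= max_other:
            end = start + min_len
            if runs and start <= runs[-1][1]:
                runs[-1] = (runs[-1][0], end)  # extend the previous interval
            else:
                runs.append((start, end))
    return runs


def internal_runs(seq: str, base: str, margin: int = 50) -> List[Tuple[int, int]]:
    """Keep only runs lying more than `margin` bases from both contig ends."""
    n = len(seq)
    return [(s, e) for s, e in poly_runs(seq, base)
            if s > margin and n - e > margin]


def mask_low_quality(seq: str, quals: Sequence[int], min_q: int = 12) -> str:
    """Replace every consensus base whose quality is below `min_q` with N.
    Assumption: the contig has already been trimmed to high quality."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))
```

A contig would be flagged for manual inspection in Consed if internal_runs(consensus, "A") or internal_runs(consensus, "T") returned any interval. A run at the very start or end of a contig is not reported by internal_runs, since a terminal poly(A) tail is expected in an EST.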

Following the creation of contigs by Phrap, the contig consensus sequences were compared using WU-BLASTN (G = 2, E = 1, v = 100, F = F) [92,93] and grouped on the basis of similarity to form clusters of related contigs. Contigs with overlaps of 100 bases or more and nucleotide-nucleotide identities of 93% or more were clustered together. For further analysis, new assemblies based on clusters were not formed; rather, each cluster retained all the consensus sequences of its contig members. NemaGene Meloidogyne incognita v 2.0 represents our second complete attempt at generating clusters for this species and is used as the basis for all subsequent analysis in this manuscript. Scripts have been written to allow the addition of new data while retaining the original contig and cluster naming scheme. Additional NemaGene versions of M. incognita will be built as additional ESTs become available for the species. A comparison of the NemaGene clustering approach to other EST clustering methods will be considered in a separate manuscript. NemaGene Meloidogyne incognita v 2.0 is available for searching at [94] and by FTP at [95].
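The grouping of contigs into clusters can be sketched as follows. The text gives the thresholds (overlap of 100 bases or more, identity of 93% or more) but does not say how pairwise hits were combined, so this sketch assumes single-linkage grouping: any two contigs joined by a chain of qualifying hits end up in the same cluster. The function name cluster_contigs and the hit tuple layout (query, subject, overlap length, percent identity) are hypothetical and do not correspond to any WU-BLASTN output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record: (query contig, subject contig, overlap bp, % identity)
Hit = Tuple[str, str, int, float]


def cluster_contigs(contig_ids: Iterable[str], hits: Iterable[Hit],
                    min_overlap: int = 100,
                    min_identity: float = 93.0) -> List[List[str]]:
    """Group contigs by single linkage over qualifying pairwise hits.
    Every contig named in `hits` must also appear in `contig_ids`."""
    ids = list(contig_ids)
    parent: Dict[str, str] = {c: c for c in ids}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, overlap, identity in hits:
        if query == subject:
            continue  # ignore self-hits
        if overlap >= min_overlap and identity >= min_identity:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two groups

    groups: Dict[str, List[str]] = defaultdict(list)
    for c in ids:
        groups[find(c)].append(c)
    # Each cluster keeps all of its member contigs; nothing is reassembled.
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    example_hits = [("c1", "c2", 150, 95.0), ("c2", "c3", 120, 93.0),
                    ("c3", "c4", 99, 99.0)]
    print(cluster_contigs(["c1", "c2", "c3", "c4"], example_hits))
```

In the example, c1 and c3 share a cluster through c2 even though they have no direct hit, while c4 stays alone because its only overlap is shorter than 100 bases. A stricter scheme, such as requiring every pair in a cluster to meet the thresholds, would give smaller clusters.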