Deep transcriptome sequencing (RNA-seq) can recover information that earlier array-based technologies may miss, thanks to its high sensitivity and high throughput and, more importantly, because it requires no prior knowledge of transcript sequences.
It has been used extensively to discover new genes, novel splicing isoforms, and disease-related chimeras (such as gene fusions).
Targeted RNA sequencing reveals that the range, depth, and complexity of the human transcriptome are far from fully characterized; many novel genes, new isoforms, and rare transcripts remain undiscovered.
The unprecedented sequencing capacity of next-generation sequencing (NGS) platforms makes it possible to identify this "dark matter". Annotating these "new genes" is even more important, however, and no tools are available for that task.
A new-gene annotation process involves filtering out false positives, predicting coding potential, and integrating other knowledge. genCAT is designed to fulfill these tasks. It is an open platform that allows users to incorporate as many datasets (concepts) as they wish to annotate the input gene list, as long as these datasets are prepared in bigWig, BED, or BAM/SAM format.
These file formats are well known and flexible enough to accommodate different kinds of NGS data (RNA-seq, ChIP-seq, DNA methylation, SNPs, etc.).
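To illustrate how a BED dataset could be associated with an input gene list, here is a minimal Python sketch that parses BED lines and reports which features overlap each gene. It is not genCAT's own code: the function names (parse_bed_line, overlaps, overlapping_features) and the sample coordinates are hypothetical, and the only assumption about the data is the standard BED convention of tab-separated fields with 0-based, half-open intervals.

```python
from collections import namedtuple

# One genomic interval, as found in the first four columns of a BED line.
Interval = namedtuple("Interval", ["chrom", "start", "end", "name"])


def parse_bed_line(line):
    """Parse one tab-separated BED line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    return Interval(chrom, start, end, name)


def overlaps(a, b):
    """True if two 0-based, half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def overlapping_features(genes, features):
    """Map each gene name to the names of the features that overlap it."""
    return {g.name: [f.name for f in features if overlaps(g, f)] for g in genes}


# Made-up example data: two candidate genes and three annotation features.
genes = [
    parse_bed_line("chr1\t100\t500\tnovel_gene_1"),
    parse_bed_line("chr2\t100\t500\tnovel_gene_2"),
]
features = [
    parse_bed_line("chr1\t450\t600\tpeak_A"),
    parse_bed_line("chr1\t500\t700\tpeak_B"),  # starts exactly where gene 1 ends
    parse_bed_line("chr2\t50\t101\tsnp_C"),
]

hits = overlapping_features(genes, features)
print(hits)
```

The all-against-all comparison is fine for small lists; for genome-scale datasets an indexed structure (for example, sorted intervals or an interval tree) would likely be needed.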
Installation: To install genCAT on your Mac, open a Terminal window, change to genCAT's folder, and run the following command from inside that directory (you will need administrator privileges to run it):
sudo python setup.py install
Here are some key features of "genCAT":
· Automatically recognize the RNA-seq experiment type: paired-end or single-end, strand-specific or not. If strand-specific, automatically determine how paired reads were stranded and calculate strand specificity.
· Automatically recognize SAM or BAM files.
· Precisely determine the coding status of newly identified genes or isoforms.
· Precisely define the ORF (open reading frame) region for protein-coding genes.
· Quickly associate known concepts (epigenetic markers, SNPs, etc.) with newly identified genes or isoforms.
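The idea of defining an ORF region can be sketched in a few lines of Python. The example below is not genCAT's method, which is not described here; it is a simplified illustration that scans the three forward reading frames of a transcript sequence and returns the longest stretch running from an ATG start codon to the first in-frame stop codon. The function name longest_orf and the test sequence are hypothetical, and real coding-potential assessment typically relies on more evidence than ORF length alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and half-open, and the stop codon is included.
    Returns None if no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        orf_start = None
        # Walk the sequence codon by codon within this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - orf_start > best[1] - best[0]:
                    best = (orf_start, end)
                orf_start = None  # look for another ORF further downstream
    return best


# Made-up transcript fragment used only for illustration.
transcript = "CCATGAAATTTGGGTAACC"
orf = longest_orf(transcript)
if orf is not None:
    print(orf, transcript[orf[0]:orf[1]])
```

Only the forward strand is scanned here; for unstranded data the reverse complement would need to be checked as well.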