Zhao, S., & Zhang, B. (2015). A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics, 16(1), 97. doi:10.1186/s12864-015-1308-8. In this paper, the authors evaluated the impact of using different gene annotation sets on RNA-Seq analysis.
As the authors state: "To a broader extent, one of the most practical questions researchers want to know in advance is: if different gene models are chosen for RNA-Seq data analysis, what is the chance of obtaining the same quantification result for a given gene?"
Below are notes I took from this paper and other resources.
- RefSeq human gene models are well supported and broadly used in various studies. The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, and mutation and polymorphism analysis.
- The UCSC Known Genes dataset is based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank, and serves as a foundation for the UCSC Genome Browser.
- Vega genes are manually curated transcripts produced by the HAVANA group at the Wellcome Trust Sanger Institute, and are merged into Ensembl.
- Ensembl genes contain both automated genome annotation and manual curation, while the gene set of GENCODE corresponds to Ensembl annotation since GENCODE version 3c (equivalent to Ensembl 56).
- AceView provides a comprehensive non-redundant curated representation of all available human cDNA sequences.
21,958 genes were shared by Ensembl, RefGene, and UCSC Known Genes.
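A minimal sketch of how a shared-gene count like this could be computed, assuming each annotation has already been reduced to a set of gene symbols. The set contents and the `shared_genes` helper are hypothetical placeholders, not data or code from the paper, and matching genes by symbol is itself an assumption, since these notes do not describe the paper's matching procedure.

```python
# Toy gene-symbol sets standing in for three annotations (hypothetical data).
ensembl = {"PIK3CA", "TP53", "BRCA1", "GENE_ONLY_IN_ENSEMBL"}
refgene = {"PIK3CA", "TP53", "BRCA1"}
ucsc_known_genes = {"PIK3CA", "TP53", "GENE_ONLY_IN_UCSC"}


def shared_genes(*annotations):
    """Return the gene symbols present in every annotation set."""
    return set.intersection(*annotations)


common = shared_genes(ensembl, refgene, ucsc_known_genes)
print(len(common), sorted(common))
```

With real annotations, the sets would typically be built from the gene identifiers in each annotation file, after deciding how to reconcile identifiers that differ between sources.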
Among the 21,958 common genes, about 20% of genes had no expression at all in both annotations. Identical counts were obtained for only 16.3% of genes. Approximately 28.1% of genes’ expression levels differed by 5% or higher, and among them, 9.3% of genes (equivalent to 2038) differed by 50% or greater. As shown in Table 1 and Figure 5, the choice of a gene model had a large impact on gene quantification.
Why does the choice of a gene model have so dramatic an effect on gene quantification? Below, we chose a few extreme or representative cases to provide possible explanations. In the liver sample, the expression levels for these exemplary genes for both Ensembl and RefGene were summarized in Table 2 (read length = 75 bp). PIK3CA (phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha) uses ATP to phosphorylate PtdIns, PtdIns4P, and PtdIns(4,5)P2. In the liver sample, there were 1094 reads mapped to PIK3CA in Ensembl annotation, while only 492 reads were mapped in RefGene. The PIK3CA gene definition in both Ensembl and RefGene, and the mapping profile of RNA-Seq reads were shown in Figure 6. Clearly, the difference in gene definition gives rise to the observed discrepancy in quantification.
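To make the percentage categories above concrete, here is a small sketch that classifies a gene by how much its read counts differ between two annotations. These notes do not record how the paper defines "differed by 5%", so the relative difference used here (absolute difference divided by the larger count) is an assumption, as are the function names. Only the PIK3CA counts come from the text; the other genes are made up for illustration.

```python
def relative_difference(count_a, count_b):
    """Relative difference between two read counts, scaled by the larger one.

    This definition is an assumption; it returns 0.0 when both counts are zero.
    """
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0
    return abs(count_a - count_b) / larger


def classify(count_a, count_b):
    """Assign a gene to one of the categories discussed above."""
    if count_a == 0 and count_b == 0:
        return "not expressed"
    if count_a == count_b:
        return "identical"
    diff = relative_difference(count_a, count_b)
    if diff >= 0.50:
        return "differs by >= 50%"
    if diff >= 0.05:
        return "differs by >= 5%"
    return "differs by < 5%"


# (Ensembl count, RefGene count); only PIK3CA comes from the text above.
counts = {
    "PIK3CA": (1094, 492),
    "TOY_GENE_A": (200, 200),  # hypothetical
    "TOY_GENE_B": (0, 0),      # hypothetical
    "TOY_GENE_C": (100, 98),   # hypothetical
}

for gene, (ens, ref) in counts.items():
    print(gene, classify(ens, ref), round(relative_difference(ens, ref), 3))
```

Under this assumed definition, PIK3CA differs by about 55% (602 of 1094 reads), which would place it in the 50%-or-greater group.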
It was suggested that when conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation, such as RefGene, might be preferred. When conducting more exploratory research, a more complex genome annotation, such as Ensembl, should be chosen. Based upon our experience of RNA-Seq data analysis, we recommend using RefGene annotation if RNA-Seq is used as a replacement for a microarray in transcriptome profiling.