Thursday, April 30, 2015

Difference Between RefSeq, Ensembl, and UCSC Gene Annotation

I came across this interesting paper:
Zhao, S., & Zhang, B. (2015). A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics, 16(1), 97. doi:10.1186/s12864-015-1308-8. In it, the authors evaluated the impact of using different gene annotation sets on RNA-Seq analysis.

As the authors state: "To a broader extent, one of the most practical questions researchers want to know in advance is: if different gene models are chosen for RNA-Seq data analysis, what is the chance of obtaining the same quantification result for a given gene?"

Below are notes I took from this paper and other resources.

  • RefSeq human gene models are well supported and broadly used in various studies. The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, and mutation and polymorphism analysis.
  • The UCSC Known Genes dataset is based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank, and serves as a foundation for the UCSC Genome Browser.
  • Vega genes are manually curated transcripts produced by the HAVANA group at the Wellcome Trust Sanger Institute, and are merged into Ensembl.
  • Ensembl genes contain both automated genome annotation and manual curation, while the gene set of GENCODE corresponds to Ensembl annotation since GENCODE version 3c (equivalent to Ensembl 56).
  • AceView provides a comprehensive non-redundant curated representation of all available human cDNA sequences.

21,958 genes were shared by all three annotations: Ensembl, RefGene, and UCSC Known Genes.

Among the 21,958 common genes, about 20% of genes had no expression at all in both annotations. Identical counts were obtained for only 16.3% of genes. Approximately 28.1% of genes’ expression levels differed by 5% or higher, and among them, 9.3% of genes (equivalent to 2038) differed by 50% or greater. As shown in Table 1 and Figure 5, the choice of a gene model had a large impact on gene quantification.

Why does the choice of a gene model have so dramatic an effect on gene quantification? Below, we chose a few extreme or representative cases to provide possible explanations. In the liver sample, the expression levels for these exemplary genes for both Ensembl and RefGene were summarized in Table 2 (read length = 75 bp). PIK3CA (phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha) uses ATP to phosphorylate PtdIns, PtdIns4P, and PtdIns(4,5)P2. In the liver sample, there were 1094 reads mapped to PIK3CA in Ensembl annotation, while only 492 reads were mapped in RefGene. The PIK3CA gene definition in both Ensembl and RefGene, and the mapping profile of RNA-Seq reads were shown in Figure 6. Clearly, the difference in gene definition gives rise to the observed discrepancy in quantification. 
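The percentage bins quoted above (identical counts, differing by 5% or more, differing by 50% or more) and the PIK3CA example can be illustrated with a minimal sketch in R. The helper name `rel_diff`, the choice of the larger count as the denominator, and the toy genes other than PIK3CA are assumptions for illustration only; the paper's exact definition of percent difference may differ.

```r
# Relative difference between two read counts for the same gene.
# Assumption: the difference is scaled by the larger of the two counts;
# genes with zero reads in both annotations get NA.
rel_diff <- function(a, b) {
  denom <- pmax(a, b)
  ifelse(denom == 0, NA_real_, abs(a - b) / denom)
}

# PIK3CA counts are the liver-sample values quoted above;
# GENE_A, GENE_B and GENE_C are made-up placeholders.
counts <- data.frame(
  gene    = c("PIK3CA", "GENE_A", "GENE_B", "GENE_C"),
  ensembl = c(1094, 200, 0, 105),
  refgene = c(492, 200, 0, 100)
)
counts$diff <- rel_diff(counts$ensembl, counts$refgene)

# Bin each gene: not expressed, identical, <5%, 5-50%, >=50%
counts$class <- ifelse(is.na(counts$diff), "not expressed",
                ifelse(counts$diff == 0, "identical",
                ifelse(counts$diff < 0.05, "<5%",
                ifelse(counts$diff < 0.5, "5-50%", ">=50%"))))
print(counts)
```

Under this definition PIK3CA lands in the ">=50%" bin, since 602 of the 1094 Ensembl reads have no counterpart in RefGene.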
It was suggested that when conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation, such as RefGene, might be preferred. When conducting more exploratory research, a more complex genome annotation, such as Ensembl, should be chosen. Based upon our experience of RNA-Seq data analysis, we recommend using RefGene annotation if RNA-Seq is used as a replacement for a microarray in transcriptome profiling. 

WiMAX 2+ Speed Test In Japan (Inside and Outside Tokyo)

I signed a two-year WiMAX 2+ contract in February, before the government’s new data usage regulation took effect. Under the new policy, starting from 2015/02/19, the connection speed is limited to 128 kbps if monthly data usage exceeds 7 GB.
The contract terms include:
  • monthly fee: 3,696 yen
  • unlimited data plan without speed restriction (as long as you have not used more than 3 GB in the most recent 3 days)
  • cashback of 15,100 yen, offered after 10 months of use
  • renewal every 2 years (cancellation fee: 19,000 yen in the first year, 14,000 yen in the second)
  • eligible to use the AU 4G LTE network for an additional charge
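As a rough check on what these terms add up to, here is a small sketch of the effective monthly cost over one contract period. It assumes the cashback is actually received and ignores tax, any initial fees, and the optional LTE charge, none of which are stated in the terms above.

```r
monthly_fee <- 3696    # yen per month, from the contract terms
cashback    <- 15100   # yen, offered after 10 months of use
months      <- 24      # one two-year contract period

# Total paid over the contract, net of cashback (assumption: no other fees)
total_cost    <- monthly_fee * months - cashback
effective_fee <- total_cost / months

cat("Total over 2 years:", total_cost, "yen\n")
cat("Effective monthly fee:", round(effective_fee), "yen\n")
```

With these numbers the net total is 73,604 yen, or about 3,067 yen per month.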
Although its price is not the lowest, @Nifty seems to be a much more reliable provider than GMOBB (see my terrible experience with GMOBB here).
The WiMAX router that came with the plan is the Speed Wi-Fi Next W01 (manufactured by Huawei), a so-called ultra-high-speed router with advertised maximum speeds of 220 Mbps for download and 10 Mbps for upload.
In addition, the router can be manually switched to a hybrid 4G LTE mode in areas with poor or no WiMAX coverage (in which case an additional monthly fee of 1,005 yen is charged).
After 2 months of personal use, I can say I am basically satisfied with the service. The speed is OK for daily use, although far below the advertised maximum. The LTE function is helpful in the metro and in other areas.
Below are speed test data from different locations: my home, office, train station, Shinkansen, and a resort hotel. I used the iPhone app “OOKLA SpeedTest” for this, so the test results might be affected by my iPhone's specs, etc.
WiMAX was not stable on the Shinkansen, and at the resort hotel at Hamanako, which is far away from the city, there was no WiMAX signal at all. For these two locations, only LTE data were available.
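library(plyr)  # provides ddply() and summarise()
data <- read.csv("~/Desktop/2015-05-06-WiMAX-Speed-Test.csv", header=TRUE)
# Rescale columns 4 and 5 by 1/1000 (presumably Download and Upload, kbps to Mbps)
data[,c(4,5)] <- data[,c(4,5)]/1000
# Standard error of the mean
se <- function(x) sqrt(var(x)/length(x))
# Mean and standard error of download/upload speed per location and connection type
result <- ddply(data,~Location+ConnType,summarise,down_mean=mean(Download),down_se=se(Download),up_mean=mean(Upload),up_se=se(Upload))
result[,-c(1,2)] <- round(result[,-c(1,2)],digits=2)
result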
##      Location ConnType down_mean down_se up_mean up_se
## 1    Hamanako      LTE     12.76    0.70    0.59  0.09
## 2        Home      LTE     10.67    0.68    2.00  0.32
## 3        Home    WiMAX      9.03    0.98    2.15  0.16
## 4         Lab      LTE     21.02    1.74    2.93  0.31
## 5         Lab    WiMAX     13.99    1.17    8.40  0.30
## 6  ShinKansen      LTE      7.59    1.57    1.14  0.44
## 7     Station      LTE     10.29    1.71    3.18  1.00
## 8     Station    WiMAX      6.38    1.03    1.23  0.10
## 9   Tobu_Line      LTE     11.98    1.53    2.69  0.45
## 10  Tobu_Line    WiMAX      7.89    1.66    1.45  1.09
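To compare the two connection types directly, the means in the table can be paired by location. The sketch below copies the values from the table for the four locations where both LTE and WiMAX were measured; the data frame name `res` and the ratio columns are my own additions, and a ratio of means says nothing about statistical significance.

```r
# Download/upload means (Mbps) copied from the summary table above,
# restricted to locations where both connection types were measured.
res <- data.frame(
  Location  = rep(c("Home", "Lab", "Station", "Tobu_Line"), each = 2),
  ConnType  = rep(c("LTE", "WiMAX"), times = 4),
  down_mean = c(10.67, 9.03, 21.02, 13.99, 10.29, 6.38, 11.98, 7.89),
  up_mean   = c(2.00, 2.15, 2.93, 8.40, 3.18, 1.23, 2.69, 1.45)
)

# Put LTE and WiMAX side by side, one row per location
lte   <- res[res$ConnType == "LTE", ]
wimax <- res[res$ConnType == "WiMAX", ]
cmp   <- merge(lte, wimax, by = "Location", suffixes = c("_lte", "_wimax"))

# A ratio above 1 means LTE was faster on average at that location
cmp$down_ratio <- round(cmp$down_mean_lte / cmp$down_mean_wimax, 2)
cmp$up_ratio   <- round(cmp$up_mean_lte / cmp$up_mean_wimax, 2)
print(cmp[, c("Location", "down_ratio", "up_ratio")])
```

On these means, LTE download was faster at all four locations, while WiMAX upload was faster at Home and in the Lab.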
Plot the data:
library(ggplot2)
# Mean download speed by location and connection type, with +/- 1 SE error bars
ggplot(result, aes(x=Location, y=down_mean, fill=ConnType)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=down_mean-down_se, ymax=down_mean+down_se),
                  width=.2, position=position_dodge(.9))
# Mean upload speed; the error bars use the upload standard error (up_se)
ggplot(result, aes(x=Location, y=up_mean, fill=ConnType)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=up_mean-up_se, ymax=up_mean+up_se),
                  width=.2, position=position_dodge(.9))
Cautions for the Data Usage:
If the total amount of LTE 4G communications exceeds 7GB in a month, the maximum communications speed for sending and receiving until the end of that month will be 128kbps. (The communication speed constraint will be removed on the first day of the following month.) If you apply for the Extra Option, the communication speed constraint does not apply.
To avoid network congestion, if you are using a 4G LTE smartphone, a limit will be placed on communications speed for a whole day if you have used more than 3GB over the three most recent days (not including the day on which the restriction is applied). (Restrictions also apply for customers subscribed to the Extra Option.)
If the communication charges become high, service may be suspended temporarily.
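The three-day rule above can be sketched as a small R function. This is only my reading of the wording: the restriction on a given day is assumed to be triggered when usage summed over the three preceding days exceeds 3 GB. The function name `throttled_days` and the daily usage figures are hypothetical.

```r
# Which days would be throttled under the 3 GB / 3 days rule?
# Assumption: day i is restricted when usage summed over days i-3, i-2
# and i-1 exceeds the limit (the restricted day itself is not counted).
throttled_days <- function(daily_gb, limit_gb = 3, window = 3) {
  n <- length(daily_gb)
  sapply(seq_len(n), function(i) {
    if (i <= 1) return(FALSE)  # no history before the first day
    prev <- daily_gb[max(1, i - window):(i - 1)]
    sum(prev) > limit_gb
  })
}

usage <- c(0.5, 1.2, 1.5, 0.2, 0.1, 2.0, 1.5, 0.1)  # hypothetical daily usage in GB
print(throttled_days(usage))
```

With this made-up usage, days 4 and 8 would be restricted, because the preceding three days add up to 3.2 GB and 3.6 GB respectively.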