Friday, December 18, 2015

The isoform of Ube3a

The isoform abundances of Ube3a (in colon and brain cortex of adult mice)

Data 
wgEncodeCshlLongRnaSeqColonAdult8wksAlnRep1.bam
wgEncodeCshlLongRnaSeqColonAdult8wksAlnRep1V2.bam
wgEncodeCshlLongRnaSeqCortexAdult8wksAlnRep1.bam
wgEncodeCshlLongRnaSeqCortexAdult8wksAlnRep2.bam
I used TopHat for the transcriptome analysis and examined the isoform_exp.diff file:
mRNA_id         gene_id         locus                   fpkm_colon  fpkm_cortex
NM_001033962    Ube3a_isoform_3 chr7:66484119-66562097  0.004558    5.07055
NM_011668       Ube3a_isoform_2 chr7:66484119-66562097  0.230222    0.60659
NM_173010       Ube3a_isoform_1 chr7:66484119-66562097  5.82296     1.36267
It seems that iso3 of Ube3a is the main variant in the mouse cortex (5.07 FPKM, versus 1.36 for iso1 and 0.61 for iso2), whereas iso1 dominates in the colon. Iso2 encodes the full-length protein. Iso1 is considered to be E3-ligase deficient, as it lacks 87 amino acids from the C-terminal HECT domain. Both iso1 and iso3 also lack 21 amino acids from the N-terminus. Regarding localization, iso1 and iso2 are found throughout the cell, whereas iso3 is confined to the nucleus.
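To make the comparison between tissues easier to read, the FPKM values can be turned into each isoform's share of the total Ube3a signal per tissue. The snippet below is only a rough sketch: the numbers are copied from the table above, the iso1/iso2/iso3 labels follow its gene_id column, and the function names are my own. A share of FPKM is a descriptive summary, not a statistical test.

```python
# FPKM values copied from the isoform_exp.diff excerpt above.
fpkm = {
    "NM_001033962": {"isoform": "iso3", "colon": 0.004558, "cortex": 5.07055},
    "NM_011668": {"isoform": "iso2", "colon": 0.230222, "cortex": 0.60659},
    "NM_173010": {"isoform": "iso1", "colon": 5.82296, "cortex": 1.36267},
}


def isoform_fractions(table, tissue):
    """Return each isoform's share of the summed Ube3a FPKM in one tissue."""
    total = sum(row[tissue] for row in table.values())
    return {row["isoform"]: row[tissue] / total for row in table.values()}


def dominant_isoform(table, tissue):
    """Return the label of the isoform with the largest FPKM share."""
    fractions = isoform_fractions(table, tissue)
    return max(fractions, key=fractions.get)


for tissue in ("colon", "cortex"):
    fractions = isoform_fractions(fpkm, tissue)
    summary = ", ".join(
        f"{name}: {share:.1%}" for name, share in sorted(fractions.items())
    )
    print(f"{tissue}: {summary} (dominant: {dominant_isoform(fpkm, tissue)})")
```

With these values, iso3 accounts for roughly 72% of the Ube3a FPKM in cortex, while iso1 accounts for roughly 96% in colon.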



Tuesday, December 1, 2015

MOOC Courses for Genetics/Genomics Data Analysis

I have taken a number of massive open online courses (MOOCs) since 2014, and here is a short summary.

In general, I believe learning by doing is the most efficient way to learn, and studying a subject without applying the knowledge to a practical problem is unlikely to result in a good understanding. For that reason, I used to resist learning things that felt less relevant to my current work; for example, I thought it made little sense to learn NGS analysis when I was not doing any NGS study. The dilemma is that, when facing a complex problem, you often need a certain level of skill and knowledge, and you need to be aware of the existing tools. From this personal learning experience, I can say that I benefited a lot from the MOOCs and I am happy that I invested the time. These courses serve as a good starting point for building a broad knowledge base.

The most popular MOOC websites are Coursera, EdX, and Stanford OpenEdX. The first two offer more courses on genetics, genomics, and bioinformatics; on OpenEdX, some courses are not free.

At Coursera, I obtained certificates for the following courses:
Johns Hopkins University  Regression Models 
Johns Hopkins University  Statistical Inference
Johns Hopkins University  The Data Scientist’s Toolbox
Johns Hopkins University  Reproducible Research
Johns Hopkins University  R Programming (highly recommended)
University of Michigan  Programming for Everybody (Python)

At EdX, I finished the following courses with a certificate:
HarvardX -  PH525x  Data Analysis for Genomics (highly recommended)
MITx -  6.00.1x   Introduction to Computer Science and Programming Using Python (highly recommended)

At OpenEdX, I took the course Statistics in Medicine but didn’t finish it. 

The courses I like the most are:

PH525x Data Analysis for Genomics - this course covers a wide range of genomics analyses and the instruction is easy to follow. You can always find something useful in it. From this course I got to know pheatmap, a handy tool for plotting (elegant) heatmaps, which has become one of my favourite R packages.

6.00.1x Introduction to Computer Science and Programming Using Python - very entertaining, and I like the way Prof. Grimson taught. It is not only about how to program in Python; I learned more about computational thinking.

R Programming - I skipped a lot of the lecture videos but enjoyed working on the assignments.


The Quantitative Biology Workshop (7.QBWx), in my opinion, is not very focused and the transition from the learning material to the question is not always smooth. Although I understand that Matlab is widely used in the field of neuroscience, I do hope the course can adopt R or Octave over Matlab as the former two are free software. 



Wednesday, October 21, 2015

Autism Genetics and Open Science

“MSSNG”, a project that aims to sequence the whole genomes of 10,000 autistic individuals and make the data publicly available, was launched by Autism Speaks in collaboration with Google. The name is pronounced “missing”, with the vowels omitted; this deliberate omission reflects the missing puzzle pieces of autism.

The MSSNG website is at www.mss.ng, an address that is easy to remember. Given its unprecedented sample size and open-access policy, the project should have a large impact on genetic studies of ASD. The huge volume of sequencing data will be hosted on the Google Cloud Platform and can be analyzed with Google Genomics tools. MSSNG may lead to a major breakthrough in understanding the causes of autism and in its diagnosis and treatment.

Tuesday, October 20, 2015

2015 ASHG annual meeting in Baltimore - Memo and Thoughts


I had the great honor of receiving a travel grant awarded by the Japanese Association for Propagation of the Knowledge of Genetics. With this generous support, I was able to attend the American Society of Human Genetics (ASHG) 2015 annual meeting, held in Baltimore from Oct 6th to 10th, one of the largest and most important meetings in genetic research. The scientific program was well organized and covered a wide range of topics, from statistical genetics and population genetics to fundamental research and clinical applications. The cutting-edge genome-editing tool CRISPR-Cas9 was also highlighted, with a special session dedicated to recent advances in the field; in recognition of the importance of this technology, the prestigious ASHG Gruber Genetics Prize was awarded to Prof. Emmanuelle Charpentier and Prof. Jennifer Doudna for their contributions to the discovery and application of the CRISPR-Cas9 system. In addition, with more than 6500 attendees including researchers, clinicians and vendors, the meeting provided excellent opportunities to extend one's network and establish future collaborations.

I was inspired by a number of exciting studies and also benefited greatly from stimulating discussions with other attendees. For me, the most impressive lesson was the utility of high-throughput chromosome conformation capture (Hi-C) data. In the talk titled “The influence of structural variation on genomic integrity and gene regulation”, Dr. Malte Spielmann from the Max Planck Institute for Molecular Genetics presented his remarkable research on the organization of megabase-scale topologically associating domains (TADs) and the functional consequences when the integrity of this genome architecture is disrupted. The work, “Disruptions of Topological Chromatin Domains Cause Pathogenic Rewiring of Gene-Enhancer Interactions”, was published in the journal Cell (http://www.cell.com/cell/references/S0092-8674(15)00377-3). In this study, the researchers found that different forms of structural variation (copy number variations, CNVs) at the Epha4 locus lead to different types of limb malformation. With the aid of Hi-C data, the authors identified three TADs at this locus. They hypothesized that CNVs may disrupt the local chromatin organization and change enhancer-promoter interactions, leading to abnormal expression of adjacent genes outside the original TAD, which supports the concept of enhancer adoption as a pathogenic mechanism. Using CRISPR/Cas9 genome editing, they created mice carrying different chromosomal rearrangements found in human patients (the methodology was published in Cell Reports) and showed that if a CNV disrupts a CTCF-associated boundary, a gene located in the neighboring TAD is misregulated by the distal enhancer, which leads to abnormal limb formation. This study demonstrated that the integrity of chromatin topology is essential for understanding the molecular mechanisms of pathogenesis, especially those involving large chromosomal variations.
In another talk, by Rao et al., extremely high-resolution 3D maps of the human and mouse genomes were introduced, along with accompanying software for visualizing the dense Hi-C data (http://www.aidenlab.org/juicebox).
In addition, I found the talk “Epigenetic and transcriptional dysregulation of oxytocin receptor (Oxtr) in Tet1 methylcytosine dioxygenase deficient mouse brain” quite interesting. In this talk, Dr. Tower reported that Oxtr was among the top down-regulated genes in the hippocampus of Tet1-/- mice. Tet1 is a gene with a pivotal role in DNA demethylation in mammals. They further demonstrated that the down-regulation of Oxtr was mediated by hypermethylation of the CpG island (CGI) located within Oxtr exon 3 in Tet1-/- mice, rather than of the CGI in the promoter region. While CGI hypermethylation was not observed in ESCs, hypermethylation of Oxtr exon 3 was detected as early as E14.5. This suggests that TET1 is necessary for preventing hypermethylation of Oxtr within the first few days post conception in mice. Given the critical role of Oxtr in social and maternal behavior, they went on to perform behavioral tests and observed impaired maternal care in virgin Tet1-/- female mice, as evidenced by a longer latency to pup retrieval and less time spent huddling with the pups.

In the poster session, I visited essentially all posters that had either “autism” or “CNV” as a keyword in the abstract. One of the interesting presentations was No. 3123F, “Whole-exome sequencing identifies a novel 2.5 kb duplication in INSR in a patient with Donohue syndrome”. Mutations in the insulin receptor gene (INSR) are known to cause Donohue syndrome, a rare disorder characterized by severe insulin resistance. However, for several patients with Donohue syndrome from the same family, no mutation was found by standard Sanger sequencing of the whole INSR gene. To search for other pathogenic mutations, whole-exome sequencing (WES) was conducted, but still no plausible mutation was identified. At this point, the authors performed CNV calling and found a 2.5 kb micro-duplication spanning exon 10-11 of the very same causal gene, INSR. Further analysis revealed that this duplication caused a frameshift in the coding sequence and resulted in a premature stop codon. In summary, for WES it is worth searching for potential CNVs when SNV analysis yields no promising results.
In another presentation, No. 3138, titled “comprehensive comparative performance analysis of high-resolution array platforms for genome-wide CNV detection in humans”, I was surprised to learn that the Affymetrix 6.0 chip outperforms CytoScan, a chip designed solely for CNV analysis.
Poster No. 1755 presented Kaviar, a comprehensive public catalog of human variants and genotype frequencies, accessible at http://db.systemsbiology.net/kaviar. This tool combines 31 public data sources and 4622 private whole genome sequences. It integrates genome variation data from 77,238 unrelated individuals, including the 1000 Genomes Project's data, UK10K COHORT allele frequencies representing 3781 individuals, the Exome Aggregation Consortium (ExAC) 63,000 exomes, and 808 whole genomes from the Alzheimer's Disease Neuroimaging Initiative (ADNI). In short, it provides a one-stop query engine for looking up the allele frequency of a rare variant.

I also participated in a poster walk, “Genome Structure, Variation, and Function”, led by Prof. Manolis Dermitzakis. He discussed three selected posters and shared his insights into how genetic variants influence gene expression levels. I personally found No. 3173F intriguing. In this comprehensive study of gene regulatory variation, the authors evaluated the influence of variants on distal epigenetic modification, mRNA stability, transcription and translation rates, and ribosome occupancy. They found that as many as 30% of all QTLs that affect protein expression levels do not appear to affect chromatin-level traits. Instead, they tend to modulate gene expression levels directly by affecting splicing and/or RNA decay.

My personal reflection on this year’s ASHG is that, with the trend towards higher-resolution, higher-throughput data (Hi-C, ENCODE, and whole exome/genome data from thousands of samples) and the availability of genome-editing tools to manipulate the genome at the cell and animal level, many challenging biological hypotheses can now be tested with computational, statistical, and experimental methods. This will in turn lead to a better understanding of the genetic mechanisms of biological processes such as development and aging, and of the pathogenic mechanisms of diseases.

Tuesday, September 29, 2015

Docker for Bioinformatics and Genetics - Part 3

Docker Tutorial Part 3: Build a docker image which contains Birdsuite for CNV analysis

The installation of Birdsuite is quite tricky. The software package is a mix of Java, Python, R, and Matlab programs, and several of them are obsolete and need extra tweaking. Here I created a docker image with Birdsuite installed. The procedure to build the image is available at this github repo.
To get a copy of this image, simply type
docker pull psytky03/birdsuite
To use it, first create a container and get inside:
docker run -ti psytky03/birdsuite /bin/bash
Run with the test data:
cd /test_data
/birdsuite/birdsuite.sh --basename=test \
--chipType=GenomeWideSNP_6 \
--outputDir=output \
--genderFile=test.gender \
--celFiles=test.cels \
--noLsf \
--apt_probeset_summarize.force
Wait for the program to finish ;-)
That is all. Without docker, it would probably take me a whole afternoon to install the package on another Linux PC, and I would have to take care of all the dependencies and the environment. With the docker image, I can simply run it anywhere. I think docker is really powerful and solves a big problem for reproducible research.

Additional:
The Dockerfile used to build the image
FROM ubuntu:14.04

MAINTAINER psytky03 <psytky03@gmail.com>

USER root
ENV bs /birdsuite

RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
python \
build-essential \
make \
gcc \
openjdk-6-jre \
r-base-core \
r-base-dev \
python-numpy \
python-setuptools \
bc \
libxp6 \
&& apt-get clean \
&& apt-get autoremove -y

# Download birdsuite_executable files
RUN mkdir $bs \
&& wget http://www.broadinstitute.org/ftp/pub/mpg/birdsuite/birdsuite_executables_1.5.5.tgz \
&& tar -zxvf birdsuite_executables_1.5.5.tgz -C $bs \
&& rm birdsuite_executables_1.5.5.tgz \
&& rm $bs/birdsuite.sh $bs/run_birdseye.sh


RUN wget https://dl.dropboxusercontent.com/u/964493/additional.tar.gz \
&& tar -zxvf additional.tar.gz \
&& rm additional.tar.gz \
&& mv METADATADIR.tar.gz $bs/ \
&& tar -zxvf $bs/METADATADIR.tar.gz -C $bs \
&& rm $bs/METADATADIR.tar.gz \
&& mv addon/* $bs/ \
&& chmod 755 $bs/MCRInstaller.75.glnxa64.bin \
$bs/apt-probeset-summarize.64 \
$bs/birdsuite.sh \
$bs/run_birdseye.sh \
&& $bs/MCRInstaller.75.glnxa64.bin -P bean421.installLocation="/birdsuite/MCR75_glnxa64" -silent \
&& rm $bs/MCRInstaller.75.glnxa64.bin \
&& mkdir test_data data \
&& tar -zxvf birdsuite_inputs_1.5.5.tgz -C test_data \
&& rm birdsuite_inputs_1.5.5.tgz


# Install Python tools
WORKDIR $bs 

RUN sudo python install.py -e /usr/bin/easy_install 
RUN sudo easy_install birdsuite-1.0-py2.5.egg ; exit 0 
RUN sudo easy_install mpgutils-0.7-py2.5.egg ; exit 0 
RUN sudo ln -s /usr/local/bin/* . \
&& rm *.egg \
&& rm *.py


# Install R packages 
#
RUN wget --no-check-certificate https://cran.r-project.org/src/contrib/mclust_5.0.2.tar.gz \
&& sudo R CMD INSTALL mclust_5.0.2.tar.gz \
&& rm mclust_5.0.2.tar.gz 

RUN tar -xvf broadgap.utils_1.0.tar.gz \
\
# Fix broadgap.utils
&& cd broadgap.utils \
&& rm -r man \
&& cd .. \
&& R CMD build broadgap.utils \
\
# Fix broadgap.cnputils
&& tar -xvf broadgap.cnputils_1.0.tar.gz \
&& cd broadgap.cnputils/ \
&& echo 'exportPattern( "." )' > NAMESPACE \
&& cd .. \
&& R CMD build broadgap.cnputils \
\
# Fix broadgap.canary
&& tar -xvf broadgap.canary_1.0.tar.gz \
&& cd broadgap.canary/ \
&& echo 'exportPattern( "." )' > NAMESPACE \
&& cd .. \
&& R CMD build broadgap.canary \
&& rm -r broadgap.utils broadgap.cnputils broadgap.canary \
&& R CMD INSTALL -l $bs  broadgap.utils_1.0.tar.gz \
&& R CMD INSTALL -l $bs  broadgap.cnputils_1.0.tar.gz \
&& R CMD INSTALL -l $bs  broadgap.canary_1.0.tar.gz \
&& rm -f broadgap.utils_1.0.tar.gz broadgap.cnputils_1.0.tar.gz broadgap.canary_1.0.tar.gz

WORKDIR /

Monday, September 28, 2015

Docker for Bioinformatics and Genetics - Part 2

Docker Tutorial Part 2: Basic Commands and Running Rstudio Server with Docker

1. Basic docker commands

  • docker ps shows the currently running containers
  • docker ps -a
    This command shows all containers in the system, including stopped ones. You will see that quite a number of containers have accumulated.
psytky03@ubuntu:~$ docker ps -a 
CONTAINER ID        IMAGE                   COMMAND                 
73af7edafa08        psytky03/eigandplink    "plink --file data/to"  
5d37644676f5        psytky03/eigandplink    "plink --file data/to"  
f85336ec937f        psytky03/eigandplink    "eigenstrat"            
0bd8bff7af23        psytky03/eigandplink    "bash"                  
08f6a6f41583        psytky03/eigandplink    "bash"                  
c3e1e1645e1a        psytky03/eigandplink    "plink --help"          
cc592b883668        psytky03/eigandplink    "plink -v"              
f9873e947389        ubuntu                  "bash"                  
43c5be566760        hello-world             "/hello"               
  • docker rm CONTAINER_ID deletes an unused container
psytky03@ubuntu:~$ docker rm 73af7edafa08 
73af7edafa08
  • docker run --rm automatically deletes the container after it exits
psytky03@ubuntu:~$ docker run -ti --rm psytky03/eigandplink /bin/bash
root@44914cfcfd57:/# exit
exit

#Now let's try to delete this container with id 44914cfcfd57

psytky03@ubuntu:~$ docker rm 44914cfcfd57
Error response from daemon: no such id: 44914cfcfd57
Error: failed to remove containers: [44914cfcfd57]
  • docker images to show all images in the system
psytky03@ubuntu:~$ docker images
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
psytky03/eigandplink   latest              6fd94c3b6e6e        38 hours ago        392.7 MB
ubuntu                 latest              91e54dfb1179        5 weeks ago         188.4 MB
hello-world            latest              af340544ed62        7 weeks ago         960 B
  • docker rmi IMAGE_ID deletes the image. Note that if a container was created from the image, you need to either delete the container first or force the deletion with -f.
psytky03@ubuntu:~$ docker rmi af340544ed62
Error response from daemon: Conflict, cannot delete af340544ed62 because the container 43c5be566760 is using it, use -f to force
Error: failed to remove images: [af340544ed62]


psytky03@ubuntu:~$ docker rm 43c5be566760
43c5be566760

psytky03@ubuntu:~$ docker rmi af340544ed62
Untagged: hello-world:latest
Deleted: af340544ed62de0680f441c71fa1a80cb084678fed42bae393e543faea3a572c
Deleted: 535020c3e8add9d6bb06e5ac15a261e73d9b213d62fb2c14d752b8e189b2b912
  • docker pull to fetch the image from the dockerhub. 
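The docker rm and docker rmi steps above depend on knowing which containers still use an image. As a small illustration, the sketch below parses a shortened copy of the docker ps -a listing shown earlier and collects the container IDs created from a given image. It is only a sketch: the function name is my own, and it assumes the columns are separated by at least two spaces, as in the listing above.

```python
import re

# A shortened copy of the `docker ps -a` listing shown above.
PS_OUTPUT = """\
CONTAINER ID        IMAGE                   COMMAND
73af7edafa08        psytky03/eigandplink    "plink --file data/to"
f9873e947389        ubuntu                  "bash"
43c5be566760        hello-world             "/hello"
"""


def containers_using(ps_output, image):
    """Return IDs of containers in `docker ps -a` text that were created from `image`."""
    ids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Assumption: columns are separated by two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 2 and fields[1] == image:
            ids.append(fields[0])
    return ids


print(containers_using(PS_OUTPUT, "hello-world"))
```

The IDs returned for hello-world are the ones that have to be removed with docker rm before docker rmi can delete the image without -f.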

2. Pull Rstudio Server image from DockerHub

The Rstudio server image can be fetched from Dockerhub at this link rocker/rstudio.

Since rocker/rstudio is not yet available on this Linux system, running docker run rocker/rstudio would make docker look for the image on DockerHub by default and download it if it can be found.
Here we use docker pull to fetch it first
docker pull rocker/rstudio
psytky03@ubuntu:~$ docker pull rocker/rstudio
Using default tag: latest
latest: Pulling from rocker/rstudio

acdec9ec413b: Pull complete 
37cbf6c3413f: Pull complete 
7fc523e9982e: Pull complete 
359b980a72ed: Pull complete 
01c4f8f74e82: Pull complete 
46ac2c9f1a72: Pull complete 
2c218622f9d2: Pull complete 
a01f119d835b: Pull complete 
a5688ed074d1: Pull complete 
1e631da06610: Pull complete 
27866ec542bc: Pull complete 
ac88bcc53e01: Pull complete 
061029545a15: Pull complete 
be5e2022c62b: Pull complete 
776ce8e1ace5: Pull complete 
5608458f8ff4: Pull complete 
0948d1536ce1: Pull complete 
88b0e66f47b7: Pull complete 
62c92c006919: Pull complete 
20c2eeb79e8b: Pull complete 
7b226db5341a: Pull complete 
88c331104d7a: Pull complete 
eafb6b5866c1: Pull complete 
e19a7c84d5f1: Pull complete 
Digest: sha256:488f7c427a970b1d11424e8c8a01269e132d41ee1913a6bfadb97c4105fc2719
Status: Downloaded newer image for rocker/rstudio:latest
Confirm it with docker images
psytky03@ubuntu:~$ docker images
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
psytky03/eigandplink   latest              6fd94c3b6e6e        39 hours ago        392.7 MB
rocker/rstudio         latest              e19a7c84d5f1        4 days ago          1.343 GB
ubuntu                 latest              91e54dfb1179        5 weeks ago         188.4 MB

3. Run Rstudio Server

Here we use docker run -d -p 8888:8787 rocker/rstudio to create a container and run it.
The -d flag tells docker to run the container in the background, as one would for a web application.

The -p flag tells docker how to map ports between the host and the container. Rstudio Server uses 8787 as its default port, and here we map it to port 8888 on the host.
psytky03@ubuntu:~$ docker run -d -p 8888:8787 rocker/rstudio
0d1a38402d5ecd68e18276e5e2cd979e21f255e3c2b3525b9388c9a20501fd6e
Check the host IP address at eth0; my IP is 172.16.118.129.
Type the address 172.16.118.129:8888 into the web browser on the Mac and you will see the Rstudio login window.
Log in to the system; the username and password are both rstudio.
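It is easy to mix up the two sides of the -p value, so here is a small sketch that spells out the convention: the host port comes first and the container port second, and the browser uses the host port. The function names are my own, and the sketch only handles the simple two-field HOST:CONTAINER form used in this post (docker accepts other forms as well).

```python
def parse_port_mapping(mapping):
    """Split a `HOST:CONTAINER` value as passed to `docker run -p`."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)


def browser_url(host_ip, mapping):
    """Build the address to type into the browser: host IP plus the host-side port."""
    host_port, _container_port = parse_port_mapping(mapping)
    return f"http://{host_ip}:{host_port}"


print(parse_port_mapping("8888:8787"))
print(browser_url("172.16.118.129", "8888:8787"))
```

With the values from this post, the address is the host IP with port 8888, while Rstudio Server inside the container keeps listening on 8787.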
Now let's install the ggplot2 package
install.packages("ggplot2")
library("ggplot2")
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
Awesome, everything seems to be running well.
Now let's stop this container with the docker stop command, but first we need to get the container ID with docker ps
psytky03@ubuntu:~$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             
0d1a38402d5e        rocker/rstudio      "/usr/bin/supervisord"   20 minutes ago      

psytky03@ubuntu:~$ docker stop 0d1a38402d5e
0d1a38402d5e
Start it again
psytky03@ubuntu:~$ docker start 0d1a38402d5e
0d1a38402d5e
Everything is back again. If we want the ggplot2 library to be part of the image, we need to save the container's state into the image with the docker commit command:
psytky03@ubuntu:~$ docker commit -m "add ggplot2" 0d1a38402d5e rocker/rstudio
28024380329aa1f8fa23499fce8ea3f56233d19eb48a711111584cd4ecff7285
It is also possible to show all the intermediate layers of this image with docker history. Similar to the snapshots of a virtual machine, we can run a container from any of these states by giving docker the image ID.
psytky03@ubuntu:~$  docker history rocker/rstudio
IMAGE               CREATED             CREATED BY                                    
28024380329a        2 minutes ago       /usr/bin/supervisord -c /etc/supervisor/conf. 
e19a7c84d5f1        4 days ago          /bin/sh -c #(nop) CMD ["/usr/bin/supervisord" 
eafb6b5866c1        4 days ago          /bin/sh -c #(nop) EXPOSE 8787/tcp             
88c331104d7a        4 days ago          /bin/sh -c mkdir -p /var/log/supervisor   &&  
7b226db5341a        4 days ago          /bin/sh -c #(nop) COPY file:8294dfdc8c50662db 
20c2eeb79e8b        4 days ago          /bin/sh -c #(nop) COPY file:df4e748b71577b0ec 
62c92c006919        4 days ago          /bin/sh -c #(nop) COPY file:674055700f4c6fbee 
88b0e66f47b7        4 days ago          /bin/sh -c usermod -l rstudio docker   && use 
0948d1536ce1        4 days ago          /bin/sh -c echo '\n\n# Configure httr to perf 
5608458f8ff4        4 days ago          /bin/sh -c rm -rf /var/lib/apt/lists/   && ap 
776ce8e1ace5        4 days ago          /bin/sh -c #(nop) ENV LANG=en_US.UTF-8        
be5e2022c62b        4 days ago          /bin/sh -c #(nop) ENV PATH=/usr/lib/rstudio-s 
061029545a15        4 days ago          /bin/sh -c #(nop) MAINTAINER "Carl Boettiger  
ac88bcc53e01        2 weeks ago         /bin/sh -c #(nop) CMD ["R"]                   
27866ec542bc        2 weeks ago         /bin/sh -c apt-get update                          
1e631da06610        2 weeks ago         /bin/sh -c #(nop) ENV R_BASE_VERSION=3.2.2    
a5688ed074d1        2 weeks ago         /bin/sh -c echo "deb http://http.debian.net/d 
a01f119d835b        2 weeks ago         /bin/sh -c #(nop) ENV LANG=en_US.UTF-8        
2c218622f9d2        2 weeks ago         /bin/sh -c #(nop) ENV LC_ALL=en_US.UTF-8      
46ac2c9f1a72        2 weeks ago         /bin/sh -c echo "en_US.UTF-8 UTF-8" >> /etc/l 
01c4f8f74e82        2 weeks ago         /bin/sh -c apt-get update                          
359b980a72ed        2 weeks ago         /bin/sh -c useradd docker                          
7fc523e9982e        2 weeks ago         /bin/sh -c #(nop) MAINTAINER "Carl Boettiger  
37cbf6c3413f        2 weeks ago         /bin/sh -c #(nop) CMD ["/bin/bash"]           
acdec9ec413b        2 weeks ago         /bin/sh -c #(nop) ADD file:9101a1e6d928e77435     

Saturday, September 26, 2015

Docker for Bioinformatics and Genetics - Part 1

Docker Tutorial 1: Playing with Docker to deploy genetics software

This is a tutorial on using docker to set up bioinformatics and genetics analysis tools.

1. First let's install the docker

The installation procedure can be found in Docker's official user guide for Ubuntu.
I am running the Ubuntu 14.04.03 LTS server version in VMware Fusion Pro. First confirm the Linux kernel version with uname -r; mine returns 3.19.0-25-generic, which is fine since Docker requires kernel version 3.10 or higher.
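The kernel check can also be scripted. The sketch below compares the major and minor numbers of a uname -r style string against 3.10; the function name is my own, and treating 3.10 itself as acceptable is an assumption on my part. Comparing the numbers as integers matters here, because a plain string comparison would rank "3.9" above "3.19".

```python
import re


def kernel_at_least(release, minimum=(3, 10)):
    """Return True if a `uname -r` style string is at least `minimum` (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Compare as integer tuples so that 3.19 is correctly newer than 3.9.
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(kernel_at_least("3.19.0-25-generic"))
```

On a live system the string could come from platform.release() in the Python standard library instead of being typed in by hand.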
curl is preinstalled in the system; if it is missing, install it with
sudo apt-get install curl
The installation of Docker is done in one step
curl -sSL https://get.docker.com/ | sh
Verify the installation from the log message
...
cgroup-lite start/running
Setting up docker-engine (1.8.2-0~trusty) ...
docker start/running, process 3512
Processing triggers for libc-bin (2.19-0ubuntu6.6) ...
Processing triggers for ureadahead (0.100.0-16) ...
+ sudo -E sh -c docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64

If you would like to use Docker as a non-root user, you should now consider
adding your user to the "docker" group with something like:

  sudo usermod -aG docker psytky03

Remember that you will have to log out and back in for this to take effect!
Add the user to the docker user group:
sudo usermod -aG docker psytky03
Log out and log back in again.
Run a HelloWorld test image
docker run hello-world
Here is the output 
Hello from Docker.
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker Hub account:
 https://hub.docker.com

For more examples and ideas, visit:
 https://docs.docker.com/userguide/
docker run -it ubuntu bash

2. Sign up for a DockerHub account

DockerHub is similar in concept to GitHub: it is a place for pushing and pulling docker images
https://hub.docker.com/login/
Here I got a username as "psytky03"

3. Make the first Dockerfile

A Dockerfile is the blueprint that tells docker how to create an image. It contains the basic information, such as which platform the image is based on, the full set of commands for installing the software, the system path, etc.
Here I am going to install two tools: the latest Eigensoft ver 6.01 for principal component analysis (PCA) and Plink ver 1.9.
The contents of this Dockerfile look like this:
FROM ubuntu

MAINTAINER Psytky03
RUN sudo apt-get update
RUN sudo apt-get -y install wget git unzip
RUN sudo apt-get -y install libgsl0ldbl gfortran-4.4
RUN git clone https://github.com/DReichLab/EIG.git

RUN sudo apt-get -y install wget unzip python
RUN wget https://www.cog-genomics.org/static/bin/plink150903/plink_linux_x86_64.zip
RUN unzip plink_linux_x86_64.zip -d plinkbin


ENV PATH $PATH:/EIG/bin:/plinkbin
RUN mkdir data

4. Build the Docker image

mkdir my_first_docker_image
cd my_first_docker_image/

nano Dockerfile   
# Copy and paste the contents listed in section 3

docker build -t psytky03/eigandplink .
It will take a while; in the end, the log shows that the image was built without errors
Successfully built 87f7aefa7968
Check the built image with docker images command
docker images 
------------------------------------------------------------------------------------------
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
psytky03/eigandplink   latest              6fd94c3b6e6e        3 minutes ago       392.7 MB
ubuntu                 latest              91e54dfb1179        5 weeks ago         188.4 MB
hello-world            latest              af340544ed62        7 weeks ago         960 B
The size of this image is 392 MB
Verify the Eigensoft and Plink
docker run psytky03/eigandplink plink --help
docker run psytky03/eigandplink eigenstrat

# To get a shell inside a new container
docker run -it psytky03/eigandplink bash

# To list all containers, including stopped ones
docker ps -a
Run Plink with data in the host machine
#First let's grab some test files shipped with Plink

wget https://www.cog-genomics.org/static/bin/plink150903/plink_linux_x86_64.zip
unzip plink_linux_x86_64.zip -d plink1.9
Now we use docker run -v to bridge the folder in the host machine to the data folder in the container:
docker run -v /home/psytky03/plink1.9:/data psytky03/eigandplink \
plink --file data/toy --make-bed --out data/test 
Check the plink1.9 folder and you should be able to see the test.bed test.bim test.fam files. 
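To make the folder mapping explicit, here is a small sketch that builds the same docker run -v command as an argument list and translates container paths back to their host locations. The function names are my own. It also assumes the container's working directory is /, so that the relative path data/test given to plink resolves to /data/test; the Dockerfile in section 3 sets no WORKDIR, which is consistent with that assumption.

```python
import shlex
from pathlib import PurePosixPath


def docker_run_with_volume(image, host_dir, container_dir, command):
    """Build (without running) the argument list for `docker run -v HOST:CONTAINER`."""
    return ["docker", "run", "-v", f"{host_dir}:{container_dir}", image, *command]


def host_path(container_path, host_dir, container_dir):
    """Translate a path inside the mounted container folder to its host location."""
    relative = PurePosixPath(container_path).relative_to(container_dir)
    return str(PurePosixPath(host_dir) / relative)


args = docker_run_with_volume(
    image="psytky03/eigandplink",
    host_dir="/home/psytky03/plink1.9",
    container_dir="/data",
    command=["plink", "--file", "data/toy", "--make-bed", "--out", "data/test"],
)
print(shlex.join(args))

# Where the files written to /data inside the container appear on the host.
for suffix in ("bed", "bim", "fam"):
    print(host_path(f"/data/test.{suffix}", "/home/psytky03/plink1.9", "/data"))
```

On a machine with Docker installed, the list in args could be handed to subprocess.run(args, check=True) to actually run the command; the sketch deliberately stops at printing it.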

5. Push the image to DockerHub

docker login
#Login Succeeded

docker push psytky03/eigandplink

6. Pull the image on another Linux machine

docker pull psytky03/eigandplink