Coding: How to Perform Burden Analysis for Targeted Next Generation Sequencing (NGS) study using RAREMETAL

At the beginning

This is a quick tutorial to show how to use RareMetal to do a burden anlysis for a targeted resequencing project. In this project, a single gene was resequenced on 105 patients and 312 healthy controls, and 7 rare variants were identified. Due to the low-freuency of these rare variants, single variant association test is unlikely to produce statistical meaningful results. Alternatively we could do a burden analysis to test whehter there is an excessive burden of rare variants at the gene level.

Data (rare variant occurance)

SNP	Case	Control
SNP1	3	4
SNP2	2	4
SNP3	1	0
SNP4	1	0
SNP5	1	0
SNP6	1	0
SNP7	1	0

Note: all carriers are hetero

Installation of software

I used an Ubuntu 14.04 PC for the analysis. It should work for other Linux distributions as well.
1. Download the precompiled RareMetal Linux executable Files

#Install RareMetal and RareMetalWork
mkdir temp_app
cd temp_app
wget http://genome.sph.umich.edu/w/images/b/bb/Raremetal.linux.executable.4.13.6.tgz -O raremetal.tgz
tar zxvf raremetal.tgz
cd linux_executables
sudo cp * /usr/local/bin/
cd ..

2. Install PedStats (For Ped File Intergrity check)

wget http://csg.sph.umich.edu/abecasis/PedStats/download/Linux-pedstats.tar.gz -O pedstats.tar.gz
tar zxvf pedstats.tar.gz  
cd pedstats-0.6.12
sudo cp pedstats /usr/local/bin

3. Install Tabix and bgzip

# Download tabix from the sourceforge 
# http://sourceforge.net/projects/samtools/files/tabix/

wget wget http://downloads.sourceforge.net/project/samtools/tabix/tabix-0.2.6.tar.bz2 -O tabix.tar.bz
tar xvjf tabix.tar.bz2 
cd tabix-0.2.6
make
sudo cp tabix /usr/local/bin
sudo cp bgzip /usr/local/bin
cd ../../

Preparation of the data files

1. Ped file
First we need to make the Ped file which will contain the genotype information. I made the Excel spreedsheet downlaod from here
There are 6 + 7 * 2 = 20 columns
the first 6 columns are:

family_id
individual_id
father_id
mother_id
gender (1 for male; 2 for female)
disease status (1 for unaffected; 2 for affected)

the remaing columns are simple genotype (e.g. G G)
Export the file as Tab Delimited Text file named "geneA.ped"
2. Dat file
Then we need to make another data file which specify the disease name and variant information. open a text editor and save the belowing contents.

T Autism
M 1:8798395
M 1:8800614
M 1:8801278
M 1:8802373
M 1:8809426
M 1:8809671
M 1:8809715

T is for trait name (corresponding to the sixth column of the ped file)
M is for marker information
1:8798395 represents the location of the marker (chromosome 1 and position 8798385)
save this file as "geneA.dat"

Run the Program

CAUTION: Change the new line type from "CR" to "LF"
step 1: check the intergrity of the ped files

pedstats -p geneA.ped -d geneA.dat

The part of the output is shown as below

MARKER GENOTYPE STATISTICS
==========================

                    [Genotypes]      [Founders]     Hetero
      1:8798395      417 100.0%      417 100.0%       1.7%
      1:8800614      417 100.0%      417 100.0%       1.4%
      1:8801278      417 100.0%      417 100.0%       0.2%
      1:8802373      417 100.0%      417 100.0%       0.2%
      1:8809426      417 100.0%      417 100.0%       0.2%
      1:8809671      417 100.0%      417 100.0%       0.2%
      1:8809715      417 100.0%      417 100.0%       0.2%
          Total     2919 100.0%     2919 100.0%       0.6%

Total markers: 7

step 2: create the covariate and score files

raremetalworker  --ped geneA.ped --dat geneA.dat --prefix raremetal

Confirm the process

Friendly reminder:  if your file has a lot of individuals not genotyped, the following process might take very long ...  done.
Analyzing trait "Autism" ...
  Matching individuals in phenotype and genotypes ...
    Found 417 phenotyped AND genotyped individuals from 417 families.
  done.

  Start fitting linear model ...   done.
  Fitting model used 0.0 minutes.
  Genomic Control for additive model is: 6.53196
completed.

Single variant statistics stored in 
  raremetal.Autism.singlevar.score.txt

Variance-covariance matrix stored in 
  raremetal.Autism.singlevar.cov.txt

QQ and Manhattan plots saved in 
  raremetal.Autism.plots.pdf


Log file listing current options stored in raremetal.singlevar.log

Three files were generated

raremetal.Autism.singlevar.cov.txt
raremetal.Autism.singlevar.score.txt
raremetal.singlevar.log

step 3: run raremetal

bgzip raremetal.Autism.singlevar.score.txt
tabix -c "#" -s 1 -b 2 -e 2 raremetal.Autism.singlevar.score.txt.gz
bgzip raremetal.Autism.singlevar.cov.txt
tabix -c "#" -s 1 -b 2 -e 2 raremetal.Autism.singlevar.cov.txt.gz

make a file named "covfiles" which contains line below

raremetal.Autism.singlevar.cov.txt.gz

make a file named "summaryfiles" which contain line below

raremetal.Autism.singlevar.score.txt.gz

make a file named "group.txt" which contain line below

GeneA 1:8798395:G:A 1:8800614:T:C 1:8801278:G:A 1:8802373:G:A 1:8809426:G:T 1:8809671:G:A 1:8809715:G:A

raremetal --summaryFiles summaryfiles --covFiles covfiles --groupFile group.txt --SKAT --burden --MB --VT --prefix burden

Check the process

RAREMETAL 4.13.6 -- A Tool for Rare Variants Meta-Analyses for Quantitative Traits
          (c) 2012-2013 Shuang Feng, Dajiang Liu, Goncalo Abecasis


Please go to "http://genome.sph.umich.edu/wiki/RAREMETAL" for the newest version.

Options:
       List of Studies : --summaryFiles [summaryfiles], --covFiles [covfiles]
      Grouping Methods : --groupFile [group.txt], --annotatedVcf [],
                         --annotation [], --writeVcf
            QC Options : --hwe [1.0e-05], --callRate [0.95]
   Association Methods : --burden [ON], --MB [ON], --SKAT [ON], --VT [ON],
                         --condition []
         Other Options : --labelHits, --geneMap [../data/refFlat_hg19.txt],
                         --correctGC, --prefix [burden], --maf [0.05],
                         --longOutput, --tabulateHits, --dosage,
                         --hitsCutoff [1.0e-06]
             PhoneHome : --noPhoneHome, --phoneHomeThinning [100]

Analysis started at: Wed Apr 15 16:55:24 2015

Pooling summary statistics from study 1 ...
  done

Performing Single variant meta analysis ...
  Genomic Control for all studies are:
  6.53195   
  done.


Checking if all groups are analyzed...

    done!
Reading cov matrix from study 1 ...
done
Performing burden tests ...
  done.

Performing MB tests ...
  done.

Performing Variable Threshold tests ...
Done.

Performing SKAT ...
Done.


QQ plots and manhattan polts have been saved in burden.meta.plots.pdf.

Analysis ends at: Wed Apr 15 16:55:24 2015

Analysis took 1 seconds (~ 0.0 hours).

Check the final results

open the file "burden.meta.burden.results"

##Method=Burden
##STUDY_NUM=1
##TotalSampleSize=417
#GROUPNAME  NUM_VAR VARs    AVG_AF  MIN_AF  MAX_AF  EFFECT_SIZE PVALUE
geneA   7   1:8798395:G:A;1:8800614:T:C;1:8801278:G:A;1:8802373:G:A;1:8809426:G:T;1:8809671:G:A;1:8809715:G:A   0.00308325  0.00119904  0.00839329  0.317461    0.00240259

The last value is the burden P-Value.

Coding

Pages

Wednesday, April 15, 2015

How to Perform Burden Analysis for Targeted Next Generation Sequencing (NGS) study using RAREMETAL