At the beginning
This is a quick tutorial to show how to use RareMetal to do a burden anlysis for a targeted resequencing project. In this project, a single gene was resequenced on 105 patients and 312 healthy controls, and 7 rare variants were identified. Due to the low-freuency of these rare variants, single variant association test is unlikely to produce statistical meaningful results. Alternatively we could do a burden analysis to test whehter there is an excessive burden of rare variants at the gene level.Data (rare variant occurance)
SNP | Case | Control |
---|---|---|
SNP1 | 3 | 4 |
SNP2 | 2 | 4 |
SNP3 | 1 | 0 |
SNP4 | 1 | 0 |
SNP5 | 1 | 0 |
SNP6 | 1 | 0 |
SNP7 | 1 | 0 |
Installation of software
I used an Ubuntu 14.04 PC for the analysis. It should work for other Linux distributions as well.1. Download the precompiled RareMetal Linux executable Files
#Install RareMetal and RareMetalWork
mkdir temp_app
cd temp_app
wget http://genome.sph.umich.edu/w/images/b/bb/Raremetal.linux.executable.4.13.6.tgz -O raremetal.tgz
tar zxvf raremetal.tgz
cd linux_executables
sudo cp * /usr/local/bin/
cd ..
2. Install PedStats (For Ped File Intergrity check) wget http://csg.sph.umich.edu/abecasis/PedStats/download/Linux-pedstats.tar.gz -O pedstats.tar.gz
tar zxvf pedstats.tar.gz
cd pedstats-0.6.12
sudo cp pedstats /usr/local/bin
3. Install Tabix and bgzip # Download tabix from the sourceforge
# http://sourceforge.net/projects/samtools/files/tabix/
wget wget http://downloads.sourceforge.net/project/samtools/tabix/tabix-0.2.6.tar.bz2 -O tabix.tar.bz
tar xvjf tabix.tar.bz2
cd tabix-0.2.6
make
sudo cp tabix /usr/local/bin
sudo cp bgzip /usr/local/bin
cd ../../
Preparation of the data files
1. Ped fileFirst we need to make the Ped file which will contain the genotype information. I made the Excel spreedsheet downlaod from here
There are 6 + 7 * 2 = 20 columns
the first 6 columns are:
- family_id
- individual_id
- father_id
- mother_id
- gender (1 for male; 2 for female)
- disease status (1 for unaffected; 2 for affected)
Export the file as Tab Delimited Text file named "geneA.ped"
2. Dat file
Then we need to make another data file which specify the disease name and variant information. open a text editor and save the belowing contents.
T Autism
M 1:8798395
M 1:8800614
M 1:8801278
M 1:8802373
M 1:8809426
M 1:8809671
M 1:8809715
T is for trait name (corresponding to the sixth column of the ped file)M is for marker information
1:8798395 represents the location of the marker (chromosome 1 and position 8798385)
save this file as "geneA.dat"
Run the Program
CAUTION: Change the new line type from "CR" to "LF"step 1: check the intergrity of the ped files
pedstats -p geneA.ped -d geneA.dat
The part of the output is shown as belowMARKER GENOTYPE STATISTICS
==========================
[Genotypes] [Founders] Hetero
1:8798395 417 100.0% 417 100.0% 1.7%
1:8800614 417 100.0% 417 100.0% 1.4%
1:8801278 417 100.0% 417 100.0% 0.2%
1:8802373 417 100.0% 417 100.0% 0.2%
1:8809426 417 100.0% 417 100.0% 0.2%
1:8809671 417 100.0% 417 100.0% 0.2%
1:8809715 417 100.0% 417 100.0% 0.2%
Total 2919 100.0% 2919 100.0% 0.6%
Total markers: 7
step 2: create the covariate and score filesraremetalworker --ped geneA.ped --dat geneA.dat --prefix raremetal
Confirm the processFriendly reminder: if your file has a lot of individuals not genotyped, the following process might take very long ... done.
Analyzing trait "Autism" ...
Matching individuals in phenotype and genotypes ...
Found 417 phenotyped AND genotyped individuals from 417 families.
done.
Start fitting linear model ... done.
Fitting model used 0.0 minutes.
Genomic Control for additive model is: 6.53196
completed.
Single variant statistics stored in
raremetal.Autism.singlevar.score.txt
Variance-covariance matrix stored in
raremetal.Autism.singlevar.cov.txt
QQ and Manhattan plots saved in
raremetal.Autism.plots.pdf
Log file listing current options stored in raremetal.singlevar.log
Three files were generatedraremetal.Autism.singlevar.cov.txt
raremetal.Autism.singlevar.score.txt
raremetal.singlevar.log
step 3: run raremetalbgzip raremetal.Autism.singlevar.score.txt
tabix -c "#" -s 1 -b 2 -e 2 raremetal.Autism.singlevar.score.txt.gz
bgzip raremetal.Autism.singlevar.cov.txt
tabix -c "#" -s 1 -b 2 -e 2 raremetal.Autism.singlevar.cov.txt.gz
make a file named "covfiles" which contains line belowraremetal.Autism.singlevar.cov.txt.gz
make a file named "summaryfiles" which contain line belowraremetal.Autism.singlevar.score.txt.gz
make a file named "group.txt" which contain line belowGeneA 1:8798395:G:A 1:8800614:T:C 1:8801278:G:A 1:8802373:G:A 1:8809426:G:T 1:8809671:G:A 1:8809715:G:A
raremetal --summaryFiles summaryfiles --covFiles covfiles --groupFile group.txt --SKAT --burden --MB --VT --prefix burden
Check the processRAREMETAL 4.13.6 -- A Tool for Rare Variants Meta-Analyses for Quantitative Traits
(c) 2012-2013 Shuang Feng, Dajiang Liu, Goncalo Abecasis
Please go to "http://genome.sph.umich.edu/wiki/RAREMETAL" for the newest version.
Options:
List of Studies : --summaryFiles [summaryfiles], --covFiles [covfiles]
Grouping Methods : --groupFile [group.txt], --annotatedVcf [],
--annotation [], --writeVcf
QC Options : --hwe [1.0e-05], --callRate [0.95]
Association Methods : --burden [ON], --MB [ON], --SKAT [ON], --VT [ON],
--condition []
Other Options : --labelHits, --geneMap [../data/refFlat_hg19.txt],
--correctGC, --prefix [burden], --maf [0.05],
--longOutput, --tabulateHits, --dosage,
--hitsCutoff [1.0e-06]
PhoneHome : --noPhoneHome, --phoneHomeThinning [100]
Analysis started at: Wed Apr 15 16:55:24 2015
Pooling summary statistics from study 1 ...
done
Performing Single variant meta analysis ...
Genomic Control for all studies are:
6.53195
done.
Checking if all groups are analyzed...
done!
Reading cov matrix from study 1 ...
done
Performing burden tests ...
done.
Performing MB tests ...
done.
Performing Variable Threshold tests ...
Done.
Performing SKAT ...
Done.
QQ plots and manhattan polts have been saved in burden.meta.plots.pdf.
Analysis ends at: Wed Apr 15 16:55:24 2015
Analysis took 1 seconds (~ 0.0 hours).
Check the final results
open the file "burden.meta.burden.results"##Method=Burden
##STUDY_NUM=1
##TotalSampleSize=417
#GROUPNAME NUM_VAR VARs AVG_AF MIN_AF MAX_AF EFFECT_SIZE PVALUE
geneA 7 1:8798395:G:A;1:8800614:T:C;1:8801278:G:A;1:8802373:G:A;1:8809426:G:T;1:8809671:G:A;1:8809715:G:A 0.00308325 0.00119904 0.00839329 0.317461 0.00240259
The last value is the burden P-Value.