Example workflow
This workflow shows how the chr22 SNP call data set from Phase I of the 1000 Genomes Project was processed.
Prerequisites
Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz
Download the human genome (hg19) in 2bit format. This is used by interval2sequences to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
Download the SNP files in VCF format and a third file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel
Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz
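A quick way to confirm that the extraction kept only chromosome 22 is to list the distinct values in the first (CHROM) column of the new file. The sketch below is only an illustration: the helper name list_chromosomes is made up, and it assumes the file can be read with gzip -dc, which holds for bgzip output.

```shell
# List the distinct chromosome names found in the data lines of a
# gzip- or bgzip-compressed VCF file (header lines start with '#').
list_chromosomes() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

# Run the check on the file produced above, if it is present.
vcf=ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz
if [ -f "$vcf" ]; then
    list_chromosomes "$vcf"
fi
```

If the extraction worked, the only name listed should be the one passed to tabix (22).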
Preprocessing of the annotation file
Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz
Extract the coding sequence (CDS), start codon, and stop codon elements, ignoring entries tagged mRNA_start_NF or mRNA_end_NF:
$ grep -v '^#' gencode.v3c.annotation.GRCh37.gtf | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' | grep -v mRNA_end_NF | grep -v mRNA_start_NF > gencode.v3c.annotation.GRCh37.filtered.gtf
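To see what this filter keeps and drops, it can be run on a few toy records. The sketch below wraps the same grep/awk pipeline in a function that reads from standard input. The function names and the GTF records (coordinates, gene and transcript IDs) are invented for illustration and are not taken from GENCODE.

```shell
# Same filter as in the command above, reading GTF from standard input.
filter_cds() {
    grep -v '^#' \
        | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' \
        | grep -v mRNA_end_NF \
        | grep -v mRNA_start_NF
}

# Toy input: a comment line, a CDS, an exon, a CDS tagged mRNA_start_NF,
# and a stop_codon. All values are hypothetical.
toy_gtf() {
    printf '##description: toy example\n'
    printf 'chr22\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr22\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr22\tENSEMBL\tCDS\t400\t500\t.\t+\t0\tgene_id "G2"; transcript_id "T2"; tag "mRNA_start_NF";\n'
    printf 'chr22\tENSEMBL\tstop_codon\t501\t503\t.\t+\t0\tgene_id "G3"; transcript_id "T3";\n'
}

# Only the untagged CDS (T1) and the stop_codon (T3) records pass the filter.
toy_gtf | filter_cds
```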
Convert the GENCODE GTF file into Interval format:
$ gencode2interval < gencode.v3c.annotation.GRCh37.filtered.gtf > gencode.v3c.annotation.GRCh37.filtered.interval
Retrieve the genomic sequences for the transcripts specified in the annotation file.
$ interval2sequences hg19.2bit gencode.v3c.annotation.GRCh37.filtered.interval exonic > gencode.v3c.annotation.GRCh37.filtered.fa
Annotation of the SNPs
Annotate the variants using snpMapper:
$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | snpMapper gencode.v3c.annotation.GRCh37.filtered.interval gencode.v3c.annotation.GRCh37.filtered.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf
Modification of the VCF header line
Modify the VCF header line to assign individual samples to populations (groups). Each sample name is rewritten using the syntax group:sample (e.g. CEU:NA0705).
First get the old meta-data lines:
$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf
Store the annotated variants in a separate file:
$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf
Create the new meta-data lines:
$ vcfModifyHeader ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf
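To make the group:sample renaming concrete, the sketch below performs a comparable rewrite with awk. It is not the implementation of vcfModifyHeader. It assumes that the panel file is tab-delimited with the sample ID in column 1 and the population in column 2, that the panel is non-empty, and that sample names begin at column 10 of the #CHROM line. The function name relabel_samples is hypothetical and the sample IDs are illustrative.

```shell
# Usage: relabel_samples PANEL < header.vcf
# Prefix each sample name on the #CHROM line with its population.
relabel_samples() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { pop[$1] = $2; next }   # first input: panel (sample -> population)
        /^#CHROM/ {
            for (i = 10; i <= NF; i++)     # sample columns start at field 10
                if ($i in pop) $i = pop[$i] ":" $i
        }
        { print }                          # other header lines pass through unchanged
    ' "$1" -
}

# Toy panel and header (illustrative sample IDs).
panel=$(mktemp)
printf 'NA06984\tCEU\tILLUMINA\nNA18486\tYRI\tILLUMINA\n' > "$panel"
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA06984\tNA18486\n' \
    | relabel_samples "$panel"
rm -f "$panel"
```

In the toy output, the last two columns of the #CHROM line become CEU:NA06984 and YRI:NA18486. Samples that are missing from the panel keep their original names.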
Merge the new meta-data lines with the annotated variants and create a new file called ALL.2of4intersection.20100804.chr22.vcf:
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf
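Before the merged file is compressed, it may be worth checking that no variant lines were lost while the header was split off and reattached. The sketch below compares the number of non-header lines in the annotated file and in the merged file. The helper name count_data_lines is made up for illustration.

```shell
# Count the non-header lines (those not starting with '#') in a VCF file.
count_data_lines() {
    grep -vc '^#' "$1"
}

before=ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf
after=ALL.2of4intersection.20100804.chr22.vcf
if [ -f "$before" ] && [ -f "$after" ]; then
    if [ "$(count_data_lines "$before")" -eq "$(count_data_lines "$after")" ]; then
        echo "variant line counts match"
    else
        echo "variant line counts differ" >&2
    fi
fi
```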
Compress the newly created VCF file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf
Index the newly created VCF file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz
Generation of summaries and images
Generate gene and sample summaries for the annotated variants:
$ vcfSummary ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.filtered.interval
The resulting files should be ALL.2of4intersection.20100804.chr22.geneSummary.txt and ALL.2of4intersection.20100804.chr22.sampleSummary.txt.
Make a new directory to store the images and VCF files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22
Generate an image for each gene with at least one annotated variant.
$ vcf2images ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.filtered.interval ./ALL.2of4intersection.20100804.chr22
Subset the VCF file with the annotated variants by gene.
$ vcfSubsetByGene ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.filtered.interval ./ALL.2of4intersection.20100804.chr22
Setting up the web server
Make a gzipped tarball containing all of the relevant files:
- Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
- File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
- File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
- Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
- Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)
$ tar -pczvf ALL.2of4intersection.20100804.chr22.tar.gz \
    ALL.2of4intersection.20100804.chr22 \
    ALL.2of4intersection.20100804.chr22.geneSummary.txt \
    ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
    ALL.2of4intersection.20100804.chr22.vcf.gz \
    ALL.2of4intersection.20100804.chr22.vcf.gz.tbi
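Before uploading, the archive can be checked for the five expected top-level entries. The sketch below lists the distinct top-level names stored in a gzipped tarball. The helper name archive_top_level is made up for illustration.

```shell
# List the distinct top-level entries of a gzipped tarball.
archive_top_level() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}

archive=ALL.2of4intersection.20100804.chr22.tar.gz
if [ -f "$archive" ]; then
    archive_top_level "$archive"
fi
```

For the archive built above, the listing should contain the per-gene directory, the two summary files, the compressed VCF file, and its index.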
Open the upload page of your VAT installation in your web browser and click the "Processed data set" tab to display the upload form for processed data sets. Choose your .tar.gz archive using the file input box and click Submit. Once the file has been processed, click View Results.