This workflow shows how the chromosome 22 SNP calls from Phase I of the 1000 Genomes Project were processed.

Prerequisites

Download the GENCODE annotation set (version 3c, hg19):

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by interval2sequences to extract the genomic sequences for the entries specified in the annotation set:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP calls in VCF format, the corresponding tabix index, and the panel file that assigns each sample to a population:

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:

$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz
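A quick sanity check on the extracted file can catch a wrong region argument early. The sketch below is an illustration only: count_by_chrom is a hypothetical helper, and the three-record VCF fragment is made-up sample data, not taken from the 1000 Genomes release. It tallies data records per chromosome from uncompressed VCF text on standard input.

```shell
# Hypothetical helper: count VCF data records per chromosome (column 1).
# Reads uncompressed VCF text on stdin and skips header lines starting with '#'.
count_by_chrom() {
    awk -F '\t' '!/^#/ { n[$1]++ } END { for (c in n) print c "\t" n[c] }' | sort
}

# Made-up sample records, for illustration only.
sample_vcf=$(printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\n22\t100\t.\tA\tG\n22\t200\t.\tC\tT\n22\t300\t.\tG\tA\n')

printf '%s\n' "$sample_vcf" | count_by_chrom
```

On the real file, the equivalent call would be zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | count_by_chrom, which should report a single line for chromosome 22.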

Preprocessing of the annotation file

Decompress the annotation file:

$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements, together with the start_codon and stop_codon features, ignoring entries tagged mRNA_start_NF or mRNA_end_NF:

$ grep -v '^#' gencode.v3c.annotation.GRCh37.gtf | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' | grep -v mRNA_end_NF | grep -v mRNA_start_NF > gencode.v3c.annotation.GRCh37.filtered.gtf
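
To see what this filter keeps, it can be run on a few hand-made lines. The sketch below wraps the same pipeline in a hypothetical function, filter_cds, and applies it to made-up GTF-like records (the gene IDs, coordinates, and attribute fields are invented for illustration). Only the first CDS record and the stop_codon record should survive: the exon record does not match the feature pattern, and the second CDS record carries the mRNA_end_NF tag.

```shell
# Hypothetical wrapper around the filter pipeline; reads GTF text on stdin.
filter_cds() {
    grep -v '^#' \
      | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' \
      | grep -v mRNA_end_NF \
      | grep -v mRNA_start_NF
}

# Made-up records, for illustration only.
sample_gtf=$(
    printf '##description: made-up example\n'
    printf 'chr22\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr22\tHAVANA\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr22\tENSEMBL\tCDS\t300\t400\t.\t+\t0\tgene_id "G2"; transcript_id "T2"; tag "mRNA_end_NF";\n'
    printf 'chr22\tENSEMBL\tstop_codon\t401\t403\t.\t+\t0\tgene_id "G3"; transcript_id "T3";\n'
)

printf '%s\n' "$sample_gtf" | filter_cds
```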

Convert the GENCODE GTF file into Interval format:

$ gencode2interval < gencode.v3c.annotation.GRCh37.filtered.gtf > gencode.v3c.annotation.GRCh37.filtered.interval

Retrieve the genomic sequences for the transcripts specified in the annotation file:

$ interval2sequences hg19.2bit gencode.v3c.annotation.GRCh37.filtered.interval exonic > gencode.v3c.annotation.GRCh37.filtered.fa

Annotation of the SNPs

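Annotate the variants using snpMapper. Its two arguments are the Interval file and the sequence file of the annotation set. The commands from here on refer to these files as gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval and gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa; if you generated the annotation files with the preprocessing commands above, they are named gencode.v3c.annotation.GRCh37.filtered.interval and gencode.v3c.annotation.GRCh37.filtered.fa, so adjust the file names accordingly: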

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | snpMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf

Modification of the VCF header line

Modify the VCF header line to assign individual samples to populations (groups), using the syntax group:sample (e.g. CEU:NA0705).
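
The effect on the header line can be illustrated with a short awk sketch. This is not how vcfModifyHeader is implemented; it only shows the resulting group:sample form. It assumes a panel file whose first two tab-separated columns are the sample name and its population, and the sample names S1 and S2 are placeholders invented for the example.

```shell
# Made-up panel and header files, for illustration only.
workdir=$(mktemp -d)
printf 'S1\tCEU\nS2\tYRI\n' > "$workdir/example.panel"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' > "$workdir/example.header.vcf"

# Hypothetical helper: prefix each sample column of the #CHROM line with its group.
prefix_samples() {
    # $1 = panel file (assumed columns: sample, population), $2 = VCF header file
    awk -F '\t' -v OFS='\t' '
        NR == FNR { group[$1] = $2; next }   # first file: map sample to population
        /^#CHROM/ {
            for (i = 10; i <= NF; i++)       # sample columns start at field 10
                if ($i in group) $i = group[$i] ":" $i
        }
        { print }
    ' "$1" "$2"
}

prefix_samples "$workdir/example.panel" "$workdir/example.header.vcf"
```

With this input, the last two columns of the printed header line become CEU:S1 and YRI:S2. Samples that are absent from the panel are left unchanged in this sketch.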

First get the old meta-data lines:

$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:

$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf

Create the new meta-data lines:

$ vcfModifyHeader ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf 

Merge the new meta-data lines with the annotated variants and create a new file called ALL.2of4intersection.20100804.chr22.vcf:

$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf
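
Before compressing, it may be worth confirming that nothing was lost in the split and merge. The sketch below uses a hypothetical helper, merge_looks_ok, and tiny made-up files in place of the real ones. It checks that the merged file begins with a header line and has as many lines as the header and variants files combined.

```shell
# Made-up stand-ins for the header, variants, and merged files.
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\n' > "$tmp/newHeader.vcf"
printf '22\t100\n22\t200\n22\t300\n'         > "$tmp/variants.vcf"
cat "$tmp/newHeader.vcf" "$tmp/variants.vcf" > "$tmp/merged.vcf"

# Hypothetical check: succeeds when the merged file starts with a header line
# and has exactly as many lines as its two parts combined.
merge_looks_ok() {
    # $1 = header file, $2 = variants file, $3 = merged file
    parts=$(cat "$1" "$2" | wc -l | tr -d ' ')
    merged=$(wc -l < "$3" | tr -d ' ')
    [ "$parts" -eq "$merged" ] && head -n 1 "$3" | grep -q '^#'
}

merge_looks_ok "$tmp/newHeader.vcf" "$tmp/variants.vcf" "$tmp/merged.vcf" && echo "merge OK"
```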

Compress the newly created VCF file with the annotated variants:

$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created VCF file with the annotated variants:

$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz

Generation of summaries and images

Generate gene and sample summaries for the annotated variants:

$ vcfSummary ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

This should produce two files: ALL.2of4intersection.20100804.chr22.geneSummary.txt and ALL.2of4intersection.20100804.chr22.sampleSummary.txt
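
A small check can confirm that both summaries were written. The helpers below are hypothetical, and they assume the output names are formed by replacing the .vcf.gz suffix of the input, which matches the two names listed above but is not otherwise documented here.

```shell
# Hypothetical helper: print the summary file names expected for an input VCF,
# assuming the ".vcf.gz" suffix is replaced as in the names listed above.
expected_summaries() {
    prefix=${1%.vcf.gz}
    printf '%s\n' "$prefix.geneSummary.txt" "$prefix.sampleSummary.txt"
}

# Report any expected file that is missing or empty in the current directory.
report_missing() {
    expected_summaries "$1" | while IFS= read -r f; do
        [ -s "$f" ] || echo "missing or empty: $f"
    done
}

expected_summaries ALL.2of4intersection.20100804.chr22.vcf.gz
```

Running report_missing ALL.2of4intersection.20100804.chr22.vcf.gz in the working directory prints nothing when both files exist and are non-empty.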

Make a new directory to store the images and VCF files for each gene:

$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant:

$ vcf2images ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Subset the VCF file with the annotated variants by gene:

$ vcfSubsetByGene ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Setting up the web server

Make a gzipped tarball containing all of the relevant files:

  • Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
  • File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
  • File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
  • Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
  • Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)

$ tar -pczvf ALL.2of4intersection.20100804.chr22.tar.gz \
   ALL.2of4intersection.20100804.chr22 \
   ALL.2of4intersection.20100804.chr22.geneSummary.txt \
   ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
   ALL.2of4intersection.20100804.chr22.vcf.gz \
   ALL.2of4intersection.20100804.chr22.vcf.gz.tbi
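
Listing the archive before uploading is a cheap way to confirm that every item made it in. The sketch below builds a throwaway archive from empty placeholder files (the example.* names and GENE1.vcf are invented) and defines a hypothetical helper, archive_has, that tests for a member by its exact path.

```shell
# Build a throwaway archive from empty placeholder files, for illustration only.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    mkdir example
    : > example/GENE1.vcf
    : > example.geneSummary.txt
    : > example.sampleSummary.txt
    tar -czf example.tar.gz example example.geneSummary.txt example.sampleSummary.txt
)

# Hypothetical helper: succeed if the archive lists the given member path exactly.
archive_has() {
    # $1 = archive, $2 = member path to look for
    tar -tzf "$1" | grep -qxF "$2"
}

archive_has "$demo/example.tar.gz" example.geneSummary.txt && echo "gene summary present"
```

For the real archive, tar -tzf ALL.2of4intersection.20100804.chr22.tar.gz prints the full member list, which should cover the five items named above.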

Open the upload page of your VAT installation in your web browser and click the “Processed data set” tab to display the upload form for processed data sets. Choose your .tar.gz archive using the file input box and click Submit. Once the file has been processed, click View Results.