Installation of the external GSL and GD libraries

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the GNU Scientific Library website):

$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install

Similarly, the GD library can be installed on most systems with the following commands:

$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

Installation and Configuration of libBIOS

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/gsl-1.14/lib:/path/to/libbios/lib:/path/to/gd-2.0.35/lib

libBIOS can be installed on most systems with the following commands:

$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd` 
$ make
$ make install

Installation and Configuration of VAT

A few simple steps are required to install VAT:

$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd` 
$ make
$ make install

VAT uses a configuration file that resides in one's home directory as .vatrc and in the web root as vat.conf. It contains a set of variables that are used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.

// =============================================================================
// REQUIRED
// =============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabixdir

// Directory where VAT executables are
VAT_EXEC_DIR /path/to/vat_exe


// =============================================================================
// OPTIONAL (required only for CGIs)
// =============================================================================

// Path to processed data sets
WEB_DATA_DIR /path/to/data/sets

// URL to preprocessed files
WEB_DATA_URL https://webserver/data/sets

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_REFERENCE_DIR /path/to/data/reference

WEB_DATA_WORKING_DIR /path/to/data/working

WEB_DATA_RAW_DIR /path/to/data/raw

// =============================================================================
// AWS/S3 Configuration values
// =============================================================================

// Option for turning Amazon Simple Storage Service (S3) support on or off.
// Use true to activate S3, false to deactivate. Note that if S3 support is
// active, you will need to enter your AWS account information for the VAT web
// components in web/lib/aws/config.inc.php
AWS_USE_S3 false

// S3 access key ID
AWS_ACCESS_KEY_ID access_key_id

// S3 secret access key
AWS_SECRET_ACCESS_KEY secret_key

// S3 hostname
AWS_S3_HOSTNAME s3.amazonaws.com

// The name of the S3 bucket for processed data sets. If S3 support is enabled,  
// this bucket is used instead of WEB_DATA_DIR
AWS_S3_DATA_BUCKET data-bucket

// The name of the S3 bucket for raw VCF input files. If S3 support is enabled,
// this bucket is used instead of WEB_DATA_RAW_DIR 
AWS_S3_RAW_BUCKET raw-bucket

// =============================================================================
// Set only if setting up as master node in master/worker configuration
// =============================================================================

// Set to true if we are using the master/worker cluster configuration, false
// if we are running single-node only
CLUSTER false

// IP address of master node. Used by worker to access the master's API
MASTER_ADDRESS xxx.xx.xxx.xx
// ----------------------------------------------------------------------------
// Used by master only:
// ----------------------------------------------------------------------------

// MySQL configuration
MASTER_MYSQL_HOST localhost
MASTER_MYSQL_USER user
MASTER_MYSQL_PASS pass
MASTER_MYSQL_DB dbname

This file has to be configured properly by filling in the required information.
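
The file format described above (name/value pairs separated by spaces or tabs, with empty lines and lines starting with '//' ignored) can be read with a few lines of code. The following Python sketch is only an illustration of the format. The function name parse_vat_config is hypothetical and is not part of VAT.

```python
def parse_vat_config(text):
    """Parse VAT-style configuration text into a dict of name -> value."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip empty lines and comment lines starting with '//'
        if not line or line.startswith("//"):
            continue
        # Split on the first run of spaces or tabs: name first, then value
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[name] = value
    return settings


example = """
// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabixdir

AWS_USE_S3\tfalse
"""
config = parse_vat_config(example)
print(config["TABIX_DIR"])
print(config["AWS_USE_S3"])
```

A directive without a value is stored as an empty string in this sketch; how the VAT programs treat such a line is not specified here.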

Running make install copies the configuration file to your home directory as .vatrc; this copy is used when running VAT programs manually on the command line. Afterwards, the environment variable VAT_CONFIG_FILE should be set to the location of this file. It is recommended that your shell start-up script sets this variable:

VAT_CONFIG_FILE=/pathTo/vat/.vatrc

A VAT configuration file also exists in the web root as vat.conf and is expected and loaded by the VAT web application.

Setting up the VAT web application is optional, but it is very useful for visualizing the results of processed data sets.

Configuring PHP

Due to the large file sizes uploaded to VAT, PHP must be configured to allow larger upload sizes. In your php.ini file, set upload_max_filesize and post_max_size to at least 100M:

upload_max_filesize = 100M
post_max_size = 100M

It is also recommended to turn off output buffering so that flush() works properly:

output_buffering = Off

VAT Setup and Configuration

During make, the VAT configuration file should have been copied into the web directory under the VAT source tree. If it is not present, copy the VAT configuration file default.vatrc from the root of the source tree into the web directory and rename it vat.conf.

Copy the contents of the web directory to your Apache web root directory. This is usually /var/www/html or /var/www. Make the data directory, which contains the directory tree used by the VAT I/O layer, readable and writable:

$ sudo chmod -R 777 data

You will need to download the GENCODE annotation files used by VAT. The get_annotation_sets.sh script in the /scripts directory under the VAT source tree may be used to download all the necessary annotation files using wget:

$ cd /web/root/data/reference
$ sudo /path/to/vat-x.x.x/scripts/get_annotation_sets.sh

Edit the VAT configuration file in the web root according to your installation. If you wish to set up an Amazon S3-backed installation, create two web-accessible buckets, one for storing raw VCF files and one for storing processed data sets. In your VAT configuration file, enable S3-backed storage by setting the AWS_USE_S3 directive to true and setting your AWS credentials and bucket names:

AWS_USE_S3 true

AWS_ACCESS_KEY_ID access_key_id
AWS_SECRET_ACCESS_KEY secret_key

AWS_S3_DATA_BUCKET data-bucket
AWS_S3_RAW_BUCKET raw-bucket

The WEB_DATA_URL directive must be set to the URL where the processed data sets are stored. If S3-backed storage is enabled, it should be set to the S3 URL of your data bucket:

WEB_DATA_URL http://s3.amazonaws.com/data-bucket

If you are setting up VAT to store all files locally, set WEB_DATA_URL to the URL to the directory where processed data sets are stored, which is by default data/sets:

WEB_DATA_URL http://webserver/data/sets

Regardless of whether S3-backed storage is enabled, the WEB_DATA_WORKING_DIR directive must be set to the working directory that the I/O layer uses to give each VAT process a unique copy of files requested on demand. Also, the WEB_DATA_REFERENCE_DIR directive must be set to the directory containing the reference GENCODE annotation files. By default the directories are data/working and data/reference respectively:

WEB_DATA_WORKING_DIR /web/root/data/working
WEB_DATA_REFERENCE_DIR /web/root/data/reference

If S3-backed storage is disabled, raw VCF files and processed data sets are stored in local directories instead of two S3 buckets. The directives WEB_DATA_RAW_DIR and WEB_DATA_DIR must be set to point to the directories used to store raw VCF files and processed data sets, which are by default data/raw and data/sets respectively:

WEB_DATA_RAW_DIR /web/root/data/raw
WEB_DATA_DIR /web/root/data/sets
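
The rules above can be summarized as a small consistency check. The following Python sketch takes the directives as a dictionary and reports which of the directives named in this section are missing for the chosen storage mode. The function name missing_directives is hypothetical, and treating AWS_USE_S3 case-insensitively is an assumption; VAT itself may validate its configuration differently.

```python
def missing_directives(config):
    """Return the directives required by the chosen storage mode that are absent."""
    # Required regardless of whether S3-backed storage is enabled
    required = ["WEB_DATA_URL", "WEB_DATA_WORKING_DIR", "WEB_DATA_REFERENCE_DIR"]
    # Assumption: the value is compared case-insensitively against "true"
    if config.get("AWS_USE_S3", "false").lower() == "true":
        required += ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                     "AWS_S3_DATA_BUCKET", "AWS_S3_RAW_BUCKET"]
    else:
        required += ["WEB_DATA_RAW_DIR", "WEB_DATA_DIR"]
    return [name for name in required if not config.get(name)]


local_setup = {
    "AWS_USE_S3": "false",
    "WEB_DATA_URL": "http://webserver/data/sets",
    "WEB_DATA_WORKING_DIR": "/web/root/data/working",
    "WEB_DATA_REFERENCE_DIR": "/web/root/data/reference",
    "WEB_DATA_RAW_DIR": "/web/root/data/raw",
}
print(missing_directives(local_setup))
```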

VCF

The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the 1000 Genomes Project. A detailed summary of this file format can be found in the VCF format specification. The annotation information is captured as part of the INFO field using the VA (Variant Annotation) tag. The string with the variant information has the following format:

AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:

Variant  Type (1)       Transcript  Transcript  Transcript  Relative position  Relative position  Amino acid    Transcript
                        name        ID          length      of variant (2)     of amino acid      substitution  overlap
-------  -------------  ----------  ----------  ----------  -----------------  -----------------  ------------  ----------
SNP      synonymous     Yes         Yes         Yes         Yes                Yes                Yes           No
         nonsynonymous  Yes         Yes         Yes         Yes                Yes                Yes           No
         prematureStop  Yes         Yes         Yes         Yes                Yes                Yes           No
         removedStop    Yes         Yes         Yes         Yes                Yes                Yes           No
         spliceOverlap  Yes         Yes         Yes         Yes                Yes                Yes           No
Indel    insertionFS    Yes         Yes         Yes         Yes                Yes                Yes           No
         insertionNFS   Yes         Yes         Yes         Yes                Yes                Yes           No
         deletionFS     Yes         Yes         Yes         Yes                Yes                Yes           No
         deletionNFS    Yes         Yes         Yes         Yes                Yes                Yes           No
         startOverlap   Yes         Yes         Yes         No                 No                 No            No
         endOverlap     Yes         Yes         Yes         No                 No                 No            No
         spliceOverlap  Yes         Yes         Yes         No                 No                 No            No
SV       svOverlap      Yes         Yes         Yes         No                 No                 No            Yes

Notes:
  1. FS <=> frameshift, NFS <=> non-frameshift
  2. Relative position with respect to the transcript start site

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by snpMapper): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by indelMapper), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).
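
The allele numbering can be illustrated with a short Python sketch that maps an allele number back to the allele sequence, given the REF and ALT columns of a VCF record. The function name allele_for_number is hypothetical; the REF and ALT values are taken from Example 4 below.

```python
def allele_for_number(ref, alt, allele_number):
    """Return the allele sequence for a VCF allele number (0 = reference)."""
    # Alternate alleles are comma-separated in the ALT column and are
    # numbered starting at one, in the order in which they are listed
    alleles = [ref] + alt.split(",")
    return alleles[allele_number]


ref, alt = "TACAACAACA", "T,TACA"
for number in range(3):
    print(number, allele_for_number(ref, alt, number))
```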

Example 1

A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1	23112837	.	A	T	.	PASS	AA=A;AC=7;AN=118;DP=168;SF=2;VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*

Example 2

A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1	1110357	.	G	A	.	PASS	AA=G;AC=3;AN=118;DP=203;SF=2;VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H

Example 3

A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9	35819390	rs2381409	C	T	.	PASS	AA=N;AC=157;AN=240;DP=49;SF=0,1;VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106

Example 4

An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7	140118541	.	TACAACAACA	T,TACA	.	PASS	HP=1;VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
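
The structure of the VA tag can be made concrete with a small parser. The Python sketch below splits the INFO field of Example 3 into its annotation entries. It assumes that every affected transcript is described by exactly three colon-separated fields (name, ID, and a details string), as in the SNP and indel examples above; entries produced by svMapper or genericMapper may not follow this layout. The function name parse_va is hypothetical.

```python
def parse_va(info):
    """Extract the VA annotation entries from a VCF INFO string."""
    va_value = None
    for field in info.split(";"):
        if field.startswith("VA="):
            va_value = field[len("VA="):]
    if va_value is None:
        return []
    entries = []
    # Multiple annotation entries are comma-separated
    for entry in va_value.split(","):
        parts = entry.split(":")
        allele, gene_name, gene_id, strand, var_type, fraction = parts[:6]
        rest = parts[6:]
        # Assumption: each transcript is (name, ID, details)
        transcripts = [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]
        entries.append({
            "allele": int(allele),
            "gene_name": gene_name,
            "gene_id": gene_id,
            "strand": strand,
            "type": var_type,
            "fraction": fraction,
            "transcripts": transcripts,
        })
    return entries


info = ("AA=N;AC=157;AN=240;DP=49;SF=0,1;"
        "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
        "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
        "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
        "TMEM8B-001:ENST00000450762:2106")
for entry in parse_va(info):
    print(entry["allele"], entry["type"], entry["transcripts"])
```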

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using vcfModifyHeader. Specifically, each sample name is prefixed by a group identifier, using the ':' character as a delimiter.

Interval

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1]

  1. Name of the interval
  2. Chromosome
  3. Strand
  4. Interval start (with respect to the "+" strand)
  5. Interval end (with respect to the "+" strand)
  6. Number of sub-intervals
  7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
  8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

Note: For the purpose of VAT, the name field in the Interval file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the gencode2interval program ensures proper formatting.

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201	chr8	-	134472009	134488267	6	134472009,134474117,134475656,134477020,134478136,134487961	134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202	chr8	-	134472009	134488267	6	134472009,134474117,134475656,134477020,134478136,134487961	134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203	chr8	-	134472009	134488267	6	134472009,134474117,134475656,134477020,134478136,134487961	134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201	chr16	+	3097544	3105947	4	3097544,3100009,3100254,3105830	3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202	chr16	+	3096918	3109096	10	3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827	3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096
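
To make the column layout concrete, the following Python sketch parses one line of the example file into named fields, including the four-part name required by VAT. This is an illustration of the format only and is unrelated to the intervalFind implementation in libBIOS; the function name parse_interval_line is hypothetical.

```python
def parse_interval_line(line):
    """Split one Interval line into named fields (coordinates become ints)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 tab-delimited columns, got %d" % len(fields))
    name, chrom, strand, start, end, count, sub_starts, sub_ends = fields
    # VAT naming convention: geneId|transcriptId|geneName|transcriptName
    gene_id, transcript_id, gene_name, transcript_name = name.split("|")
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval lists do not match the declared count")
    return {
        "gene_id": gene_id,
        "transcript_id": transcript_id,
        "gene_name": gene_name,
        "transcript_name": transcript_name,
        "chromosome": chrom,
        "strand": strand,
        "start": int(start),
        "end": int(end),
        "sub_intervals": list(zip(starts, ends)),
    }


line = "\t".join([
    "ENSG00000008516|ENST00000325800|MMP25|MMP25-201", "chr16", "+",
    "3097544", "3105947", "4",
    "3097544,3100009,3100254,3105830",
    "3097548,3100145,3100546,3105947",
])
record = parse_interval_line(line)
print(record["transcript_name"], record["sub_intervals"][0])
```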

VAT Core Modules

snpMapper

snpMapper is a program to annotate a set of SNPs in VCF format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

Usage
snpMapper <annotation.interval> <annotation.fa>
Inputs: Takes a VCF input from STDIN.
Outputs: Outputs annotated SNPs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
Optional arguments: None.

indelMapper

indelMapper is a program to annotate a set of indels in VCF format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

Usage
indelMapper <annotation.interval> <annotation.fa>
Inputs: Takes a VCF input from STDIN.
Outputs: Outputs annotated indels in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
Optional arguments: None.

svMapper

svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene.

Usage
svMapper <annotation.interval>
Inputs: Takes a VCF input from STDIN.
Outputs: Outputs annotated SVs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
Optional arguments: None.

genericMapper

genericMapper is a program to annotate a number of different variants in VCF format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

Usage
genericMapper <annotation.interval> <nameFeature>
Inputs: Takes a VCF input from STDIN.
Outputs: Outputs the annotated variants in VCF format. The annotation information is captured as part of the INFO field.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. This can be a generic Interval.
  • nameFeature - Specifies the type of the annotation feature (for example, promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
Optional arguments: None.

vcfSummary

vcfSummary is a program to aggregate annotated variants across genes and samples.

Usage
vcfSummary <file.vcf.gz> <annotation.interval>
Inputs: None.
Outputs: Generates two output files. The first file, named file.geneSummary.txt, contains the number of variants categorized by type for each gene. The second file, named file.sampleSummary.txt, summarizes the number of variants categorized by type for each sample.
Required arguments
  • file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using bgzip and indexed using the tabix program.
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
Optional arguments: None.

vcf2images

vcf2images is a program to generate an image for each gene to visualize the effects of the annotated variants.

Usage
vcf2images <file.vcf.gz> <annotation.interval> <outputDir>
Inputs: None.
Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
Required arguments
  • file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using bgzip and indexed using the tabix program.
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • outputDir - The output directory where the images are stored
Optional arguments: None.

vcfSubsetByGene

vcfSubsetByGene is a program to subset a VCF file with annotated variants by gene.

Usage
vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>
Inputs: None.
Outputs: Generates a VCF file for each gene that has at least one annotated variant.
Required arguments
  • file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using bgzip and indexed using the tabix program.
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • outputDir - The output directory where VCF files are stored
Optional arguments: None.

vcfModifyHeader

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a VCF file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including vcfSummary).

Usage
vcfModifyHeader <oldHeader.vcf> <groups.txt>
Inputs: None.
Outputs: Generates a VCF header file.
Required arguments
  • oldHeader.vcf - The meta lines of a VCF file. They can be obtained by using the following command:
    grep '^#' file.vcf > file.header.vcf
  • groups.txt - A tab-delimited file that assigns each sample present in the VCF to a group/population. Here is a small sample file:
    HG00629 CHS
    HG00634 CHS
    HG00635 CHS
    HG00637 PUR
    HG00638 PUR
    HG00640 PUR
    NA06984 CEU
    NA06985 CEU
    NA06986 CEU
    NA06989 CEU
    NA06994 CEU
Optional arguments: None.
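
The group:sample convention can be illustrated with a short Python sketch that rewrites the sample columns of a #CHROM header line using the contents of a groups file. This is not the implementation of vcfModifyHeader; the function names are hypothetical, and leaving samples without a group assignment unchanged is an assumption. The sketch relies on the VCF convention that the first nine columns of the header line (#CHROM through FORMAT) are fixed and are followed by the sample names.

```python
def read_groups(text):
    """Map sample name -> group from the contents of a groups file."""
    groups = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            sample, group = parts
            groups[sample] = group
    return groups


def add_groups_to_header(header_line, groups):
    """Prefix each sample column of a #CHROM header line with its group."""
    columns = header_line.rstrip("\n").split("\t")
    # The first nine columns (#CHROM ... FORMAT) are fixed; samples follow
    fixed, samples = columns[:9], columns[9:]
    # Assumption: samples without a group assignment are left unchanged
    renamed = [groups[s] + ":" + s if s in groups else s for s in samples]
    return "\t".join(fixed + renamed)


groups = read_groups("HG00629\tCHS\nHG00637\tPUR\nNA06984\tCEU\n")
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "HG00629", "HG00637", "NA06984"])
print(add_groups_to_header(header, groups))
```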

Auxiliary programs

gencode2interval

gencode2interval converts a GENCODE annotation file (in GTF format) to the Interval format.

Usage
gencode2interval
Inputs: Takes a GENCODE annotation file in GTF format from STDIN.
Outputs: Outputs the GENCODE annotation file in Interval format to STDOUT.
Required arguments: None.
Optional arguments: None.

Note: Remove all header lines in the annotation file before running gencode2interval. Also filter out coding transcripts that do not have an annotated start or stop as follows:

grep -v '^#' gencode.v19.annotation.gtf | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' | grep -v mRNA_end_NF | grep -v mRNA_start_NF > gencode.v19.annotation.filtered.gtf
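
For readers who prefer to see the filtering logic spelled out, the following Python sketch applies the same three conditions as the shell pipeline above: header lines are dropped, only HAVANA or ENSEMBL lines with a CDS, start_codon, or stop_codon feature are kept, and lines tagged mRNA_end_NF or mRNA_start_NF are removed. The function name keep_gtf_line and the sample lines are hypothetical and serve only as an illustration.

```python
import re

# Same pattern as the awk expression in the shell pipeline
FEATURE_PATTERN = re.compile(r"\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t")


def keep_gtf_line(line):
    """Return True if a GTF line passes the filter used before gencode2interval."""
    if line.startswith("#"):
        return False  # header line
    if not FEATURE_PATTERN.search(line):
        return False  # wrong source or feature type
    # Drop transcripts without an annotated start or stop
    return "mRNA_end_NF" not in line and "mRNA_start_NF" not in line


# Hypothetical, shortened GTF lines for illustration only
lines = [
    "##description: example header",
    'chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1";',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1";',
    'chr1\tENSEMBL\tstop_codon\t300\t302\t.\t+\t0\tgene_id "G2"; tag "mRNA_start_NF";',
]
for line in lines:
    print(keep_gtf_line(line), line.split("\t")[0])
```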

interval2sequences

Module to retrieve genomic/exonic sequences for an annotation set in Interval format.

Usage
interval2sequences <file.2bit> <file.annotation> <exonic|genomic>
Inputs: None.
Outputs: Reports the extracted sequences in FASTA format.
Required arguments
  • file.2bit - genome reference sequence in 2bit format
  • file.annotation - annotation set in Interval format (each line represents one transcript)
  • < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
Optional arguments: None.

Note: Run interval2sequences from a directory where you have write permission, since it may create temporary files.

External programs

bgzip/tabix

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information consult the tabix documentation page.

VCF tools

VCF tools consists of a suite of very useful modules to manipulate VCF files. For more information consult the documentation page.

This workflow shows how the 1000 Genomes Project, Phase I, chr22, SNP calls data set was processed.

Prerequisites

Download the GENCODE annotation set (version 3c, hg19):

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by interval2sequences to extract the genomic sequences for the entries specified in the annotation set:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP files in VCF format and a third file that assigns each sample to a population:

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:

$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz

Preprocessing of the annotation file

Decompress the annotation file:

$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements ignoring mRNA_start_NF and mRNA_end_NF:

$ grep -v '^#' gencode.v3c.annotation.GRCh37.gtf | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' | grep -v mRNA_end_NF | grep -v mRNA_start_NF > gencode.v3c.annotation.GRCh37.filtered.gtf

Convert the GENCODE GTF file into Interval format:

$ gencode2interval < gencode.v3c.annotation.GRCh37.filtered.gtf > gencode.v3c.annotation.GRCh37.filtered.interval

Retrieve the sequences for the transcripts specified in the annotation file (using the exonic mode):

$ interval2sequences hg19.2bit gencode.v3c.annotation.GRCh37.filtered.interval exonic > gencode.v3c.annotation.GRCh37.filtered.fa

Annotation of the SNPs

Annotate the variants using snpMapper, passing the Interval and FASTA files generated in the preprocessing steps above:

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | snpMapper gencode.v3c.annotation.GRCh37.filtered.interval gencode.v3c.annotation.GRCh37.filtered.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf

Modification of the VCF header line

Modify the VCF header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA06984).

First get the old meta-data lines:

$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:

$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf

Create the new meta-data lines:

$ vcfModifyHeader ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf 

Merge the new meta-data lines with the annotated variants and create a new file called ALL.2of4intersection.20100804.chr22.vcf:

$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf

Compress the newly created VCF file with the annotated variants:

$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created VCF file with the annotated variants:

$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz

Generation of summaries and images

Generate gene and sample summaries for the annotated variants:

$ vcfSummary ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.filtered.interval

Resulting files should be: ALL.2of4intersection.20100804.chr22.geneSummary.txt and ALL.2of4intersection.20100804.chr22.sampleSummary.txt

Make a new directory to store the images and VCF files for each gene.

$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant:

$ vcf2images ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.filtered.interval ./ALL.2of4intersection.20100804.chr22

Subset the VCF file with the annotated variants by gene:

$ vcfSubsetByGene ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.filtered.interval ./ALL.2of4intersection.20100804.chr22

Setting up the web server

Make a gzipped tarball containing all of the relevant files:

  • Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
  • File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
  • File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
  • Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
  • Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)
$ tar -pczvf ALL.2of4intersection.20100804.chr22.tar.gz \
   ALL.2of4intersection.20100804.chr22 \
   ALL.2of4intersection.20100804.chr22.geneSummary.txt \
   ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
   ALL.2of4intersection.20100804.chr22.vcf.gz \
   ALL.2of4intersection.20100804.chr22.vcf.gz.tbi

Open the upload page of your VAT installation in your web browser and click on the “Processed data set” tab for the upload form for uploading processed data sets. Choose your .tar.gz archive using the file input box and click Submit. Once the file has been processed, click View Results.