The Variant Call Format (VCF) is a tab-delimited text file format for representing genetic variants, including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the 1000 Genomes Project. A detailed summary of this file format can be found here. The annotation information is captured as part of the INFO field using the VA (Variant Annotation) tag. The string with the variant information has the following format:

AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type, as shown in the table below:

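| Variant | Type (1) | Transcript name | Transcript ID | Transcript length | Relative position of variant (2) | Relative position of amino acid | Amino acid substitution | Transcript overlap |
|---------|----------|-----------------|---------------|-------------------|----------------------------------|---------------------------------|-------------------------|--------------------|
| SNP | synonymous | Yes | Yes | Yes | Yes | Yes | Yes | No |
| SNP | nonsynonymous | Yes | Yes | Yes | Yes | Yes | Yes | No |
| SNP | prematureStop | Yes | Yes | Yes | Yes | Yes | Yes | No |
| SNP | removedStop | Yes | Yes | Yes | Yes | Yes | Yes | No |
| SNP | spliceOverlap | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Indel | insertionFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Indel | insertionNFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Indel | deletionFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Indel | deletionNFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Indel | startOverlap | Yes | Yes | Yes | No | No | No | No |
| Indel | endOverlap | Yes | Yes | Yes | No | No | No | No |
| Indel | spliceOverlap | Yes | Yes | Yes | No | No | No | No |
| SV | svOverlap | Yes | Yes | Yes | No | No | No | Yes |

  1. FS <=> frameshift, NFS <=> non-frameshift
  2. Relative position with respect to the transcript start site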

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs (generated by snpMapper), the types can take on the following values: synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by indelMapper), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, and deletionNFS, where FS denotes 'frameshift' and NFS denotes 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).
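The allele numbering can be illustrated with a short Python sketch. It is not part of VAT; the function name alleles_by_number is hypothetical, and the REF and ALT values are the ones used in Example 4 of this document.

```python
def alleles_by_number(ref: str, alt: str) -> list[str]:
    """Return alleles indexed by allele number: 0 is REF, 1..n are the ALT alleles."""
    # The VCF ALT column lists alternate alleles separated by commas.
    return [ref] + alt.split(",")


# REF and ALT columns of the indel shown in Example 4.
alleles = alleles_by_number("TACAACAACA", "T,TACA")
for number, allele in enumerate(alleles):
    print(number, allele)
```

With this indexing, an annotation entry whose allele number is 2 refers to alleles[2], the second alternate allele.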

Example 1

A SNP is introducing a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1	23112837	.	A	T	.	PASS	AA=A;AC=7;AN=118;DP=168;SF=2;VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*

Example 2

A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1	1110357	.	G	A	.	PASS	AA=G;AC=3;AN=118;DP=203;SF=2;VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H

Example 3

A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9	35819390	rs2381409	C	T	.	PASS	AA=N;AC=157;AN=240;DP=49;SF=0,1;VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106

Example 4

An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7	140118541	.	TACAACAACA	T,TACA	.	PASS	HP=1;VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
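The layout of the VA tag can be made concrete with a small parser. The following Python code is a minimal sketch, not VAT's own implementation: the names Transcript, Annotation, and parse_va are hypothetical, and the code assumes that each transcript appears as a name:id:details triple whose details are underscore-separated, as in Examples 1 to 4. Structural variants may carry different transcript fields and are not handled here.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    transcript_id: str
    details: list[str]  # underscore-separated values, e.g. length, positions, substitution


@dataclass
class Annotation:
    allele_number: int
    gene_name: str
    gene_id: str
    strand: str
    variant_type: str
    fraction_affected: str
    transcripts: list[Transcript]


def parse_va(info: str) -> list[Annotation]:
    """Parse the VA tag of a VCF INFO string into Annotation records."""
    # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags).
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    annotations = []
    # Multiple annotation entries are comma-separated.
    for entry in fields.get("VA", "").split(","):
        if not entry:
            continue
        parts = entry.split(":")
        head, rest = parts[:6], parts[6:]
        # Assumption: transcripts come as name:id:details triples, as in the examples.
        transcripts = [
            Transcript(rest[i], rest[i + 1], rest[i + 2].split("_"))
            for i in range(0, len(rest) - 2, 3)
        ]
        annotations.append(Annotation(int(head[0]), *head[1:], transcripts))
    return annotations


# INFO column of Example 3.
info = (
    "AA=N;AC=157;AN=240;DP=49;SF=0,1;"
    "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
    "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
    "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
    "TMEM8B-001:ENST00000450762:2106"
)
for annotation in parse_va(info):
    print(annotation.allele_number, annotation.variant_type,
          [t.name for t in annotation.transcripts])
```

Splitting the INFO string on ';' before splitting the VA value on ',' keeps other comma-containing tags, such as SF=0,1, from being mistaken for annotation entries.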

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations, or they can be designated as cases or controls. This is done by modifying the header line using vcfModifyHeader. Specifically, each sample name is prefixed by a group identifier, using the ':' character as a delimiter.
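As an illustration of the group prefix, the Python sketch below reads a modified header line and collects the samples per group. It is an assumption-laden example rather than VAT code: the function name samples_by_group, the group labels, and the sample names are all made up, and the header is assumed to follow the usual VCF layout with nine fixed columns before the samples.

```python
from collections import defaultdict


def samples_by_group(header_line: str) -> dict[str, list[str]]:
    """Map each group identifier to the sample names that carry it as a prefix."""
    columns = header_line.rstrip("\n").split("\t")
    groups = defaultdict(list)
    # The first nine columns of the #CHROM line are fixed; sample columns follow.
    for column in columns[9:]:
        group, separator, sample = column.partition(":")
        if not separator:
            # No group prefix: file under a placeholder key (a choice of this sketch).
            group, sample = "ungrouped", column
        groups[group].append(sample)
    return dict(groups)


# Hypothetical header line with made-up group and sample names.
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "case:sampleA", "case:sampleB", "control:sampleC"]
)
print(samples_by_group(header))
```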

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1]. The eight columns are:

  1. Name of the interval
  2. Chromosome
  3. Strand
  4. Interval start (with respect to the "+" strand)
  5. Interval end (with respect to the "+" strand)
  6. Number of sub-intervals
  7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
  8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

Note: For the purpose of VAT, the name field in the Interval file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the gencode2interval program ensures proper formatting.
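The column layout and the four-part name field can be read with a few lines of Python. This is a sketch for illustration only and does not reproduce the libBIOS parser; the Interval class and parse_interval function are hypothetical names, and the sample line is one of the MMP25 entries from the example file in this document.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    gene_id: str
    transcript_id: str
    gene_name: str
    transcript_name: str
    chromosome: str
    strand: str
    start: int
    end: int
    sub_starts: list[int]
    sub_ends: list[int]


def parse_interval(line: str) -> Interval:
    """Parse one line of an Interval file whose name field follows the VAT convention."""
    name, chrom, strand, start, end, count, starts, ends = line.rstrip("\n").split("\t")
    # VAT expects geneId|transcriptId|geneName|transcriptName in the name field.
    gene_id, transcript_id, gene_name, transcript_name = name.split("|")
    sub_starts = [int(value) for value in starts.split(",") if value]
    sub_ends = [int(value) for value in ends.split(",") if value]
    # Column 6 must agree with the number of listed sub-interval starts and ends.
    if not (len(sub_starts) == len(sub_ends) == int(count)):
        raise ValueError("sub-interval count does not match the listed starts/ends")
    return Interval(gene_id, transcript_id, gene_name, transcript_name,
                    chrom, strand, int(start), int(end), sub_starts, sub_ends)


line = "\t".join([
    "ENSG00000008516|ENST00000325800|MMP25|MMP25-201", "chr16", "+",
    "3097544", "3105947", "4",
    "3097544,3100009,3100254,3105830",
    "3097548,3100145,3100546,3105947",
])
interval = parse_interval(line)
print(interval.gene_name, interval.transcript_name, len(interval.sub_starts))
```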

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201	chr8	-	134472009	134488267	6	134472009,134474117,134475656,134477020,134478136,134487961	134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202	chr8	-	134472009	134488267	6	134472009,134474117,134475656,134477020,134478136,134487961	134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203	chr8	-	134472009	134488267	6	134472009,134474117,134475656,134477020,134478136,134487961	134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201	chr16	+	3097544	3105947	4	3097544,3100009,3100254,3105830	3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202	chr16	+	3096918	3109096	10	3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827	3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096
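To show what an overlap query over such intervals returns, the Python sketch below uses a plain linear scan. It is deliberately not the NCList algorithm used by intervalFind, which answers the same question more efficiently on large interval sets. The function name find_overlaps is hypothetical, the coordinates are treated as closed intervals (an assumption of this sketch), and the query region is made up.

```python
def find_overlaps(intervals, chromosome, start, end):
    """Linear scan: return names of intervals on `chromosome` that overlap [start, end]."""
    return [
        name
        for name, chrom, interval_start, interval_end in intervals
        # Assumption: both the query and the stored coordinates are closed intervals.
        if chrom == chromosome and interval_start <= end and start <= interval_end
    ]


# Name, chromosome, start, and end taken from three lines of the example file.
intervals = [
    ("ST3GAL1-201", "chr8", 134472009, 134488267),
    ("MMP25-201", "chr16", 3097544, 3105947),
    ("MMP25-202", "chr16", 3096918, 3109096),
]
print(find_overlaps(intervals, "chr16", 3097000, 3097100))  # ['MMP25-202']
```

A linear scan inspects every interval per query, which is fine for a handful of entries; a containment-sublist structure avoids that cost when many queries run against a genome-wide annotation set.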