Data Formats
VCF
The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the 1000 Genomes Project. A detailed summary of this file format can be found here. The annotation information is captured as part of the INFO field using the VA (Variant Annotation) tag. The string with the variant information has the following format:
AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}
All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type, as shown in the table below:
Variant | Type | Transcript name | Transcript ID | Transcript length | Relative position of variant | Relative position of amino acid | Amino acid substitution | Transcript overlap |
---|---|---|---|---|---|---|---|---|
SNP | synonymous | Yes | Yes | Yes | Yes | Yes | Yes | No |
SNP | nonsynonymous | Yes | Yes | Yes | Yes | Yes | Yes | No |
SNP | prematureStop | Yes | Yes | Yes | Yes | Yes | Yes | No |
SNP | removedStop | Yes | Yes | Yes | Yes | Yes | Yes | No |
SNP | spliceOverlap | Yes | Yes | Yes | Yes | Yes | Yes | No |
Indel | insertionFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
Indel | insertionNFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
Indel | deletionFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
Indel | deletionNFS | Yes | Yes | Yes | Yes | Yes | Yes | No |
Indel | startOverlap | Yes | Yes | Yes | No | No | No | No |
Indel | endOverlap | Yes | Yes | Yes | No | No | No | No |
Indel | spliceOverlap | Yes | Yes | Yes | No | No | No | No |
SV | svOverlap | Yes | Yes | Yes | No | No | No | Yes |
Notes:
The allele number identifies the allele that an annotation entry describes. By definition, the reference allele has allele number zero, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs (annotated by snpMapper), the type takes one of the following values: synonymous, nonsynonymous, prematureStop, removedStop, or spliceOverlap. For indels (annotated by indelMapper), the type takes one of the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, or deletionNFS, where FS denotes 'frameshift' and NFS denotes 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).
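The allele numbering described above can be illustrated with a short Python sketch. The function name `allele_for_number` is hypothetical and not part of VAT; the sketch only assumes that alternate alleles in the ALT column are comma-separated, as in Example 4 below.

```python
def allele_for_number(ref, alt, allele_number):
    """Return the allele sequence for an allele number (0 = reference allele)."""
    # Alternate alleles are numbered from 1 in the order they appear in ALT.
    alleles = [ref] + alt.split(",")
    return alleles[allele_number]


# REF and ALT values taken from Example 4.
print(allele_for_number("TACAACAACA", "T,TACA", 0))  # TACAACAACA
print(allele_for_number("TACAACAACA", "T,TACA", 2))  # TACA
```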
Example 1
A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.
chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*
Example 2
A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.
chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H
Example 3
A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.
chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106
Example 4
An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.
chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L
Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
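As an illustration of the layout described above, the following Python sketch splits a VA value into its annotation entries. It is not part of VAT: the function name `parse_va` and the dictionary keys are hypothetical. The sketch assumes, based only on the four examples, that no field contains a comma or a colon and that every transcript contributes exactly three colon-separated fields (name, ID, and an underscore-joined details field).

```python
def parse_va(va_value):
    """Split a VA value into one dictionary per annotation entry."""
    entries = []
    for raw_entry in va_value.split(","):  # entries are comma-separated
        fields = raw_entry.split(":")
        allele, gene_name, gene_id, strand, variant_type, fraction = fields[:6]
        affected, total = fraction.split("/")
        transcript_fields = fields[6:]
        transcripts = []
        # Assumption: three fields per transcript (name, ID, details).
        for i in range(0, len(transcript_fields), 3):
            name, transcript_id, details = transcript_fields[i:i + 3]
            transcripts.append({
                "name": name,
                "id": transcript_id,
                # e.g. ["2109", "166", "56", "P->S"], or ["2106"] for spliceOverlap
                "details": details.split("_"),
            })
        entries.append({
            "allele_number": int(allele),
            "gene_name": gene_name,
            "gene_id": gene_id,
            "strand": strand,
            "type": variant_type,
            "transcripts_affected": int(affected),
            "transcripts_total": int(total),
            "transcripts": transcripts,
        })
    return entries


# VA value from Example 3.
va = ("1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:"
      "ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:"
      "spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106")
for entry in parse_va(va):
    print(entry["type"], [t["id"] for t in entry["transcripts"]])
```

An entry whose transcript list is not a multiple of three fields raises a `ValueError` here, which is a deliberate choice for a sketch: it surfaces lines that do not match the assumed layout instead of silently misparsing them.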
VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations, or they can be designated as cases or controls. This is done by modifying the header line using vcfModifyHeader. Specifically, each sample name is prefixed with a group identifier, using the ':' character as a delimiter.
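The group prefix can be read back with a few lines of Python. This sketch is not the vcfModifyHeader implementation; the function names and the sample and group names are hypothetical. It assumes a standard VCF header line in which the sample columns follow the nine fixed columns (#CHROM through FORMAT).

```python
def split_group(column_name):
    """Split 'group:sample' into (group, sample); group is None if no prefix."""
    group, delimiter, sample = column_name.partition(":")
    if not delimiter:
        return None, column_name
    return group, sample


def group_samples(header_line):
    """Map each group identifier to its sample names, in header order."""
    columns = header_line.rstrip("\n").split("\t")
    groups = {}
    for column in columns[9:]:  # sample columns follow the 9 fixed columns
        group, sample = split_group(column)
        groups.setdefault(group, []).append(sample)
    return groups


# Hypothetical header line with two groups.
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "case:sampleA", "case:sampleB", "control:sampleC"]
)
print(group_samples(header))
```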
Interval
The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1]. The eight columns are:
- Name of the interval
- Chromosome
- Strand
- Interval start (with respect to the "+" strand)
- Interval end (with respect to the "+" strand)
- Number of sub-intervals
- Sub-interval starts (with respect to the "+" strand, comma-delimited)
- Sub-interval ends (with respect to the "+" strand, comma-delimited)
Note: For the purpose of VAT, the name field in the Interval file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the gencode2interval program ensures proper formatting.
Example file:
ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096
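To make the column layout concrete, here is a minimal Python sketch that parses a single Interval record. It is not part of libBIOS or VAT: the `Interval` class and `parse_interval_line` function are hypothetical names. It assumes tab-delimited input and the four-part name field required by VAT, and it makes no assumption about whether coordinates are 0-based or 1-based.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    gene_id: str
    transcript_id: str
    gene_name: str
    transcript_name: str
    chromosome: str
    strand: str
    start: int
    end: int
    sub_starts: list
    sub_ends: list


def parse_interval_line(line):
    """Parse one tab-delimited Interval record with a VAT-style name field."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 8:
        raise ValueError(f"expected 8 columns, found {len(columns)}")
    name, chromosome, strand, start, end, count, starts, ends = columns
    # VAT requires geneId|transcriptId|geneName|transcriptName in the name field.
    gene_id, transcript_id, gene_name, transcript_name = name.split("|")
    sub_starts = [int(value) for value in starts.split(",") if value]
    sub_ends = [int(value) for value in ends.split(",") if value]
    if len(sub_starts) != int(count) or len(sub_ends) != int(count):
        raise ValueError("sub-interval count does not match the coordinate lists")
    return Interval(gene_id, transcript_id, gene_name, transcript_name,
                    chromosome, strand, int(start), int(end), sub_starts, sub_ends)


# The MMP25-201 record from the example file, joined with tabs.
line = "\t".join([
    "ENSG00000008516|ENST00000325800|MMP25|MMP25-201", "chr16", "+",
    "3097544", "3105947", "4",
    "3097544,3100009,3100254,3105830", "3097548,3100145,3100546,3105947",
])
record = parse_interval_line(line)
print(record.transcript_name, record.sub_starts)
```

Checking the sub-interval count against the two coordinate lists is a cheap way to catch truncated or misaligned records before they reach downstream tools.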