snpMapper

snpMapper is a program to annotate a set of SNPs in VCF format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

Usage
snpMapper <annotation.interval> <annotation.fa>
Inputs Takes a VCF input from STDIN
Outputs Outputs annotated SNPs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
Optional Arguments None
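
Because snpMapper reads its VCF input from STDIN, it is normally invoked with shell redirection. The sketch below wraps that call in a hypothetical helper, annotate_snps, and adds a second hypothetical helper, print_info, that prints the INFO column (column 8 of a VCF record) so the annotations can be inspected. It assumes snpMapper writes the annotated VCF to STDOUT, which this section does not state explicitly; all file names are placeholders.

```shell
# Hypothetical helper: run snpMapper on a VCF file.
# Assumption: snpMapper writes the annotated VCF to STDOUT.
# usage: annotate_snps annotation.interval annotation.fa in.vcf out.vcf
annotate_snps() {
    snpMapper "$1" "$2" < "$3" > "$4"
}

# Hypothetical helper: print the INFO column (column 8) of every record,
# skipping meta and header lines, which start with '#'.
# usage: print_info out.vcf
print_info() {
    grep -v '^#' "$1" | cut -f 8
}
```

The same redirection pattern would apply to the other mapper programs that take a VCF from STDIN.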

indelMapper

indelMapper is a program to annotate a set of indels in VCF format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

Usage
indelMapper <annotation.interval> <annotation.fa>
Inputs Takes a VCF input from STDIN
Outputs Outputs annotated indels in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
Optional Arguments None.

svMapper

svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene.

Usage
svMapper <annotation.interval>
Inputs Takes a VCF input from STDIN
Outputs Outputs annotated SVs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
Optional Arguments None.

genericMapper

genericMapper is a program to annotate a number of different variants in VCF format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

Usage
genericMapper <annotation.interval> <nameFeature>
Inputs Takes a VCF input from STDIN
Outputs Outputs the annotated variants in VCF format. The annotation information is captured as part of the INFO field.
Required arguments
  • annotation.interval - Annotation file representing the genomic coordinates of the features in Interval format. This can be any generic Interval file; it does not have to describe gene models.
  • nameFeature - Specifies the type of the annotation feature (for example, promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
Optional arguments None.

vcfSummary

vcfSummary is a program to aggregate annotated variants across genes and samples.

Usage
vcfSummary <file.vcf.gz> <annotation.interval>
Inputs None
Outputs Generates two output files. The first file, named file.geneSummary.txt, contains the number of variants categorized by type for each gene. The second file, named file.sampleSummary.txt, contains the number of variants categorized by type for each sample.
Required arguments
  • file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using bgzip and indexed using the tabix program.
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
Optional arguments None.
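
The requirement that file.vcf.gz be bgzip-compressed and tabix-indexed can be met roughly as sketched below. Tabix expects position-sorted input, so the records are sorted first. The helper names sort_vcf and compress_and_index are hypothetical; `bgzip -c` and `tabix -p vcf` are the usual invocations of those tools, and both must be on the PATH.

```shell
# Hypothetical helper: print the '#' lines unchanged, then the records
# sorted by chromosome name and by numeric position.
# usage: sort_vcf file.vcf
sort_vcf() {
    grep '^#' "$1"
    grep -v '^#' "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}

# Hypothetical helper: write file.vcf.gz and its tabix index next to file.vcf.
# usage: compress_and_index file.vcf
compress_and_index() {
    sort_vcf "$1" | bgzip -c > "$1.gz" && tabix -p vcf "$1.gz"
}
```

After `compress_and_index file.vcf`, the resulting file.vcf.gz is the form expected by the programs that take a compressed, indexed VCF.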

vcf2images

vcf2images is a program to generate an image for each gene to visualize the effects of the annotated variants.

Usage
vcf2images <file.vcf.gz> <annotation.interval> <outputDir>
Inputs None.
Outputs Generates an image in PNG format for each gene that has at least one annotated variant.
Required arguments
  • file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using bgzip and indexed using the tabix program.
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • outputDir - The output directory where the images are stored
Optional Arguments None.

vcfSubsetByGene

vcfSubsetByGene is a program to subset a VCF file with annotated variants by gene.

Usage
vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>
Inputs None.
Outputs Generates a VCF file for each gene that has at least one annotated variant.
Required arguments
  • file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using bgzip and indexed using the tabix program.
  • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
  • outputDir - The output directory where VCF files are stored
Optional Arguments None.

vcfModifyHeader

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a VCF file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including vcfSummary).

Usage
vcfModifyHeader <oldHeader.vcf> <groups.txt>
Inputs None.
Outputs Generates a VCF header file.
Required arguments
  • oldHeader.vcf - The meta lines and header line of a VCF file. They can be extracted with the following command (the '^' anchor restricts the match to lines that start with '#'):
    grep '^#' file.vcf > file.header.vcf
  • groups.txt - A tab-delimited file that assigns each sample present in the VCF to a group/population. Here is a small sample file:
    HG00629 CHS
    HG00634 CHS
    HG00635 CHS
    HG00637 PUR
    HG00638 PUR
    HG00640 PUR
    NA06984 CEU
    NA06985 CEU
    NA06986 CEU
    NA06989 CEU
    NA06994 CEU
Optional arguments None.
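
Both inputs can be prepared and sanity-checked with standard tools before running vcfModifyHeader. The sketch below uses two hypothetical helpers: extract_header keeps only lines that start with '#', and check_groups verifies that every line of groups.txt has exactly two non-empty tab-separated fields. The two-column rule is an assumption based on the sample file above.

```shell
# Extract the meta lines and the header line (all lines starting with '#').
# usage: extract_header file.vcf > file.header.vcf
extract_header() {
    grep '^#' "$1"
}

# Hypothetical check: succeed only if every line has exactly two
# non-empty tab-separated fields (sample name, group name).
# usage: check_groups groups.txt
check_groups() {
    awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { bad = 1 } END { exit bad + 0 }' "$1"
}
```

A file whose columns are separated by spaces rather than tabs fails check_groups, which is a common cause of samples not being assigned to a group.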

gencode2interval

gencode2interval converts a GENCODE annotation file (in GTF format) to the Interval format.

Usage
gencode2interval
Inputs Takes a GENCODE annotation file in GTF format from STDIN
Outputs Outputs the GENCODE annotation file in Interval format to STDOUT
Required arguments None.
Optional arguments None.

Note: Remove all header lines in the annotation file before running gencode2interval. Also filter out coding transcripts that do not have an annotated start or stop as follows:

grep -v '^#' gencode.v19.annotation.gtf | awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/ {print}' | grep -v mRNA_end_NF | grep -v mRNA_start_NF > gencode.v19.annotation.filtered.gtf
gencode2interval < gencode.v19.annotation.filtered.gtf > gencode.v19.annotation.filtered.interval
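
To see what the filter keeps, it can be run on a few toy GTF lines. The sketch below wraps the same filter in a hypothetical function, filter_gtf, that reads a GTF from STDIN. The records produced by toy_gtf are made up for illustration and are not real GENCODE entries.

```shell
# Same filter as above, wrapped in a hypothetical function reading STDIN.
filter_gtf() {
    grep -v '^#' |
        awk '/\t(HAVANA|ENSEMBL)\t(CDS|start_codon|stop_codon)\t/' |
        grep -v mRNA_end_NF |
        grep -v mRNA_start_NF
}

# Made-up input: a header line, an exon record, a CDS record,
# and a CDS record tagged mRNA_start_NF.
toy_gtf() {
    printf '##description: toy example\n'
    printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tHAVANA\tCDS\t120\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tENSEMBL\tCDS\t300\t400\t.\t+\t0\tgene_id "G2"; transcript_id "T2"; tag "mRNA_start_NF";\n'
}

# Only the untagged CDS record of transcript T1 passes the filter.
toy_gtf | filter_gtf
```

The header line is dropped by the first grep, the exon record by the awk pattern, and the tagged CDS record by the final grep.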

interval2sequences

interval2sequences is a program to retrieve genomic or exonic sequences for an annotation set in Interval format.

Usage
interval2sequences <file.2bit> <file.annotation> <exonic|genomic>
Inputs None.
Outputs Reports the extracted sequences in FASTA format
Required arguments
  • file.2bit - genome reference sequence in 2bit format
  • file.annotation - annotation set in Interval format (each line represents one transcript)
  • < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
Optional arguments None.

Note: Run interval2sequences from a directory where you have write permission, since it may create temporary files in the current working directory.
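
One way to follow this note is to run the program from a throwaway scratch directory. The sketch below uses two hypothetical helpers, abs_path and extract_sequences. It assumes interval2sequences writes the FASTA output to STDOUT, which this section does not state explicitly; the paths are resolved to absolute form first because the working directory changes.

```shell
# Hypothetical helper: turn a path into an absolute path
# (the parent directory must already exist).
abs_path() {
    printf '%s/%s\n' "$(cd "$(dirname "$1")" && pwd)" "$(basename "$1")"
}

# Hypothetical helper: run interval2sequences inside a scratch directory.
# Assumption: the FASTA output is written to STDOUT.
# usage: extract_sequences file.2bit file.annotation exonic|genomic out.fa
extract_sequences() {
    twobit=$(abs_path "$1")
    annotation=$(abs_path "$2")
    mode=$3
    out=$(abs_path "$4")
    scratch=$(mktemp -d) || return 1
    # The subshell keeps the caller's working directory unchanged.
    (cd "$scratch" && interval2sequences "$twobit" "$annotation" "$mode" > "$out")
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

For example, `extract_sequences hg19.2bit gencode.v19.annotation.filtered.interval exonic annotation.fa` would produce the kind of FASTA file that snpMapper and indelMapper take as annotation.fa; hg19.2bit is a placeholder name.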

bgzip/tabix

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information, consult the tabix documentation page.

VCF tools

VCF tools is a suite of modules for manipulating VCF files. For more information, consult the documentation page.