This is a command line tool for "flattening" a VCF (4+) file down to simpler TSV files. Essentially, it takes the information from the INFO column and from the sample columns and spreads them out into their own, separate columns.
Before installing vcflatten, you will need to have installed Java (version 1.6+) and BASH (BASH is already installed on Mac OS X and comes with most Linux distributions' base install).
After this, you can grab the latest release as a ZIP file from GitHub or just use the link below.
The ZIP file contains the application and all its dependencies as a single JAR, along with a BASH script to run it. You will probably need to mark the script as executable.
Alternatively, the following commands will download the latest release, unzip it, and display the help text:
$ wget https://github.com/downloads/innovativemedicine/vcfimp/vcflatten-0.5.2.zip
$ unzip vcflatten-0.5.2.zip
$ chmod +x vcflatten-0.5.2/bin/vcflatten
$ ./vcflatten-0.5.2/bin/vcflatten --help
To show the purpose of vcflatten we will use an example. Say you have a VCF file that looks like the following (omitting the required metadata header):
Suppose you want to flatten this so you can view the AA, AN, and AC INFO fields for each variant, along with the GT and GL data for each sample. You can then run vcflatten as follows (note: the delimiters used to separate the INFO fields and the sample fields are the same as those used in the VCF file for the INFO and FORMAT columns, respectively):
$ vcflatten --info 'AA;AN;AC' --genotype 'GT:GL' ex.vcf.gz
This command will produce 2 new TSV files, 1 for each of the 2 samples.
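To make the idea of flattening concrete, here is a minimal Python sketch of what happens to a single record. It is only an illustration of the concept, not vcflatten's actual implementation (which is written in Scala). The function names, the sample record values, and the choice of "." for a missing field are all assumptions made for this example.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'AA=T;AN=6' into a dict."""
    result = {}
    if info == ".":  # '.' means the INFO column is empty
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flag fields (e.g. 'DB') carry no '=value' part; "1" is an
        # arbitrary marker chosen for this sketch.
        result[key] = value if sep else "1"
    return result


def flatten_info(info, fields, missing="."):
    """Return the requested INFO fields as a list of column values."""
    parsed = parse_info(info)
    return [parsed.get(field, missing) for field in fields]


def flatten_sample(format_col, sample_col, fields, missing="."):
    """Return the requested FORMAT fields for one sample column."""
    parsed = dict(zip(format_col.split(":"), sample_col.split(":")))
    return [parsed.get(field, missing) for field in fields]


# Hypothetical record; the field lists are split with the same
# delimiters as the --info and --genotype arguments shown above.
info_cols = flatten_info("NS=3;AN=6;AC=2;AA=T", "AA;AN;AC".split(";"))
sample_cols = flatten_sample("GT:GQ:GL", "0|1:48:-3.1,-0.2,-4.0", "GT:GL".split(":"))
print("\t".join(info_cols + sample_cols))
```

Each requested field ends up in its own tab-separated column, in the order it was requested, regardless of the order in which it appears in the VCF record.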
You can also flatten a VCF file with multiple samples into a single file with the --one-file command line switch. In this case, an extra sample column will be added to the output file, so you can determine which sample a particular row belongs to. Using the example above, we could run:
$ vcflatten --info 'AA;AN;AC' --genotype 'GT:GL' --one-file ex.vcf.gz
This will produce a single TSV file containing the rows for both samples.
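As a rough sketch of what a combined file amounts to, the Python snippet below prefixes each sample's flattened row with the sample name. The function name, the position of the sample column, and the row values are assumptions made for illustration; the real layout produced by vcflatten may differ.

```python
def one_file_rows(sample_names, info_cols, per_sample_cols):
    """Yield one combined row per sample, prefixed with the sample name."""
    for name, sample_cols in zip(sample_names, per_sample_cols):
        # The INFO columns are shared by every sample of the same variant.
        yield [name] + info_cols + sample_cols


# Hypothetical flattened values for one variant and two samples.
rows = list(one_file_rows(
    ["HG00096", "HG00097"],
    ["T", "6", "2"],
    [["0|1", "-3.1,-0.2,-4.0"], ["0|0", "-0.1,-2.5,-5.0"]],
))
for row in rows:
    print("\t".join(row))
```

With the sample name in a column of its own, rows from different samples can share one file and still be filtered or grouped by sample afterwards.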
There are more options available; to see them, run vcflatten with the --help command line switch.
If an input file is not provided, vcflatten will read from standard input. This means you can pipe the output of other tools into vcflatten. However, because vcflatten may create many output files from a single input file, it doesn't currently provide a mechanism to write the output to standard output. An option to do this, in conjunction with the --one-file switch, should be available soon.
As an example, if you wish to only include certain samples, I'd suggest you pipe the VCF file through vcf-subset (part of the vcftools package) first:
$ vcf-subset -c HG00096,HG00097 ex-full.vcf.gz | vcflatten
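The "read a file if one is given, otherwise read standard input" behaviour described above is a common pattern for command line tools. The Python sketch below shows one way it could look; it is not taken from vcflatten's source. The helper names are hypothetical, and treating standard input as uncompressed text is an assumption of this sketch.

```python
import gzip
import io
import sys


def open_vcf(path=None, stdin=sys.stdin):
    """Return a text handle for the given path, or stdin if no path is given."""
    if path is None:
        return stdin  # assumed to be uncompressed text in this sketch
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decodes the gzip stream as text
    return open(path, "r")


def data_lines(handle):
    """Yield the variant rows, skipping '##' metadata and the '#CHROM' header."""
    for line in handle:
        if not line.startswith("#"):
            yield line.rstrip("\n")


# Simulate piped input with an in-memory stream instead of a real pipe.
piped = io.StringIO("##fileformat=VCFv4.1\n#CHROM\tPOS\n20\t14370\n")
variants = list(data_lines(open_vcf(None, stdin=piped)))
print(variants)
```

Because the rest of the program only sees a line iterator, it does not need to know whether the data came from a file or from an upstream tool such as vcf-subset.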
The source code (Scala) for vcflatten is available on GitHub. It is actually a sub-project of a larger vcfimp project, which includes a VCF parser written in Scala. Instructions for building from the source code can be found in the README on the project page.