vcflatten

This is a command line tool for "flattening" a VCF (4+) file down to simpler TSV files. Essentially, it takes the information from the INFO column and from the sample columns and spreads them out into their own, separate columns.

Installing vcflatten

Before installing vcflatten, you will need to have installed Java (version 1.6+) and BASH (BASH is already installed on Mac OS X and comes with most Linux distributions' base install).

After this, you can grab the latest release as a ZIP file from GitHub or just use the link below.

The ZIP file contains the application and all its dependencies as a single JAR, along with a BASH script to run it. You will probably need to mark the script as executable.

Alternatively, the following commands will download the latest release, unzip it, and display the help text:

$ wget https://github.com/downloads/innovativemedicine/vcfimp/vcflatten-0.5.2.zip
$ unzip vcflatten-0.5.2.zip
$ chmod +x vcflatten-0.5.2/bin/vcflatten
$ ./vcflatten-0.5.2/bin/vcflatten --help

Using vcflatten

To show the purpose of vcflatten we will use an example. Say you have a VCF file that looks like the following (omitting the required metadata header):

Input
File: ex.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO (abbreviated) FORMAT HG00096 HG00097
13 32889669 rs55880202 C T 100 PASS AA=C;AN=2184;LDAF=0.0102;... GT:DS:GL 0|0:0.000:-0.18,-0.47,-2.41 0|0:0.000:-0.48,-0.48,-0.48
13 32889792 rs206118 A G 100 PASS AN=2184;AC=341;VT=SNP;... GT:DS:GL 0|0:0.000:-0.10,-0.68,-4.70 1|0:0.850:-0.04,-1.01,-5.00
13 32889968 rs206119 G A 100 PASS AVGPOST=0.9291;AN=2184;... GT:DS:GL 1|1:2.000:-5.00,-0.91,-0.06 1|1:2.000:-5.00,-1.84,-0.01

Suppose you wish to flatten this down so you can view the AA, AN, and AC INFO fields for each variant, along with the GT and GL data for each sample. You can then run vcflatten as follows (note: the delimiters used to separate the INFO fields and the sample data are the same as those used in the VCF file for the INFO and FORMAT columns, respectively):

$ vcflatten --info 'AA;AN;AC' --genotype 'GT:GL' ex.vcf.gz

This command will produce two new TSV files, one for each of the two samples.

Output
File: ex.vcf.gz-HG00096-1.tsv
#CHROM POS ID REF ALT QUAL FILTER AA AN AC GT GL
13 32889669 rs55880202 C T 100 PASS C 2184 19 0|0 -0.18,-0.47,-2.41
13 32889792 rs206118 A G 100 PASS A 2184 341 0|0 -0.1,-0.68,-4.7
13 32889968 rs206119 G A 100 PASS A 2184 1602 1|1 -5.0,-0.91,-0.06
Output
File: ex.vcf.gz-HG00097-2.tsv
#CHROM POS ID REF ALT QUAL FILTER AA AN AC GT GL
13 32889669 rs55880202 C T 100 PASS C 2184 19 0|0 -0.48,-0.48,-0.48
13 32889792 rs206118 A G 100 PASS A 2184 341 1|0 -0.04,-1.01,-5.0
13 32889968 rs206119 G A 100 PASS A 2184 1602 1|1 -5.0,-1.84,-0.01
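
The flattening shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not vcflatten's actual implementation (which is written in Scala). The function name flatten_line, its parameters, and the use of "." for a field that is absent from a line are all assumptions made for this sketch. The example line uses only the INFO fields that are visible in the abbreviated input above, so AA comes out as "." here, and values are copied verbatim rather than reformatted.

```python
def flatten_line(line, sample_names, info_keys, genotype_keys):
    """Flatten one tab-separated VCF data line into one row per sample.

    Hypothetical helper for illustration; not part of vcflatten.
    """
    fields = line.rstrip("\n").split("\t")
    # The first 7 columns (CHROM..FILTER) are copied through unchanged.
    fixed, info, fmt, samples = fields[:7], fields[7], fields[8], fields[9:]

    # INFO is a ';'-separated list of KEY=VALUE pairs (or bare flags).
    info_map = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_map[key] = value
    # Assumption: a requested key that is missing is written as ".".
    info_cols = [info_map.get(key, ".") for key in info_keys]

    # FORMAT names the ':'-separated values found in each sample column.
    format_keys = fmt.split(":")
    rows = {}
    for name, sample in zip(sample_names, samples):
        values = dict(zip(format_keys, sample.split(":")))
        genotype_cols = [values.get(key, ".") for key in genotype_keys]
        rows[name] = fixed + info_cols + genotype_cols
    return rows


line = "\t".join([
    "13", "32889792", "rs206118", "A", "G", "100", "PASS",
    "AN=2184;AC=341;VT=SNP", "GT:DS:GL",
    "0|0:0.000:-0.10,-0.68,-4.70", "1|0:0.850:-0.04,-1.01,-5.00",
])
rows = flatten_line(line, ["HG00096", "HG00097"],
                    ["AA", "AN", "AC"], ["GT", "GL"])
for name, row in rows.items():
    print(name, "\t".join(row))
```

Each sample gets its own row, which corresponds to vcflatten writing one TSV file per sample by default.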

You can also flatten a VCF file with multiple samples into a single file with the --one-file command line switch. In this case, an extra SAMPLE column will be added to the output file, so you can determine which sample a particular row belongs to. Using the example above, we could run:

$ vcflatten --info 'AA;AN;AC' --genotype 'GT:GL' --one-file ex.vcf.gz

This will produce the following file:
Output
File: ex.vcf.gz-all-1.tsv
#CHROM POS ID REF ALT QUAL FILTER AA AN AC SAMPLE GT GL
13 32889669 rs55880202 C T 100 PASS C 2184 19 HG00096 0|0 -0.18,-0.47,-2.41
13 32889669 rs55880202 C T 100 PASS C 2184 19 HG00097 0|0 -0.48,-0.48,-0.48
13 32889792 rs206118 A G 100 PASS A 2184 341 HG00096 0|0 -0.1,-0.68,-4.7
13 32889792 rs206118 A G 100 PASS A 2184 341 HG00097 1|0 -0.04,-1.01,-5.0
13 32889968 rs206119 G A 100 PASS A 2184 1602 HG00096 1|1 -5.0,-0.91,-0.06
13 32889968 rs206119 G A 100 PASS A 2184 1602 HG00097 1|1 -5.0,-1.84,-0.01
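
As a rough sketch of how the SAMPLE column might be used downstream, the following Python snippet groups the rows of a --one-file style table by sample. It assumes the file is tab-separated with a header line like the one shown above; the function name rows_by_sample is hypothetical and the data is a small excerpt of the example output.

```python
import csv
import io
from collections import defaultdict


def rows_by_sample(tsv_text):
    """Group the rows of a one-file TSV by the value of its SAMPLE column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["SAMPLE"]].append(row)
    return dict(grouped)


# Two rows taken from the example output, joined with tabs.
example = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tAA\tAN\tAC\tSAMPLE\tGT\tGL\n"
    "13\t32889669\trs55880202\tC\tT\t100\tPASS\tC\t2184\t19\tHG00096\t0|0\t-0.18,-0.47,-2.41\n"
    "13\t32889669\trs55880202\tC\tT\t100\tPASS\tC\t2184\t19\tHG00097\t0|0\t-0.48,-0.48,-0.48\n"
)
grouped = rows_by_sample(example)
for sample, rows in grouped.items():
    print(sample, [row["GT"] for row in rows])
```

To read a real output file instead, the same DictReader call can be applied to an open file handle.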

There are more options available; to see them, run vcflatten with the --help command line switch.

Integrating with Other Tools

If an input file is not provided, vcflatten will read from standard input. This means you can pipe the output of other tools into vcflatten. However, because vcflatten may create many output files from a single input file, it doesn't currently provide a mechanism to write the output to standard output. An option to do this, in conjunction with the --one-file switch, should be available soon.

As an example, if you wish to only include certain samples, I'd suggest you pipe the VCF file through vcf-subset (part of the vcftools package) first:

$ vcf-subset -c HG00096,HG00097 ex-full.vcf.gz | vcflatten

Getting the Source

The source code (Scala) for vcflatten is available on GitHub. It is actually a sub-project of a larger vcfimp project, which includes a VCF parser written in Scala. Instructions for building from the source code can be found in the README on the project page.