bigdatagenomics / deca

Distributed exome CNV analyzer. Apache 2 licensed.

Version Matrix

DECA: Distributed Exome CNV Analyzer

Introduction

DECA is a distributed re-implementation of the XHMM exome CNV caller using ADAM and Apache Spark.

Getting Started

Installation

Note: These instructions are shared with other tools that build on ADAM.

Building from Source

You will need to have Maven installed in order to build DECA.

Note: The default configuration is for Hadoop 2.7.3. If building against a different version of Hadoop, please edit the build configuration in the <properties> section of the pom.xml file.

$ git clone https://github.com/.../deca.git
$ cd deca
$ export MAVEN_OPTS="-Xmx512m"
$ mvn clean package

Installing Spark

You'll need to have a Spark release on your system and the $SPARK_HOME environment variable pointing at it; prebuilt binaries can be downloaded from the Spark website. DECA has been developed and tested with Spark 2.1.0 built against Hadoop 2.7 with Scala 2.11, but any more recent Spark distribution should likely work.

Helpful Scripts

The bin/deca-submit script wraps the spark-submit commands to set up and launch DECA.

Commands

$ deca-submit

Usage: deca-submit [<spark-args> --] <deca-args> [-version]

Choose one of the following commands:

             normalize : Normalize XHMM read-depth matrix
              coverage : Generate XHMM read depth matrix from read data
              discover : Call CNVs from normalized read matrix
normalize_and_discover : Normalize XHMM read-depth matrix and discover CNVs
                   cnv : Discover CNVs from raw read data

You can learn more about a command, by calling it without arguments or with --help, e.g.

$ deca-submit normalize_and_discover --help
 -I VAL                   : The XHMM read depth matrix
 -cnv_rate N              : CNV rate (p). Defaults to 1e-8.
 -exclude_targets STRING  : Path to file of targets (chr:start-end) to be excluded from analysis
 -fixed_pc_toremove INT   : Fixed number of principal components to remove if defined. Defaults to undefined.
 -h (-help, --help, -?)   : Print help
 -initial_k_fraction N    : Set initial k to fraction of max components. Defaults to 0.10.
 -max_sample_mean_RD N    : Maximum sample mean read depth prior to normalization. Defaults to 200.
 -max_sample_sd_RD N      : Maximum sample standard deviation of the read depth prior to normalization. Defaults to 150.
 -max_target_length N     : Maximum target length. Defaults to 10000.
 -max_target_mean_RD N    : Maximum target mean read depth prior to normalization. Defaults to 500.
 -max_target_sd_RD_star N : Maximum target standard deviation of the read depth after normalization. Defaults to 30.
 -mean_target_distance N  : Mean within-CNV target distance (D). Defaults to 70000.
 -mean_targets_cnv N      : Mean targets per CNV (T). Defaults to 6.
 -min_partitions INT      : Desired minimum number of partitions to be created when reading in XHMM matrix
 -min_sample_mean_RD N    : Minimum sample mean read depth prior to normalization. Defaults to 25.
 -min_some_quality N      : Min Q_SOME to discover a CNV. Defaults to 30.0.
 -min_target_length N     : Minimum target length. Defaults to 10.
 -min_target_mean_RD N    : Minimum target mean read depth prior to normalization. Defaults to 10.
 -o VAL                   : Path to write discovered CNVs as GFF3 file
 -print_metrics           : Print metrics to the log on completion
 -save_zscores STRING     : Path to write XHMM normalized, filtered, Z score matrix
 -zscore_threshold N      : Depth Z score threshold (M). Defaults to 3.

Using native library algebra libraries

Apache Spark includes the Netlib-Java library for high-performance linear algebra. Netlib-Java can invoke optimized BLAS and Lapack system libraries if available; however, many Spark distributions are built without Netlib-Java system library support. You may be able to use system libraries by including the DECA jar on the Spark driver classpath, e.g.

deca-submit --driver-class-path $DECA_JAR ...

or you may need to rebuild Spark as described in the Spark MLlib guide.

If you see the following warning messages in the log file, you have not successfully invoked the system libraries:

WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.neltlib.NativeRefARPACK

To build DECA with the optimized netlib native code in, you will need to invoke the native-lgpl profile when running Maven:

mvn package -P native-lgpl

We cannot package this code by default, as netlib is licensed under the LGPL and cannot be bundled in Apache 2 licensed code.

Example Usage

Running DECA in "stand-alone" mode on a workstation

A small dataset (30 samples by 300 targets) is distributed as part of the XHMM tutorial. An example DECA command to call CNVs from the pre-computed read-depth matrix and related files on a 16-core workstation with 128 GB RAM is below. Note that you will need to set the DECA_JAR environment variable to point to the jar file created by mvn package, set spark.local.dir to a suitable temporary directory for your system and likely need to change the executor and driver memory to suitable values for your system. The exclude_targets.txt and DATA.RD.txt files from the XHMM tutorial data are also distributed as part of the DECA test resources in the deca-core/src/test/resources/ directory.

From within the unzip'd RUN directory, prepare exclude_targets.txt:

cat low_complexity_targets.txt extreme_gc_targets.txt | sort -u > exclude_targets.txt

then run DECA:

deca-submit \
--master local[16] \
--driver-class-path $DECA_JAR \
--conf spark.local.dir=/path/to/temp/directory \
--conf spark.driver.maxResultSize=0 \
--conf spark.kryo.registrationRequired=true \
--executor-memory 96G --driver-memory 16G \
-- normalize_and_discover \
-min_some_quality 29.5 \
-exclude_targets exclude_targets.txt \
-I DATA.RD.txt \
-o DECA.gff3

The resulting GFF3 file should contain

22      HG00121 DEL     18898402        18913235        9.167771318038923       .       .       END_TARGET=117;START_TARGET=104;Q_SOME=90;Q_START=8;Q_STOP=4;Q_EXACT=9;Q_NON_DIPLOID=90
22      HG00113 DUP     17071768        17073440        25.32122306047942       .       .       END_TARGET=11;START_TARGET=4;Q_SOME=99;Q_START=53;Q_STOP=25;Q_EXACT=25;Q_NON_DIPLOID=99

The exlude_targets.txt file is the unique combination of the extreme_gc_targets.txt and low_complexity_targets.txt files provided in the tutorial data. The min_some_quality parameter is set to 29.5 to mimic XHMM behavior which uses a default minimum SOME quality of 30 after rounding (while DECA applies the filter prior to rounding). Depending on your particular computing environment, you may need to modify the spark-submit configuration parameters. spark.driver.maxResultSize is set to 0 (unlimited) to address errors collecting larger amounts of data to the driver.

The corresponding xcnv output from XHMM is:

SAMPLE  CNV     INTERVAL        KB      CHR     MID_BP  TARGETS NUM_TARG        Q_EXACT Q_SOME  Q_NON_DIPLOID   Q_START Q_STOP  MEAN_RD MEAN_ORIG_RD
HG00121 DEL     22:18898402-18913235    14.83   22      18905818        104..117        14      9       90      90      8       4       -2.51   37.99
HG00113 DUP     22:17071768-17073440    1.67    22      17072604        4..11   8       25      99      99      53      25      4.00    197.73

To call CNVs from the original BAM files:

deca-submit \
--master local[16] \
--driver-class-path $DECA_JAR \
--conf spark.local.dir=/path/to/temp/directory \
--conf spark.driver.maxResultSize=0 \
--conf spark.kryo.registrationRequired=true \
--executor-memory 96G --driver-memory 16G \
-- coverage \
-L EXOME.interval_list \
-I *.bam 
-o DECA.RD.txt

followed by the normalize_and_discovery command above (with DECA.RD.txt as the input). DECA's coverage calculation is designed to match the output of the GATK DepthOfCoverage command specified in the XHMM protocol, i.e. count fragment depth with zero minimum base quality.

Running DECA on a YARN cluster

The equivalent example command to call CNVs on a YARN cluster with Spark dynamic allocation would be:

deca-submit \
	--master yarn \
	--deploy-mode cluster \
	--num-executors 1 \
	--executor-memory 72G \
	--executor-cores 5 \
	--driver-memory 72G \
	--driver-cores 5 \
	--conf spark.driver.maxResultSize=0 \
	--conf spark.yarn.executor.memoryOverhead=4096 \
	--conf spark.yarn.driver.memoryOverhead=4096 \
	--conf spark.kryo.registrationRequired=true \
	--conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=$(( 8 * 1024 * 1024 )) \
	--conf spark.default.parallelism=10 \
	--conf spark.dynamicAllocation.enabled=true \
	-- normalize_and_discover \
	-min_partitions 10 \
	-exclude_targets "hdfs://path/to/exclude_targets.txt" \
	-min_some_quality 29.5 \
	-I "hdfs://path/to/DATA.RD.txt" \
	-o "hdfs://path/to/DECA.gff3"

Note that many of the parameters above, e.g. driver and executor cores and memory, are specific to a particular cluster environment and would likely need to be modified for other environments.

Running DECA using Toil on a workstation or AWS

We provide Toil workflows that allow DECA to be run either on a local computer or on a cluster on the Amazon Web Services (AWS) cloud. These workflows are written in Python and package DECA, Apache Spark, and Apache Hadoop using Docker containers. This packaging automates the setup of Apache Spark, reducing the barrier-to-entry for using DECA. To run either workflow, the user will need to install Toil. To run the AWS workflow, the user will additionally need to follow the AWS setup instructions.

Note: Support is currently limited to Python 2. Python 3 support is forthcoming.

Installing the DECA Workflows

Once Toil has been installed, the user will need to download and install the bdgenomics.workflows package, which contains the DECA workflows.

Installing from PyPI

For maximum convenience, bdgenomics.workflows is pip installable:

pip install bdgenomics.workflows==0.1.0

Installing from source

To install this package, run make develop:

git clone https://github.com/bigdatagenomics/workflows
cd workflows
make develop

This step should be run inside of a Python virtualenv. If run locally, this step should be run inside of the same virtualenv that Toil was installed into. If run on AWS, this step should be run inside of a virtualenv that was created on the Toil AWS autoscaling cluster.

Input Files

The DECA workflow takes two inputs:

  1. A feature file that defines the regions over which to call copy number variants. This file can be formatted using any of the BED, GTF/GFF2, GFF3, Interval List, or NarrowPeak formats. In the AWS workflow, the ADAM Parquet Feature format is also supported.
  2. A manifest file that contains paths to a set of sorted BAM files. Each file must have a scheme listed. In local mode, the file://, http://, and ftp:// schemes are supported. On AWS, the s3a://, http://, and ftp:// schemes are supported. S3a is an overlay over the AWS Simple Storage System (S3) cloud data store which is provided by Apache Hadoop.

Running Locally

To run locally, we invoke the following command:

bdg-deca \
  --targets <regions> \
  --samples <manifest> \
  --output-dir <path-to-save> \
  --memory <memory-in-GB> \
  --run-local \
  file:<toil-jobstore-path>

This command will run in Toil’s single machine mode, and will save the CNV calls to <path-to-save>/cnvs.gff. <toil-jobstore-path> is the path to a temporary directory where Toil will save intermediate files. The <memory-in-GB> parameter should be specified without units; e.g., to allocate 20GB of memory, pass "--memory 20".

Running on AWS

To run on AWS, we rely on Toil’s AWS provisioner, which starts a cluster on the AWS cloud. Toil’s AWS provisioner runs on top of Apache Mesos and supports dynamically scaling the number of nodes in the cluster to the amount of tasks being run. First, create a Toil cluster on AWS.

Once the Toil cluster has launched, SSH onto the cluster, following the instructions provided in the Toil/AWS documentation. To install bdgenomics.workflows, run:

apt-get update
apt-get install git
git clone https://github.com/bigdatagenomics/workflows.git
cd workflows
virtualenv --system-site-packages venv
. venv/bin/activate
make develop

To run the DECA workflow, invoke the following command:

bdg-deca \
  --targets <regions> \
  --samples <manifest> \
  --output-dir <path-to-save> \
  --memory <memory-in-GB> \
  --provisioner aws \
  --batchSystem mesos \
  --mesosMaster $(hostname -i):5050 \
  --nodeType <type> \
  --num-nodes <spark-workers + 1> \
  --minNodes <spark-workers + 2> \
  aws:<region>:<toil-jobstore>

Toil will launch a cluster with spark-workers + 2 worker nodes to run this workflow. For optimal performance, we recommend choosing a number of Apache Spark worker nodes such that you have no less than 256MB of data per core. All file paths used in AWS mode must be files stored in AWS’s S3 storage system, and must have an s3a:// URI scheme.

License

DECA is released under an Apache 2.0 license.