How to run Cromwell for Scientific Workflows on Azure

How to run Cromwell for Scientific Workflows on Azure


Cromwell on Azure. This video demonstrates how to run Cromwell on Azure. Working with genomic data involves many computational steps to convert raw sequencing reads into scientific insight. To manage this, the Broad Institute of MIT and Harvard developed Cromwell, an open-source workflow management system for scientific workflows. Cromwell reads workflow definitions written in the WDL format. Cromwell on Azure is an open source solution that configures all Azure resources needed to run workflows through Cromwell on the Azure cloud. It supports authenticated access via an abstraction layer that supports Azure AAD and direct access via the Cromwell API. Cromwell on Azure uses the GA4GH Task Execution Service backend for orchestrating the tasks that make up a workflow. It sets up a VM host to run the Cromwell server and uses Azure Batch to spin up the virtual machines that run each task in a workflow. This Quickstart walks through how to install Cromwell on Azure and run a sample workflow. First, download the installation executable from GitHub Configure the installation at the command line with your subscription ID, your preferred region, and an Identifier Prefix that will be used to name your Cromwell on azure resource group. Once installed, Cromwell on Azure configures a single resource group with all necessary resources, which you can view on the Azure Portal. This includes: It includes the virtual machine the VM host, which runs the Cromwell server. It includes the virtual machine, disk, network interface o public IP address, and virtual network. The Batch account is connected to the VM host and spins up the virtual machines that run each task in a workflow. The Storage account is mounted to the VM host and stores all inputs and outputs. Application Insights contains all workflow logs to enable task level debugging. Cosmos DB stores information and metadata about each task in each workflow. Now, we’ll show you how to run a sample workflow workflow for converting fastq files to ubam format for chromosome 21. First, upload your input files. I’ve uploaded our paired end reads in fastq format, the wdl file, and the inputs json which specifies the location of our inputs. Next, generate a trigger JSON file, a file that Cromwell on Azure uses to note the input paths and to initiate the workflow. Upload the trigger file in the new folder of the workflows container to initiate the workflow. Cromwell appends a workflow ID to the trigger JSON file name and transfers the file to the “in progress” directory. Cromwell on Azure uses Azure Batch to optimize which VMs are used. Once your workflow is completed it will move to the succeeded directory. You can find the Cromwell output files in the Cromwell-execution folder. Here you see the ubam file that has been created. Other outputs from the Cromwell endpoint, including metadata, outputs, and the timing file can be found in the outputs folder. You can use the timing file to see how long each of your tasks took and to optimize your workflow. In conclusion, you can leverage the Microsoft Cromwell on Azure solution to power your genomics workflows, from sample to answer.

Leave a Reply

Your email address will not be published. Required fields are marked *