caBIG in Action: Speeding research with data analysis workflows

September 11, 2009

Solving the complex mysteries of cancer requires interdisciplinary analysis of large, multi-dimensional data sets, often including genomic (SNP and copy number), gene expression and clinical data. These analyses are typically complex and time-consuming, and they must be conducted on many data sets to provide reliable results. Most researchers agree on the need for repeatable, scientifically validated workflows that are simple to design and develop without programming skills and that can be easily shared among researchers. Such workflows will speed analysis of large data sets and improve the precision of the results by reusing validated analytical processes.

Since its inception, the goal of caBIG has been to reduce the burden of cancer by enabling the type of collaborative research that these workflows support. Ravi Madduri, a computer scientist at the University of Chicago and long-time participant in the caBIG program, has been working for the last two years with a team at the University of Manchester (UK) to develop a solution to this challenge: a user-friendly tool for creating reusable analysis workflows.

"Understanding that data sets involved in cancer research are huge, we wanted to create a means by which we could automate the analysis process and then share those automations with others who are likely running similar analyses," he explains.

Choosing a weapon for the war on cancer

After evaluating several available tools, his team settled on Taverna, a widely accepted free software tool developed by the myGrid project. The key benefit of Taverna is that it was designed to help researchers with limited programming skills and limited resources create bioinformatics workflows that can be saved, modified, reused and shared with other scientists.

"We chose Taverna after trying other options and not getting the results we wanted in terms of user independence," says Madduri. "A bonus was that Taverna was already popular in the bioinformatics world as a workflow engine, since the user interface is very intuitive and you don't have to manually edit XML documents."

Madduri noted that feedback from users in the research community has been positive, but that his team is working to make the workflow tools and services even simpler to use.

Showing promise...and progress

So far Madduri and his team have created workflows that connect caArray, a microarray data management program, with geWorkbench, an extensible toolkit for a wide variety of gene expression analyses, to run automated gene expression queries, including, for example, predicting lymphoma types. Additional workflows automatically retrieve protein annotations from multiple sources by connecting cancer Bioinformatics Infrastructure Objects (caBIO), the Computational Proteomics Analysis System (CPAS), and the Grid Protein Information Resource (PIR). More sophisticated analyses, such as automated predictions of lymphoma outcomes based on gene expression clustering, have also been developed. Users can view and download various caBIG workflows at http://www.myexperiment.org/search?query=cabig&type=workflows, execute them using the caGrid workflow tools, extend them by adding their own analysis routines, or incorporate them into their own workflows.

Madduri's enthusiasm for this work and the benefits it provides to the research community is obvious in everything he does. "Using Taverna, members of the caBIG community can upload their workflow into caGrid where others can find the routine, run it from the portal and find the results on their own data sets. This reduces the amount of work done by end-users by leveraging the expertise of others, and improves the precision and comparability of work by reusing validated workflows. Along the way we've learned that we need to make sure all solutions we create are based on the needs of end-users, rather than just trying to force a cool tool on researchers."

Additional Resources