2016 EMANUEL & CAROL PARZEN PRIZE FOR STATISTICAL INNOVATION
Proudly Presented to
Shanti S. Gupta Distinguished Professor of Statistics
Courtesy Professor of Computer Science
Department of Statistics
“Divide & Recombine for Bigger Data and Higher Computational Complexity”
Computational performance is a challenge today. Datasets can be big, the computational complexity of analytic methods can be high, and computer hardware power can be limited. Divide & Recombine (D&R) is a statistical approach to meeting these challenges. The analyst divides the data into subsets by a D&R division method. Each analytic method is applied to each subset independently, with no communication among the subset computations. The outputs of each analytic method are then recombined by a D&R recombination method. Sometimes the goal is one result for all of the data, such as a logistic regression fit; in that case, D&R theory and methods seek division and recombination methods that optimize statistical accuracy. Much more common in practice is a division based on the subject matter: the data are divided by conditioning on variables important to the analysis. In this case the outputs can be the final result, or further analysis can be carried out on the outputs, which is an analytic recombination.
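The divide, apply, and recombine steps for the "one result for all of the data" case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it substitutes a simple least-squares line for the logistic regression mentioned above, uses random division and plain averaging of subset estimates as one possible recombination (not necessarily an optimal one), and every name in it (divide, fit_line, recombine) is hypothetical rather than part of datadr or DeltaRho.

```python
import random
from statistics import mean


def divide(rows, n_subsets, seed=0):
    """Random division (an assumed choice): shuffle, then deal rows into subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Analytic method applied to one subset: least-squares slope and intercept."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def recombine(outputs):
    """Recombination (an assumed choice): average the subset estimates."""
    slopes, intercepts = zip(*outputs)
    return mean(slopes), mean(intercepts)


# Synthetic data for illustration only: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)))

subsets = divide(data, n_subsets=20)
outputs = [fit_line(s) for s in subsets]   # each subset is fit independently
dr_slope, dr_intercept = recombine(outputs)
all_slope, all_intercept = fit_line(data)  # all-data fit, for comparison

print(f"D&R estimate:      slope={dr_slope:.4f}, intercept={dr_intercept:.4f}")
print(f"All-data estimate: slope={all_slope:.4f}, intercept={all_intercept:.4f}")
```

Because no subset fit depends on any other, the list comprehension over subsets could be handed to any parallel backend without changing the division or recombination code.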
D&R computation is mostly embarrassingly parallel, the simplest form of parallel computation. DeltaRho software is an open-source implementation of D&R. The front end is the R package datadr, a language for D&R that makes programming D&R simple. At the back end, running on a cluster, is a distributed database and parallel compute engine such as Hadoop, which spreads subsets and outputs across the cluster and executes the analyst's R and datadr code in parallel. The R package RHIPE provides communication between datadr and Hadoop. DeltaRho shields the analyst from managing the parallel computation and the database.
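To make "embarrassingly parallel" concrete, the following Python sketch divides records by a conditioning variable and applies a summary to each subset in a worker pool. It is a toy stand-in, not the DeltaRho stack: a local thread pool plays the role of the cluster back end, the records and the field names site and value are invented for illustration, and divide_by and summarize are hypothetical helpers rather than datadr functions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def divide_by(records, key):
    """Conditioning-variable division: one subset per distinct value of `key`."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return dict(subsets)


def summarize(subset):
    """Analytic method applied to one subset; it needs no other subset's data."""
    values = [rec["value"] for rec in subset]
    return {"n": len(values), "mean": mean(values)}


# Invented example records (hypothetical field names).
records = [
    {"site": "A", "value": 1.0},
    {"site": "A", "value": 3.0},
    {"site": "B", "value": 10.0},
    {"site": "B", "value": 20.0},
    {"site": "B", "value": 30.0},
    {"site": "C", "value": 5.0},
]

subsets = divide_by(records, "site")

# A local thread pool stands in for the cluster here (an assumption of this sketch).
# pool.map returns results in input order, so keys and outputs line up.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = dict(zip(subsets, pool.map(summarize, subsets.values())))

for site, out in outputs.items():
    print(site, out)
```

Here the per-site outputs are themselves the final result; an analytic recombination would carry out further analysis on the outputs dictionary.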
D&R with DeltaRho can dramatically increase the data size and the computational complexity of the analytic methods that can be handled on a cluster, whether the hardware power of the cluster is small, medium, or large. The data can even have a memory size larger than the physical memory of the cluster.