Analyzing large datasets with bootstrap penalization

Biom J. 2017 Mar;59(2):358-376. doi: 10.1002/bimj.201600052. Epub 2016 Nov 21.

Abstract

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a "big computer" with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a "small computer". The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks, and the proposed approach then consists of two sequential steps: in each step and for each bootstrap sample, we select blocks of covariates and run penalization, and the results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then take a weighted average over multiple bootstrap samples. For data with a large p and a large n, a natural combination of the two previous methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
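
To make the large-n strategy concrete, the sketch below (not the authors' R package) illustrates the general idea in R: a small number of subjects is bootstrapped, a penalized regression (here, the lasso via glmnet) is fitted to each subsample, and coefficient estimates are pooled across bootstrap samples. The subsample size m, the number of bootstrap replicates B, and the use of equal pooling weights are illustrative assumptions; the paper itself describes a weighted average.

    library(glmnet)

    set.seed(1)
    n <- 10000; p <- 20                    # large n, small to moderate p
    x <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(1, 5), rep(0, p - 5))
    y <- drop(x %*% beta) + rnorm(n)

    B <- 50                                # number of bootstrap samples
    m <- 500                               # subjects drawn per bootstrap sample
    coefs <- matrix(0, nrow = p + 1, ncol = B)

    for (b in seq_len(B)) {
      idx <- sample(n, m, replace = TRUE)  # bootstrap a small set of subjects
      fit <- cv.glmnet(x[idx, ], y[idx])   # penalized (lasso) fit on the small subsample
      coefs[, b] <- as.numeric(as.matrix(coef(fit, s = "lambda.min")))
    }

    # Pool over bootstrap samples; equal weights are used here for simplicity,
    # whereas the paper uses a weighted average over bootstrap samples.
    beta_hat <- rowMeans(coefs)
    round(beta_hat, 2)                     # intercept followed by p slope estimates

Because each bootstrap fit touches only m of the n subjects and the fits are independent, the loop can be distributed across machines or cores, which is the computational point of the approach.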

Keywords: Bootstrap; Computational feasibility; Large datasets; Penalization.

MeSH terms

  • Computer Simulation*
  • Data Interpretation, Statistical*
  • Humans
  • Models, Statistical
  • Sample Size
  • Software