TY - JOUR
T1 - Fast Cross-validation for Multi-penalty High-dimensional Ridge Regression
AU - van de Wiel, Mark A.
AU - van Nee, Mirrelijn M.
AU - Rauschenberger, Armin
N1 - Publisher Copyright:
© 2021 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
PY - 2021
Y1 - 2021
N2 - High-dimensional prediction with multiple data types needs to account for potentially strong differences in predictive signal. Ridge regression is a simple model for high-dimensional data that has challenged the predictive performance of many more complex models and learners, and that allows inclusion of data type-specific penalties. The largest challenge for multi-penalty ridge is to optimize these penalties efficiently in a cross-validation (CV) setting, in particular for generalized linear model (GLM) and Cox ridge regression, which require an additional estimation loop by iterative weighted least squares (IWLS). Our main contribution is a computationally highly efficient formula for the multi-penalty, sample-weighted hat matrix, as used in the IWLS algorithm. As a result, nearly all computations are performed in low-dimensional space, yielding a speed-up of several orders of magnitude. We develop a flexible framework that accommodates multiple response types, unpenalized covariates, several performance criteria, and repeated CV. Extensions to paired and preferential data types are included and illustrated on several cancer genomics survival prediction problems. Moreover, we present similar computational shortcuts for maximum marginal likelihood and Bayesian probit regression. The corresponding R package, multiridge, serves as a versatile standalone tool, but also as a fast benchmark for other, more complex models and multi-view learners. Supplementary materials for this article are available online.
AB - High-dimensional prediction with multiple data types needs to account for potentially strong differences in predictive signal. Ridge regression is a simple model for high-dimensional data that has challenged the predictive performance of many more complex models and learners, and that allows inclusion of data type-specific penalties. The largest challenge for multi-penalty ridge is to optimize these penalties efficiently in a cross-validation (CV) setting, in particular for generalized linear model (GLM) and Cox ridge regression, which require an additional estimation loop by iterative weighted least squares (IWLS). Our main contribution is a computationally highly efficient formula for the multi-penalty, sample-weighted hat matrix, as used in the IWLS algorithm. As a result, nearly all computations are performed in low-dimensional space, yielding a speed-up of several orders of magnitude. We develop a flexible framework that accommodates multiple response types, unpenalized covariates, several performance criteria, and repeated CV. Extensions to paired and preferential data types are included and illustrated on several cancer genomics survival prediction problems. Moreover, we present similar computational shortcuts for maximum marginal likelihood and Bayesian probit regression. The corresponding R package, multiridge, serves as a versatile standalone tool, but also as a fast benchmark for other, more complex models and multi-view learners. Supplementary materials for this article are available online.
KW - Cancer genomics
KW - High-dimensional prediction
KW - Iterative weighted least squares
KW - Marginal likelihood
KW - Multi-view learning
UR - http://www.scopus.com/inward/record.url?scp=85106316569&partnerID=8YFLogxK
U2 - 10.1080/10618600.2021.1904962
DO - 10.1080/10618600.2021.1904962
M3 - Article
AN - SCOPUS:85106316569
SN - 1061-8600
VL - 30
SP - 835
EP - 847
JO - Journal of Computational and Graphical Statistics
JF - Journal of Computational and Graphical Statistics
IS - 4
ER -