
Perform Stacked Regression on Existing Prediction Models
Source: R/pred_stacked_regression.R
This function takes a set of existing prediction models and uses a new dataset to combine/aggregate them into a single 'meta-model', as described in Debray et al. 2014.
Usage
pred_stacked_regression(
  x,
  positivity_constraint = FALSE,
  new_data,
  binary_outcome = NULL,
  survival_time = NULL,
  event_indicator = NULL
)
Arguments
- x
an object of class "predinfo" produced by calling pred_input_info, containing information on at least two existing prediction models.
- positivity_constraint
TRUE/FALSE denoting if the weights within the stacked regression model should be constrained to be non-negative (TRUE) or should be allowed to take any value (FALSE). See details.
- new_data
data.frame upon which the prediction models should be aggregated.
- binary_outcome
Character variable giving the name of the column in new_data that represents the observed binary outcomes (should be coded 0 and 1 for non-event and event, respectively). Only relevant for x$model_type="logistic"; leave as NULL otherwise. Leave as NULL if new_data does not contain any outcomes.
- survival_time
Character variable giving the name of the column in new_data that represents the observed survival times. Only relevant for x$model_type="survival"; leave as NULL otherwise.
- event_indicator
Character variable giving the name of the column in new_data that represents the observed survival indicator (1 for event, 0 for censoring). Only relevant for x$model_type="survival"; leave as NULL otherwise.
Value
An object of class "predSR". This is the same as that detailed
in pred_input_info, with an added element containing the
estimates of the meta-model obtained by stacked regression.
Details
This function takes a set of (previously estimated) prediction models that were each originally developed for the same prediction task, and pools/aggregates these into a single prediction model (meta-model) using stacked regression based on new data (data not used to develop any of the existing models). The methodological details can be found in Debray et al. 2014.
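To make the idea concrete, the following is a minimal base R sketch of unconstrained stacked regression for a binary outcome. It is not the package's implementation: the data are simulated, and the two "existing" models and their coefficients are hypothetical values chosen for illustration. The linear predictor of each existing model is calculated in the new data, the outcome is regressed on those linear predictors, and the resulting weights are used to combine the existing coefficients into one set of meta-model coefficients.

```r
# Illustrative sketch only: simulated data and hypothetical coefficients.
set.seed(1)
n <- 2000
new_data <- data.frame(Age = rnorm(n, 60, 10),
                       Diabetes = rbinom(n, 1, 0.3))
true_lp <- -4 + 0.05 * new_data$Age + 0.6 * new_data$Diabetes
new_data$Y <- rbinom(n, 1, plogis(true_lp))

# Two hypothetical existing logistic models (one row per model)
existing <- rbind(
  model1 = c(Intercept = -3.5, Age = 0.04, Diabetes = 0.8),
  model2 = c(Intercept = -4.5, Age = 0.06, Diabetes = 0.3)
)

# Linear predictor of each existing model, evaluated in the new data
X <- cbind(Intercept = 1, Age = new_data$Age, Diabetes = new_data$Diabetes)
LP <- X %*% t(existing)
stack_data <- data.frame(Y = new_data$Y, LP1 = LP[, 1], LP2 = LP[, 2])

# Stacked regression: regress the observed outcome on the linear predictors
stack_fit <- glm(Y ~ LP1 + LP2, family = binomial(), data = stack_data)
w <- coef(stack_fit)  # stacking intercept and one weight per existing model

# Meta-model coefficients: weighted sum of the existing coefficients,
# with the stacking intercept added to the combined intercept
meta <- w[["LP1"]] * existing["model1", ] + w[["LP2"]] * existing["model2", ]
meta["Intercept"] <- meta["Intercept"] + w[["(Intercept)"]]
meta
```

The combined coefficients reproduce the stacked model's linear predictor exactly, which is why the meta-model can be reported as a single ordinary prediction model.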
Given that the existing models are likely to be highly collinear (since
they were each developed for the same prediction task), it has been
suggested to impose a positivity constraint on the weights of the stacked
regression model (Debray et al. 2014). If positivity_constraint is
set to TRUE, then the stacked regression model will be estimated by
optimising the (log-)likelihood using bound-constrained optimisation
("L-BFGS-B"). This is currently only implemented for logistic regression
models (i.e., if x$model_type="logistic"). For survival models, the
weights are left unconstrained, as with positivity_constraint = FALSE.
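As a rough illustration of what a positivity constraint involves, the sketch below fits a logistic stacked regression by minimising the negative log-likelihood with optim(method = "L-BFGS-B"), bounding the weights below at zero while leaving the intercept free. This is an assumption about the general approach, not a copy of the package's internal code; the linear predictors LP1 and LP2 and the outcome y are simulated.

```r
# Illustrative sketch only: simulated, highly correlated linear predictors.
set.seed(42)
n <- 1000
LP1 <- rnorm(n)
LP2 <- 0.9 * LP1 + rnorm(n, sd = 0.3)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * LP1))
Z <- cbind(1, LP1, LP2)  # design matrix: intercept plus linear predictors

# Negative log-likelihood of a logistic model with linear predictor Z %*% par
negloglik <- function(par) {
  eta <- as.vector(Z %*% par)
  -sum(y * eta - log1p(exp(eta)))
}

# Analytic gradient of the negative log-likelihood
neggrad <- function(par) {
  eta <- as.vector(Z %*% par)
  -as.vector(crossprod(Z, y - plogis(eta)))
}

# Bound-constrained optimisation: intercept unbounded, weights >= 0
fit <- optim(
  par = c(0, 0.5, 0.5),
  fn = negloglik,
  gr = neggrad,
  method = "L-BFGS-B",
  lower = c(-Inf, 0, 0)
)
sr_weights <- setNames(fit$par, c("(Intercept)", "LP1", "LP2"))
sr_weights
```

With collinear inputs, an unconstrained fit can assign a negative weight to one model to offset another; the lower bound rules that out, at the cost of possibly setting some weights to exactly zero.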
new_data should be a data.frame, where each row should be an
observation (e.g. patient) and each variable/column should be a predictor
variable. The predictor variables need to include (as a minimum) all of the
predictor variables that are included in the existing prediction models
(i.e., each of the variable names supplied to
pred_input_info, through the model_info parameter,
must match the name of a variable in new_data).
Any factor variables within new_data must be converted to dummy
(0/1) variables before calling this function. dummy_vars can
help with this. See pred_predict for examples.
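For readers who want to see what the conversion amounts to, here is a small base R sketch using model.matrix() rather than dummy_vars (whose exact output naming may differ). The data frame raw and its values are made up for illustration.

```r
# Illustrative sketch only: a small made-up data frame with one factor.
raw <- data.frame(
  Age = c(54, 61, 47, 70),
  Sex = factor(c("F", "M", "M", "F"))
)

# model.matrix() expands factors into 0/1 indicator columns; with the
# default treatment contrasts, the first level ("F") is the reference
mm <- model.matrix(~ Age + Sex, data = raw)

# Drop the intercept column so only predictor columns remain
new_data <- as.data.frame(mm[, colnames(mm) != "(Intercept)", drop = FALSE])
new_data
```

The resulting column is named SexM (factor name followed by level), which is the kind of name the existing model specifications would then need to use.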
binary_outcome, survival_time and event_indicator are
used to specify the outcome variable(s) within new_data (use
binary_outcome if x$model_type = "logistic", or use
survival_time and event_indicator if x$model_type =
"survival").
References
Debray, T.P., Koffijberg, H., Nieboer, D., Vergouwe, Y., Steyerberg, E.W. and Moons, K.G. (2014), Meta-analysis and aggregation of multiple published prediction models. Statistics in Medicine, 33: 2341-2362.
Examples
LogisticModels <- pred_input_info(model_type = "logistic",
                                  model_info = SYNPM$Existing_logistic_models)
SR <- pred_stacked_regression(x = LogisticModels,
                              new_data = SYNPM$ValidationData,
                              binary_outcome = "Y")
summary(SR)
#> Existing models aggregated using stacked regression
#> The model stacked regression weights are as follows:
#> (Intercept)         LP1         LP2         LP3
#>  0.02781941  0.46448799  0.15626108  0.16282116
#>
#> Updated Model Coefficients
#> =================================
#>   Intercept         Age      SexM Smoking_Status  Diabetes Creatinine
#> 1 -2.675134 0.005345728 0.1589948      0.5233706 0.2543348  0.4554044
#>
#> Model Functional Form
#> =================================
#> Age + SexM + Smoking_Status + Diabetes + Creatinine
# Survival model example:
TTModels <- pred_input_info(model_type = "survival",
                            model_info = SYNPM$Existing_TTE_models,
                            cum_hazard = list(SYNPM$TTE_mod1_baseline,
                                              SYNPM$TTE_mod2_baseline,
                                              SYNPM$TTE_mod3_baseline))
SR <- pred_stacked_regression(x = TTModels,
                              new_data = SYNPM$ValidationData,
                              survival_time = "ETime",
                              event_indicator = "Status")
summary(SR)
#> Existing models aggregated using stacked regression
#> The model stacked regression weights are as follows:
#>        LP1        LP2        LP3
#> -0.2707658  1.8832932 -0.4488339
#>
#> The new model baseline cumulative hazard is:
#>           time       hazard
#> 1 2.021278e-06 5.338425e-06
#> 2 1.630775e-05 1.067721e-05
#> 3 3.600450e-05 1.601631e-05
#> 4 4.006704e-05 2.135604e-05
#> 5 6.484743e-05 2.669604e-05
#> 6 1.216613e-04 3.203626e-05
#> ...
#>
#> Updated Model Coefficients
#> =================================
#>          Age      SexM Smoking_Status  Diabetes Creatinine
#> 1 0.03363821 0.2725367      0.5354202 0.1595384  0.3142822
#>
#> Model Functional Form
#> =================================
#> Age + SexM + Smoking_Status + Diabetes + Creatinine