Perform Stacked Regression on Existing Prediction Models
Source: R/pred_stacked_regression.R
pred_stacked_regression.Rd
This function takes a set of existing prediction models, and uses the new dataset to combine/aggregate them into a single 'meta-model', as described in Debray et al. 2014.
Usage
pred_stacked_regression(
  x,
  positivity_constraint = FALSE,
  new_data,
  binary_outcome = NULL,
  survival_time = NULL,
  event_indicator = NULL
)
Arguments
- x
an object of class "predinfo" produced by calling pred_input_info, containing information on at least two existing prediction models.
- positivity_constraint
TRUE/FALSE denoting if the weights within the stacked regression model should be constrained to be non-negative (TRUE) or should be allowed to take any value (FALSE). See details.
- new_data
data.frame upon which the prediction models should be aggregated.
- binary_outcome
Character variable giving the name of the column in new_data that represents the observed binary outcomes (should be coded 0 and 1 for non-event and event, respectively). Only relevant for x$model_type = "logistic"; leave as NULL otherwise. Leave as NULL if new_data does not contain any outcomes.
- survival_time
Character variable giving the name of the column in new_data that represents the observed survival times. Only relevant for x$model_type = "survival"; leave as NULL otherwise.
- event_indicator
Character variable giving the name of the column in new_data that represents the observed survival indicator (1 for event, 0 for censoring). Only relevant for x$model_type = "survival"; leave as NULL otherwise.
Value
An object of class "predSR". This is the same as that detailed in pred_input_info, with an added element containing the estimates of the meta-model obtained by stacked regression.
Details
This function takes a set of (previously estimated) prediction models that were each originally developed for the same prediction task, and pools/aggregates these into a single prediction model (meta-model) using stacked regression based on new data (data not used to develop any of the existing models). The methodological details can be found in Debray et al. 2014.
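The idea of stacked regression for a binary outcome can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the data are simulated, and the two "existing models" (lp1, lp2) and their coefficients are hypothetical. The outcome in the new data is regressed on the linear predictors of the existing models, and the resulting weights are used to collapse everything into one set of coefficients on the original predictors.

```r
# Minimal sketch of (unconstrained) stacked regression for a binary outcome.
# All data and model coefficients below are simulated/hypothetical.
set.seed(1)
n <- 2000
age <- rnorm(n, mean = 60, sd = 10)
smoker <- rbinom(n, size = 1, prob = 0.3)

# Simulated outcome in the "new data"
true_lp <- -5 + 0.06 * age + 0.7 * smoker
y <- rbinom(n, size = 1, prob = plogis(true_lp))

# Linear predictors of two hypothetical existing models
lp1 <- -4.0 + 0.05 * age + 0.5 * smoker
lp2 <- -6.0 + 0.08 * age

# Stacked regression: regress the outcome on the existing linear predictors
stack_fit <- glm(y ~ lp1 + lp2, family = binomial())
w <- coef(stack_fit)

# Collapse the weighted models into a single meta-model on the original predictors
meta_coefs <- c(
  Intercept = unname(w["(Intercept)"] + w["lp1"] * -4.0 + w["lp2"] * -6.0),
  age       = unname(w["lp1"] * 0.05 + w["lp2"] * 0.08),
  smoker    = unname(w["lp1"] * 0.5)
)

# Linear predictor of the meta-model, computed from the collapsed coefficients
meta_lp <- meta_coefs["Intercept"] + meta_coefs["age"] * age +
  meta_coefs["smoker"] * smoker
```

By construction, meta_lp matches the linear predictor of the stacked fit, so the meta-model can be reported and applied like any single logistic regression model.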
Given that the existing models are likely to be highly co-linear (since they were each developed for the same prediction task), it has been suggested to impose a positivity constraint on the weights of the stacked regression model (Debray et al. 2014). If positivity_constraint is set to TRUE, then the stacked regression model will be estimated by optimising the (log-)likelihood using bound constrained optimisation ("L-BFGS-B"). This is currently only implemented for logistic regression models (i.e., if x$model_type = "logistic"). For survival models, positivity_constraint = FALSE.
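A bound-constrained fit of this kind can be sketched with base R's optim(). The sketch below is an assumption about how such an estimate could be obtained, not a copy of the package's implementation: the data, the linear predictors lp1 and lp2, and the starting values are all hypothetical. The intercept is left unconstrained while the model weights are bounded below at zero.

```r
# Sketch: stacked logistic regression with non-negative weights via L-BFGS-B.
# Data and existing-model linear predictors are simulated/hypothetical.
set.seed(42)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x1 + 0.4 * x2))

# Two highly correlated linear predictors from hypothetical existing models
lp1 <- -1.0 + 0.9 * x1 + 0.3 * x2
lp2 <- -0.8 + 0.7 * x1 + 0.5 * x2
X <- cbind(1, lp1, lp2)

# Negative Bernoulli log-likelihood of the stacked model
neg_loglik <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log1p(exp(eta)))
}

# Gradient of the negative log-likelihood
neg_grad <- function(beta) {
  eta <- drop(X %*% beta)
  -drop(crossprod(X, y - plogis(eta)))
}

# Intercept is free; the weights on lp1 and lp2 are constrained to be >= 0
fit <- optim(
  par = c(0, 0.5, 0.5),
  fn = neg_loglik,
  gr = neg_grad,
  method = "L-BFGS-B",
  lower = c(-Inf, 0, 0)
)
weights <- setNames(fit$par, c("(Intercept)", "LP1", "LP2"))
```

With collinear linear predictors, an unconstrained fit can assign a negative weight to one model to offset another; the lower bound rules that out, at the cost of possibly setting some weights exactly to zero.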
new_data should be a data.frame, where each row should be an observation (e.g. patient) and each variable/column should be a predictor variable. The predictor variables need to include (as a minimum) all of the predictor variables that are included in the existing prediction models (i.e., each of the variable names supplied to pred_input_info, through the model_info parameter, must match the name of a variable in new_data).
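A quick pre-flight check of this name-matching requirement might look like the following. The variable names and the small data.frame are hypothetical and only illustrate the comparison; in practice model_vars would hold the predictor names used by the existing models.

```r
# Hypothetical predictor names used by the existing models
model_vars <- c("Age", "SexM", "Smoking_Status")

# Hypothetical new data, deliberately missing the Smoking_Status column
new_data <- data.frame(
  Age  = c(54, 61, 47),
  SexM = c(1, 0, 1),
  Y    = c(0, 1, 0)
)

# Predictors required by the models but absent from new_data
missing_vars <- setdiff(model_vars, names(new_data))
if (length(missing_vars) > 0) {
  message("Missing from new_data: ", paste(missing_vars, collapse = ", "))
}
```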
Any factor variables within new_data must be converted to dummy (0/1) variables before calling this function. dummy_vars can help with this. See pred_predict for examples.
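For readers who want to see what the conversion involves, here is a base R sketch using model.matrix() rather than the package's dummy_vars helper. The data.frame is hypothetical, and the sketch assumes R's default treatment contrasts, under which the first factor level ("F") is the reference and a single SexM column is created.

```r
# Hypothetical new data containing a factor variable
new_data <- data.frame(
  Age = c(54, 61, 47, 70),
  Sex = factor(c("M", "F", "M", "F")),
  Smoking_Status = c(1, 0, 0, 1)
)

# Build 0/1 dummy columns for the factor and drop the intercept column
dummies <- model.matrix(~ Sex, data = new_data)[, -1, drop = FALSE]

# Replace the factor column with its dummy-coded version
new_data_dummy <- cbind(
  new_data[, setdiff(names(new_data), "Sex"), drop = FALSE],
  dummies
)
```

The resulting column name (SexM) then needs to match the variable name used in the existing models' model_info.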
binary_outcome, survival_time and event_indicator are used to specify the outcome variable(s) within new_data (use binary_outcome if x$model_type = "logistic", or use survival_time and event_indicator if x$model_type = "survival").
References
Debray, T.P., Koffijberg, H., Nieboer, D., Vergouwe, Y., Steyerberg, E.W. and Moons, K.G. (2014), Meta-analysis and aggregation of multiple published prediction models. Statistics in Medicine, 33: 2341-2362
Examples
LogisticModels <- pred_input_info(model_type = "logistic",
                                  model_info = SYNPM$Existing_logistic_models)
SR <- pred_stacked_regression(x = LogisticModels,
                              new_data = SYNPM$ValidationData,
                              binary_outcome = "Y")
summary(SR)
#> Existing models aggregated using stacked regression
#> The model stacked regression weights are as follows:
#> (Intercept)         LP1         LP2         LP3
#>  0.02781941  0.46448799  0.15626108  0.16282116
#>
#> Updated Model Coefficients
#> =================================
#>   Intercept         Age      SexM Smoking_Status  Diabetes Creatinine
#> 1 -2.675134 0.005345728 0.1589948      0.5233706 0.2543348  0.4554044
#>
#> Model Functional Form
#> =================================
#> Age + SexM + Smoking_Status + Diabetes + Creatinine
# Survival model example:
TTModels <- pred_input_info(model_type = "survival",
                            model_info = SYNPM$Existing_TTE_models,
                            cum_hazard = list(SYNPM$TTE_mod1_baseline,
                                              SYNPM$TTE_mod2_baseline,
                                              SYNPM$TTE_mod3_baseline))
SR <- pred_stacked_regression(x = TTModels,
                              new_data = SYNPM$ValidationData,
                              survival_time = "ETime",
                              event_indicator = "Status")
summary(SR)
#> Existing models aggregated using stacked regression
#> The model stacked regression weights are as follows:
#>        LP1        LP2        LP3
#> -0.2707658  1.8832932 -0.4488339
#>
#> The new model baseline cumulative hazard is:
#>           time       hazard
#> 1 2.021278e-06 5.338425e-06
#> 2 1.630775e-05 1.067721e-05
#> 3 3.600450e-05 1.601631e-05
#> 4 4.006704e-05 2.135604e-05
#> 5 6.484743e-05 2.669604e-05
#> 6 1.216613e-04 3.203626e-05
#> ...
#>
#> Updated Model Coefficients
#> =================================
#>          Age      SexM Smoking_Status  Diabetes Creatinine
#> 1 0.03363821 0.2725367      0.5354202 0.1595384  0.3142822
#>
#> Model Functional Form
#> =================================
#> Age + SexM + Smoking_Status + Diabetes + Creatinine