
Abstract

Background:

There have been multiple efforts toward individual prediction of recurrent strokes based on structured clinical and imaging data using machine learning algorithms. Some of these efforts resulted in relatively accurate prediction models. However, acquiring clinical and imaging data is typically possible at provider sites only and is associated with additional costs. Therefore, we developed recurrent stroke prediction models based solely on data easily obtained from the patient at home.

Methods:

Data from 384 patients with ischemic stroke were obtained from the Erlangen Stroke Registry. Patients were followed at 3 and 12 months after first stroke and then annually, for about 2 years on average. Multiple machine learning algorithms were applied to train predictive models for estimating individual risk of recurrent stroke within 1 year. Double nested cross-validation was utilized for conservative performance estimation and models’ learning capabilities were assessed by learning curves. Predicted probabilities were calibrated, and relative variable importance was assessed using explainable artificial intelligence techniques.

Results:

The best model achieved an area under the curve of 0.70 (95% CI, 0.64–0.76) and relatively good probability calibration. The most predictive factors included patient’s family and housing circumstances, rehabilitative measures, age, high calorie diet, systolic and diastolic blood pressures, percutaneous endoscopic gastrostomy, number of family doctor’s home visits, and patient’s mental state.

Conclusions:

Developing fairly accurate models for individual risk prediction of recurrent ischemic stroke within 1 year solely based on registry data is feasible. Such models could be applied in a home setting to provide an initial risk assessment and identify high-risk patients early.


In 2019, stroke was the third most common cause of disability over all ages globally, being responsible for 5.7% (5.1–6.2) of all-cause disability-adjusted life-years (DALYs).1 This represents an increase of 32.4% (22.0–42.2) in DALYs compared with 1990.1 The disease burden is even more striking for the population above 50 years of age, in which stroke was globally the second most common cause of disability.1
Patients who have had a stroke are at increased risk of recurrence,1 emphasizing the need for closer risk monitoring and timely therapy adjustments in this population. Early identification of patients at increased risk of recurrence increases the opportunity for stroke prevention, as special attention and available resources can be devoted to such patients. To achieve this, accurate risk prediction models are necessary. Some clinical risk scores have been developed,2,3,4 2 of which were applied to assess the risk of recurrent stroke within 1 year.5,6 Their predictive performance varied significantly depending on evaluation methodologies, derivation populations, outcomes, and prediction time horizons, impeding their direct comparison.7 For instance, the Stroke Prognosis Instrument II (SPI-II) was developed for patients with nondisabling ischemic stroke or transient ischemic attack to predict the combined outcome of stroke or death within 2 years,2 while the California Risk Score was derived to predict the stroke risk within 90 days following transient ischemic attack.3
The most prominent score developed for estimating the short-term (at 90 days) risk of recurrent ischemic stroke is the Recurrence Risk Estimator (RRE-90).4 It reached area under the receiver operating characteristic curve (AUROC) values of 0.70 to 0.82.7 Custom models for short-term recurrent stroke prediction were trained for specific subpopulations of patients with large artery disease8 and atrial fibrillation9 using Cox regression and logistic regression, respectively, showing moderate AUROC values of 0.62 to 0.70.
The best-known scores to estimate the long-term (1 year) risk of ischemic stroke recurrence are the Essen Stroke Risk Score (ESRS)5 and the modified ESRS.6 ESRS is based on patient age, several comorbidities (including hypertension, diabetes, etc), previous myocardial infarction, and smoking status. The modified ESRS was created by including sex, stroke subtype by etiology, and waist circumference. While not developed for short-term predictions,10 ESRS did not show good performance at 1 year either (AUROC, 0.56 [95% CI, 0.40–0.64]).11 These poor-to-moderate results of the state-of-the-art scores, as well as the estimated pooled cumulative risk of ischemic stroke recurrence within 1 year of 11.1% (95% CI, 9.0–13.3),12 underline the importance of developing an accurate tool for long-term predictions. Such a tool would enable patient risk stratification and targeted assignment of scarce health care resources to the high-risk patients.
Attempts have been made to identify predictors of recurrent stroke using Cox regression without developing a prediction model.13,14 Logistic regression was used with only clinical and imaging variables (AUROC, 0.71), only retinal characteristics (AUROC, 0.65), and both (AUROC, 0.74), whereby performance was measured on the same data used for model development (no separate test data).15 Nonlinear machine learning algorithms reached accuracy of 0.65 (details about variables and algorithms used and AUROC not reported).16 Neural networks achieved AUROC of 0.77 with interquartile range of 0.68 to 0.84 in 10 runs of 5-fold cross-validation, relying on clinical and imaging data and focusing on transient ischemic attacks and minor strokes.17 The majority class (no recurrent stroke within 1 year) was randomly undersampled before, rather than within, cross-validation, which introduced bias. We investigated the feasibility of predicting individual risk of recurrent ischemic stroke within 1 year using long-term data from a population-based registry.

Methods

Study Design and Participants

We analyzed anonymized data of 384 patients from the Erlangen Stroke Registry (ESPro).18 The study was approved by the Ethics Committee of the Medical Faculty of Friedrich-Alexander University Erlangen-Nürnberg (Reference number: 249_15 Bc). Written informed consent to participate was given by patients or their legal representatives. ESPro is an ongoing, population-based, prospective, longitudinal regional study focusing on stroke and vascular dementia. The study was started in 1994 and currently covers the population of 112 385 inhabitants of Erlangen in Northern Bavaria, Germany. ESPro comprises data of 10 000 cerebrovascular events (status on September 27, 2021) (both hospitalized and outpatient) with about 1500 annual follow-ups, making it the largest stroke registry in Germany. Patients are followed at 3 and 12 months after the initial stroke event and then annually. Data on 250 variables, including demographics, comorbidities, interactions with the health care system, and limited clinical parameters are collected during interviews with patients or their representatives. The methodology and characteristics of the ESPro population are described elsewhere.18 Because of the sensitive nature of the data collected for this study, requests to access the dataset from qualified researchers trained in human subject confidentiality protocols may be sent to the Interdisciplinary Center for Health Technology Assessment (HTA) and Public Health, Friedrich-Alexander University Erlangen-Nürnberg at [email protected]. Because of local data privacy requirements, only data of patients who had expired before the start of this study were included in the analysis. Data preparation, model development, and evaluation were performed in the Python 3.7.4 programming language and the corresponding packages including pandas 1.0.1, scikit-learn 0.22.1, and imblearn 0.6.2. The code ownership stays with the industrial project partner.

Data Preparation

The unit of analysis in this modeling study was a patient follow-up. The outcome to be predicted was binary: occurrence of recurrent ischemic stroke within 1 year from the follow-up. This time horizon was selected based on the available data (predominantly annual follow-ups) but also due to the lack of accurate long-term risk predictors. The data were collected at 1189 follow-ups of 384 patients (mean number of follow-ups per patient, 3.09 [95% CI, 2.97–3.21]), with mean age 78.8 [95% CI, 77.8–79.8] years; 201 (52.3%) were female (Table 1). Three additional potential predictors were computed: dynamic patient age (for each follow-up, age at baseline plus the relative time between baseline date and follow-up date), time since last stroke, and number of previous strokes. After absolute dates and 113 variables with >60% missing values were removed,19 141 variables remained and were included in the expert variable selection (Table S1). Two stroke experts selected 93 potential predictors on the basis of the quality indicators for Stroke Care of the German Stroke Registers Study.20 Categorical variables were dummy encoded (one binary variable for each category), and numerical variables with >40% missing values were replaced with missing value indicators, which specify whether a value was missing (1) or not (0). The remaining numerical missing values were imputed iteratively as described in the model development and evaluation section. After removing redundant variables and categorizing the dosage of acetylsalicylic acid and the Barthel index, the final dataset included 119 predictors (10 numerical and 109 binary). All feature selection steps up to this point were done in an unsupervised manner (ie, the target variable was not used) and were therefore performed within data preparation before modeling.
As the last step of feature selection was a supervised one (chi-squared test), it was included in the machine learning pipeline to avoid data leakage as described in the next section. Details of the data preparation workflow are given in Figure S1.
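As a minimal sketch of the two encoding steps described above (dummy encoding of categorical variables and missing-value indicators for sparsely observed numerical variables), the following pandas snippet uses invented variable names; it is illustrative only and does not reproduce the registry's actual variables:

```python
import numpy as np
import pandas as pd

# Toy follow-up table; variable names are invented for illustration and are
# not the registry's actual variables.
df = pd.DataFrame({
    "living_situation": ["alone", "family", "home", "family"],
    "bp_treatment_years": [np.nan, 5.0, np.nan, np.nan],  # sparsely observed
    "age": [78, 81, 69, 74],
})

# Dummy encoding: one binary variable for each category
dummies = pd.get_dummies(df["living_situation"], prefix="living")

# Numerical variables with >40% missing values become missing-value
# indicators (1 = value was missing, 0 = value was present)
if df["bp_treatment_years"].isna().mean() > 0.40:
    df["bp_treatment_years_missing"] = df["bp_treatment_years"].isna().astype(int)
    df = df.drop(columns=["bp_treatment_years"])

prepared = pd.concat([df.drop(columns=["living_situation"]), dummies], axis=1)
```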
Table 1. Baseline Characteristics
Characteristic*	Included patients with stroke
Age, y	78.8 (10.0)
Sex
 Male	183 (47.7)
 Female	201 (52.3)
Body mass index, kg/m²	25.1 (4.3)
TOAST classification
 Large artery atherosclerosis	22 (5.7)
 Cardioembolism	106 (27.6)
 Small artery occlusion	85 (22.1)
 Other determined	2 (0.6)
 Undetermined	169 (44.0)
Barthel index	11.8 (7.7)
* Data are presented as number (%) or mean (SD). TOAST indicates Trial of ORG 10172 in Acute Stroke Treatment.

Model Development and Evaluation

Since the outcome was binary, the prediction task was treated as a binary classification problem with the following classes: (1) recurrence and (2) no recurrence within 1 year from the follow-up. For this reason, and also to increase comparability with related work, the selected algorithm performance metric was the AUROC. Three major challenges were addressed in the analysis: class imbalance, missing data, and the curse of dimensionality. The classes were highly skewed, with only 89 recurrent strokes recorded within 1 year (7.49%). Five machine learning approaches were deployed to handle class imbalance: random undersampling of the majority class,21 synthetic oversampling of the minority class using the Synthetic Minority Oversampling Technique algorithm,21 cost-sensitive learning21 (assigning higher penalty to misclassification of the minority class, COST), anomaly detection algorithms (treating the minority class as an anomaly to be detected),22 and balanced learning algorithms,23 where class balancing is embedded in the learning algorithm.
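As an illustration of one of these strategies, cost-sensitive learning (COST) can be sketched in scikit-learn via the `class_weight` option, which penalizes errors on each class inversely proportionally to its frequency; the synthetic data below only mimic the roughly 7.5% recurrence rate and are not the registry data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic, imbalanced toy data: ~92.5% majority class, ~7.5% minority class
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.925], random_state=0)

# Cost-sensitive learning: "balanced" reweights misclassification cost
# inversely proportionally to class frequencies
clf = LinearSVC(class_weight="balanced", max_iter=20000).fit(X, y)
predictions = clf.predict(X)
```

Without the `class_weight` adjustment, a linear classifier on data this skewed tends to predict the majority class almost exclusively; the analogous resampling strategies (random undersampling, Synthetic Minority Oversampling Technique) are available in the imblearn package used by the study.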
The percentage of missing values in numerical variables varied from 0% in patient age to 59% in high blood pressure treatment duration. Those numerical variables with >40% of missing values were replaced with missing value indicators as described in the data preparation step. Other numerical variables, containing 40% or less missing values, were imputed iteratively within the double nested cross-validation procedure to avoid any data leakage. The imputation threshold of 40% was set to avoid imputing the majority of (missing) observations based on the minority of them.
Iterative imputation was performed by modeling each variable with missing values as a function of other variables using regularized linear regression and applying those models to estimate missing values.24 Just like random undersampling and Synthetic Minority Oversampling Technique resampling models, imputation models were trained only on the training data subsets to avoid data leakage. Imputed numerical variables were standardized to reach zero-mean and unit variance.
The curse of dimensionality relates to the problem of having a high number of variables for a given number of data points (follow-ups). In this case, the data are sparse, and learning algorithms can discover incorrect patterns. This problem was addressed by dimensionality reduction (ie, the χ2 test was used to select the 10 most relevant binary variables, matching their number to the 10 already available numerical variables). By allowing no more than 20 variables in the final model, at least 50 observations per variable were available to the learning algorithm, additionally reducing the chance of overfitting.
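The χ2 selection step can be sketched with scikit-learn's `SelectKBest`; the binary toy predictors below are synthetic, with the outcome driven by the first column:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X_binary = rng.integers(0, 2, size=(200, 30))   # 30 binary candidate predictors
y = X_binary[:, 0] ^ (rng.random(200) < 0.1)    # outcome = column 0, 10% noise

# Keep the 10 binary variables most associated with the outcome (chi-squared)
selector = SelectKBest(chi2, k=10).fit(X_binary, y)
X_selected = selector.transform(X_binary)
```

Because this step is supervised (it uses the outcome), the study places it inside the cross-validation pipeline so that selection is refitted on each training fold only.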
After addressing these challenges, the machine learning algorithm was chosen. According to the no free lunch theorem, any 2 algorithms are equivalent when their performance is averaged across all possible problems.25 An extensive set of 25 learning algorithms was evaluated, many combined with random undersampling, Synthetic Minority Oversampling Technique, and COST, resulting in 52 tested approaches (Table 2). Most algorithms had specific hyperparameters, such as the regularization parameter of logistic regression, which were tuned during model development (Table S4). The developed models are dynamic in that they can re-estimate the risk whenever new data become available.
Table 2. Results of Applied Machine Learning Algorithms in Combination With Different Strategies for Treating Class Imbalance
Learning algorithm*	RUS	SMOTE	COST	ADBLA
Most-frequent dummy	0.50 (0.00)	0.50 (0.00)	–	–
Logistic regression	0.58 (0.15)	0.66 (0.09)	0.56 (0.11)	–
Naïve Bayes	0.56 (0.09)	0.63 (0.08)	–	–
Linear SVM	0.59 (0.13)	0.70 (0.07)	0.53 (0.09)	–
Ridge classifier	0.59 (0.14)	0.68 (0.08)	0.64 (0.12)	–
Linear discriminant analysis	0.60 (0.11)	0.67 (0.09)	–	–
Decision tree	0.52 (0.05)	0.59 (0.05)	0.58 (0.05)	–
k-nearest neighbors	0.55 (0.09)	0.58 (0.06)	–	–
Nonlinear SVM	0.57 (0.12)	0.56 (0.06)	0.59 (0.11)	–
Multi-layer perceptron	0.56 (0.08)	0.61 (0.03)	–	–
Gaussian process classifier	0.51 (0.04)	0.52 (0.10)	–	–
Random forest	0.57 (0.08)	0.65 (0.09)	0.64 (0.08)	–
Extra trees	0.60 (0.11)	0.64 (0.09)	0.64 (0.07)	–
AdaBoost	0.54 (0.04)	0.65 (0.07)	–	–
XGBoost	0.50 (0.06)	0.61 (0.05)	0.59 (0.07)	–
Stacking meta-classifier	0.58 (0.13)	0.61 (0.05)	–	–
Voting classifier	0.58 (0.09)	0.64 (0.06)	–	–
SGD classifier	0.55 (0.11)	0.67 (0.04)	0.54 (0.10)	–
Elliptic envelope	–	–	–	0.55 (0.06)
One-class SVM	–	–	–	0.49 (0.11)
Isolation forest	–	–	–	0.57 (0.05)
Balanced bagging	–	–	–	0.57 (0.05)
Balanced random forest	–	–	–	0.62 (0.07)
Easy ensemble	–	–	–	0.64 (0.08)
RUSBoost	–	–	–	0.58 (0.07)
ADBLA indicates Anomaly Detection and Balanced Learning Algorithms; AUROC, area under the receiver operating characteristic curve; COST, cost-sensitive learning; RUS, random undersampling; SGD, stochastic gradient descent; SMOTE, synthetic minority oversampling technique; and SVM, support vector machine.
* Data are presented as mean (SD) of AUROC values over 5 folds of the outer double nested cross-validation loop.
The models employed predict not only the binary class label but also its probability, that is, the risk of recurrent stroke. An important tool for assessing the quality of predicted probabilities, especially in the presence of class imbalance, is the calibration plot. It shows how well the predicted stroke probabilities match the observed frequency of strokes. Calibration plots are created by grouping the predicted probabilities into a fixed number of groups and plotting the mean prediction for each group (x-axis) against the observed stroke frequency in that group (y-axis). The line x = y indicates a perfectly calibrated model. To improve probability calibration of the employed models, isotonic regression was applied before performance evaluation.26 To provide insights into the importance of variables for the prediction, SHapley Additive exPlanations (SHAP framework) was applied.27 Finally, learning curves were used to evaluate how the size of training data affects the model performance.
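Isotonic probability calibration of a margin-based classifier, plus the data behind a calibration plot, can be sketched with scikit-learn on synthetic imbalanced data (the classifier and parameters are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic, imbalanced toy data (not the registry data)
X, y = make_classification(n_samples=1500, weights=[0.9], random_state=0)

# Wrap the classifier so its decision scores are mapped to calibrated
# probabilities with isotonic regression, fitted on internal CV folds
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=20000),
                                    method="isotonic", cv=5).fit(X, y)
probabilities = calibrated.predict_proba(X)[:, 1]

# Calibration plot data: observed stroke frequency per bin (y-axis) vs
# mean predicted probability per bin (x-axis); x = y means perfect calibration
observed_freq, mean_predicted = calibration_curve(y, probabilities, n_bins=5)
```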
The machine learning pipeline (ie, the sequence of data processing steps), consisting of missing value imputation for numerical variables, χ2 selection of binary variables, resampling (where cost-sensitive, balanced learning, and anomaly detection algorithms were not applied), model training with hyperparameter optimization, and probability calibration, was validated using a double nested cross-validation protocol (Figure S2).28 At first, the whole dataset was randomly divided into the development set (1083 follow-ups of 345 patients with 81 recurrent strokes) and the hold-out set (106 follow-ups of 39 different patients with 8 recurrent strokes, which was used for calibration, variable importance, and additional performance evaluation). The development set was used in double nested cross-validation, consisting of 5 k-fold cross-validation loops (k=5 in each loop). These loops were used for the independent tasks of hyperparameter optimization, probability calibration, and unbiased, conservative performance estimation.28 To avoid data leakage, it was ensured that all follow-ups from the same patient were either in the training or the test set in each fold of the double nested cross-validation. The optimal decision threshold was determined using Youden’s J statistic.29 More technical details and the whole machine learning pipeline are given in Figure 1. To improve transparent reporting of the machine learning modeling approach, a completed MI-CLAIM checklist30 is provided as a separate supplement file.
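Two of these safeguards, patient-grouped cross-validation splits and threshold selection by Youden's J statistic, can be sketched as follows; the patient ids and model are invented for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GroupKFold

# Synthetic follow-ups with a hypothetical patient id per row; GroupKFold
# keeps all follow-ups of one patient in the same fold, preventing leakage
X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)
groups = np.repeat(np.arange(100), 3)  # 100 patients, 3 follow-ups each

aucs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # No patient appears in both the training and the test set of a fold
    assert not set(groups[train_idx]) & set(groups[test_idx])
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

# Youden's J statistic: choose the decision threshold maximizing tpr - fpr
fpr, tpr, thresholds = roc_curve(y, model.predict_proba(X)[:, 1])
best_threshold = thresholds[np.argmax(tpr - fpr)]
```

In the study the full pipeline (imputation, χ2 selection, resampling, tuning, calibration) sits inside such grouped folds, nested once more for hyperparameter optimization; this sketch shows only the outer grouped evaluation loop.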
Figure 1. Machine learning pipeline. Here, the 119 stated variables relate to predictors (input variables). The whole machine learning pipeline is validated within the double nested cross-validation protocol for model training, hyperparameter optimization, probability calibration, and performance evaluation. The hold-out set was used for an additional check of the model performance on unseen data, as well as for estimating the importance of single variables for the prediction. COST indicates cost-sensitive learning; FS, feature selection; RUS, random undersampling; and SMOTE, synthetic minority oversampling technique.

Results

The best prediction performance was achieved by the linear support vector machine algorithm19 in combination with the Synthetic Minority Oversampling Technique. The AUROC was 0.70 (95% CI, 0.64–0.76), as measured by double nested cross-validation (Figure 2A). The pooled confusion matrix over 5 test cross-validation folds showed specificity of 0.78 and sensitivity of 0.63 (Figure 2B). These 2 metrics were selected as they represent common metrics for evaluating the utility of binary classification models in medical applications. An additional evaluation on the hold-out set showed comparable results (AUROC of 0.72). The performance of the other approaches was markedly lower (Table 2). The learning curve reveals steadily growing AUROC when more data are provided to the learning algorithm (Figure 2C). The training AUROC depicted by the upper line finally comes close to the test AUROC, indicating that the model does not suffer from significant overfitting. The learning curve also shows that model test performance is likely to grow further with more follow-ups, but likely not above the upper bound defined by the training AUROC (0.74). The top 10 variables influencing the predictions, according to the SHAP framework, were widow(er) marital status, received rehabilitation, living situation, age, high calorie food, mean diastolic blood pressure, percutaneous endoscopic gastrostomy, GP home visits, Mini-Mental State Test, and mean systolic blood pressure (Figure 2D). The direction of influence of these variables on the risk of recurrent stroke is illustrated in Figure S3 and discussed in the next section. See Table S2 for a list of all 20 variables in the final model. Model calibration showed significant improvement after applying isotonic regression compared with the uncalibrated support vector machine model (Figure 2E and 2F).
Nonetheless, the calibrated model still considerably overestimates the risk of recurrent stroke, especially in the mid and high ranges of the stroke probability.
Figure 2. Model diagnostics for the linear support vector machine algorithm combined with synthetic minority oversampling technique (SMOTE) resampling. A, Double nested cross-validation receiver operating characteristic curve. B, Pooled normalized confusion matrix. C, Algorithm learning curve. D, SHAP variable importance based on magnitude of variable attributions. E, Original calibration plot without probability calibration. F, Calibration plot after probability calibration using isotonic regression. BP indicates blood pressure; GP, general practitioner; and PEG, percutaneous endoscopic gastrostomy.

Discussion

In this study, we developed a fairly accurate machine learning model for estimating the individual risk of recurrent ischemic stroke within 1 year solely based on easily obtained patient data. To the best of our knowledge, this model is the first to reach an objectively measured AUROC of 0.70 without depending on imaging data. No comprehensive clinical data were available: only very limited medication information and no laboratory values or images. Information on all 20 variables necessary for the risk prediction can be collected via interviews with patients in the home care setting. The dynamic nature of our model makes it possible to recalculate the individual risk whenever new data become available.
To compare our approach with other prediction instruments targeting the same long-term prediction, we reviewed published statistical and machine learning studies and the established clinical risk scores. A meta-analysis estimated AUROC ranges of 0.55 to 0.65 and 0.58 to 0.68 for ESRS and modified ESRS, respectively,7 which are the scores developed for the time horizon of 1 year. Our model showed better performance, although it requires 20 variables to compute a prediction versus 8 for ESRS and 11 for the modified ESRS. At first glance, our best model might seem suboptimal compared with the published performance of the logistic regression (AUROC 0.74)15 and the artificial neural network (AUROC 0.77).17 The performance of both of those approaches, however, was not well established: in the logistic regression approach it was measured on the same data used for training, and in the artificial neural network approach bias was introduced by manually balancing classes in the data before modeling and evaluation. Moreover, both approaches require more comprehensive features, which cannot be easily obtained from the patient at home. The logistic regression approach is dependent on both the clinical values and retinal imaging features. The artificial neural network approach requires demographic, clinical, and medication data as well as features extracted manually from Doppler, CT, MR, or digital subtraction angiography.
We tried to avoid any algorithm selection bias and personal preferences by evaluating a comprehensive set of 25 different algorithms. Where applicable, we combined them with 3 techniques for treating the class imbalance problem, which resulted in 52 machine learning approaches tested. This is one of the major strengths of this study. The prediction performance obtained by the rigorous double nested cross-validation method was additionally confirmed on the separate hold-out test set of patients. It is important to note that our validation method validated not only the prediction model itself but also all other modeling steps included in the machine learning pipeline (Figure 1). By making sure that all follow-ups from a single patient are either in the training or the test set of each double nested cross-validation fold, we prevented potential data leakage and made the performance estimation additionally conservative.
The learning curve for the best model confirmed that no serious overfitting took place. Moreover, it showed that the performance of the linear support vector machine model would likely improve slightly if more data were provided to the learning algorithm. After performance evaluation, the final prediction model was trained on all available data, which was 28% larger than the original training sets. Therefore, it is reasonable to expect even slightly better performance on further unseen, future data.
By applying isotonic regression, the calibration curve of the prediction model was significantly improved. Nevertheless, it indicated a considerable deficit; namely, that the model overestimates the risk in mid and high probability range. However, from a clinical point of view, the risk overestimation is preferred to risk underestimation because false negatives are more critical than false positives.
The SHAP explainable AI framework revealed which variables were most important for the risk assessment. Although this is solely a model-specific indicator of the relative variable importance as measured on the hold-out test set, it is worth noting that several widely recognized stroke and recurrent stroke risk factors appeared among the top 10 predictors, such as patient age and systolic and diastolic blood pressure. The usefulness of these variables in recurrent stroke risk prediction has also been validated in several clinical scores, such as the ESRS and the California Risk Score. This gives confidence that the predictions computed by the prediction model are to a large extent based on known clinically relevant variables. Another informative plot of the SHAP framework is the summary plot, which shows estimated feature effects on the model output (Figure S3). The effects of several variables are intuitive; for example, the lack of poststroke rehabilitative measures, higher age, higher mean diastolic blood pressure, and the need for more general practitioner (GP) home visits are all recognized as risk-increasing factors. Several effects might seem counterintuitive at first: widowed status, percutaneous endoscopic gastrostomy (PEG), living in a house, and high mean systolic blood pressure are estimated as risk-decreasing factors. However, these could be interpreted as follows: widowed patients and patients with PEG might be getting more support from their surroundings (eg, from family, friends, or a care nurse), a potential confounding factor, while living in a house might be a sign of higher socioeconomic status, which may correlate with better physical health in general. High mean systolic blood pressure as a risk-decreasing factor might be an artifact of certain drugs. For example, it is not uncommon that NSAIDs (a potential further confounder) used for the prevention of recurrent stroke raise blood pressure.
Other estimated risk-decreasing factors, including a high calorie diet and lower scores on the Mini-Mental State Test, are not easy to explain and might be artifacts of the imperfect model. The SHAP summary plot describes the behavior of the (imperfect) predictive model and not necessarily the causal relationships between variables.
Several important limitations of this study must be underlined. All data came from the same source (the ESPro registry), and the algorithm was not validated externally. Despite significant calibration improvement using isotonic regression, the final calibration curve was still suboptimal due to risk overestimation. Variable importance and effects were estimated with one method only, while ideally several methods could have been applied and compared. These technical points remain the subject of potential future work. Moreover, the number of available follow-ups was relatively small and the number of initially collected variables high (250). To focus on those variables that are potentially clinically relevant, 2 stroke experts made a preselection of variables to be included in the analysis. In this process, a personal bias could not be excluded. While 4 of the top 5 binary variables in the final model (Table S2) are also selected in all 5 models of the outer double nested cross-validation loop (Table S3), there is still significant variability in the χ2 variable selection. This can be attributed to the relatively small sample size. Finally, further bias might have been created by the inclusion of only expired patients in the analysis, which was one of the data privacy requirements in this study.

Conclusions

Our modeling study showed that a reasonably accurate individual prediction of recurrent stroke within 1 year from the patient interview is feasible using ESPro registry data. The developed model is dynamic and applicable at any time point when the necessary patient data become available, not just at the time point of stroke onset. While related work indicates that possibly more accurate models could be developed using laboratory and imaging data, our prediction model can be used for regular and more frequent initial risk assessments in the patient’s home care setting, potentially as part of a web-based or mobile telehealth solution. Identified high-risk patients could be monitored more closely and potentially advised to consult their treating physician about necessary therapy adjustments. Such software tools could also be beneficial for providers who monitor their patients via structured telemonitoring programs. Risk prediction could enable patient risk stratification, empowering providers to focus on high-risk patients.

Article Information

Supplemental Material

Tables S1–S4
Figures S1–S3
Completed MI-CLAIM (2020) Checklist

Nonstandard Abbreviations and Acronyms

AUROC
area under the receiver operating characteristic curve
ESPro
Erlangen Stroke Registry
ESRS
Essen Stroke Risk Score
SMOTE
synthetic minority oversampling technique


References

1.
Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, Abbasi-Kangevari M, Abbastabar H, Abd-Allah F, Abdelalim A, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396:1204–1222. doi: 10.1016/S0140-6736(20)30925-9
2.
Kernan WN, Viscoli CM, Brass LM, Makuch RW, Sarrel PM, Roberts RS, Gent M, Rothwell P, Sacco RL, Liu RC, et al. The stroke prognosis instrument II (SPI-II): a clinical prediction instrument for patients with transient ischemia and nondisabling ischemic stroke. Stroke. 2000;31:456–462. doi: 10.1161/01.str.31.2.456
3.
Johnston SC, Gress DR, Browner WS, Sidney S. Short-term prognosis after emergency department diagnosis of TIA. JAMA. 2000;284:2901–2906. doi: 10.1001/jama.284.22.2901
4.
Ay H, Gungor L, Arsava EM, Rosand J, Vangel M, Benner T, Schwamm LH, Furie KL, Koroshetz WJ, Sorensen AG. A score to predict early risk of recurrence after ischemic stroke. Neurology. 2010;74:128–135. doi: 10.1212/WNL.0b013e3181ca9cff
5.
Diener HC, Ringleb PA, Savi P. Clopidogrel for the secondary prevention of stroke. Expert Opin Pharmacother. 2005;6:755–764. doi: 10.1517/14656566.6.5.755
6.
Sumi S, Origasa H, Houkin K, Terayama Y, Uchiyama S, Daida H, Shigematsu H, Goto S, Tanaka K, Miyamoto S, et al. A modified Essen stroke risk score for predicting recurrent cardiovascular events: development and validation. Int J Stroke. 2013;8:251–257. doi: 10.1111/j.1747-4949.2012.00841.x
7.
Chaudhary D, Abedi V, Li J, Schirmer CM, Griessenauer CJ, Zand R. Clinical risk score for predicting recurrence following a cerebral Ischemic event. Front Neurol. 2019;10:1106. doi: 10.3389/fneur.2019.01106
8.
Cho EB, Bang OY, Chung CS, Lee KH, Kim GM. Prediction of early ischemic stroke recurrence with multiparametric perfusion markers in symptomatic large artery disease. Cerebrovasc Dis. 2013;35(suppl 3):582. Abstract. doi: 10.1159/000353129
9.
Paciaroni M, Agnelli G, Caso V, Tsivgoulis G, Furie KL, Tadi P, Becattini C, Falocci N, Zedde M, Abdul-Rahim AH, et al. Prediction of early recurrent thromboembolic event and major bleeding in patients with acute stroke and atrial fibrillation by a risk stratification schema: the ALESSA score study. Stroke. 2017;48:726–732. doi: 10.1161/STROKEAHA.116.015770
10.
Chandratheva A, Geraghty OC, Rothwell PM. Poor performance of current prognostic scores for early risk of recurrence after minor stroke. Stroke. 2011;42:632–637. doi: 10.1161/STROKEAHA.110.593301
11.
Thompson DD, Murray GD, Dennis M, Sudlow CL, Whiteley WN. Formal and informal prediction of recurrent stroke and myocardial infarction after stroke: a systematic review and evaluation of clinical prediction models in a new cohort. BMC Med. 2014;12:58. doi: 10.1186/1741-7015-12-58
12.
Mohan KM, Wolfe CD, Rudd AG, Heuschmann PU, Kolominsky-Rabas PL, Grieve AP. Risk and cumulative risk of stroke recurrence: a systematic review and meta-analysis. Stroke. 2011;42:1489–1494. doi: 10.1161/STROKEAHA.110.602615
13.
Zhang C, Zhao X, Wang C, Liu L, Ding Y, Akbary F, Pu Y, Zou X, Du W, Jing J, et al; Chinese IntraCranial AtheroSclerosis (CICAS) Study Group. Prediction factors of recurrent ischemic events in one year after minor stroke. PLoS One. 2015;10:e0120105. doi: 10.1371/journal.pone.0120105
14.
Zhang C, Wang Y, Zhao X, Liu L, Wang C, Pu Y, Zou X, Pan Y, Wong KS, Wang Y; Chinese IntraCranial AtheroSclerosis (CICAS) Study Group. Prediction of recurrent stroke or transient ischemic attack after noncardiogenic posterior circulation ischemic stroke. Stroke. 2017;48:1835–1841. doi: 10.1161/STROKEAHA.116.016285
15.
Yuanyuan Z, Jiaman W, Yimin Q, Haibo Y, Weiqu Y, Zhuoxin Y. Comparison of prediction models based on risk factors and retinal characteristics associated with recurrence one year after ischemic stroke. J Stroke Cerebrovasc Dis. 2020;29:104581. doi: 10.1016/j.jstrokecerebrovasdis.2019.104581
16.
Park MH, Kwon DY, Jung JM. A machine learning approach in prediction of recurrent stroke. Stroke. 2019; 50:AWP530. Abstract. doi: 10.1161/str.50.suppl_1.WP530
17.
Chan KL, Leng X, Zhang W, Dong W, Qiu Q, Yang J, Soo Y, Wong KS, Leung TW, Liu J. Early identification of high-risk TIA or minor stroke using artificial neural network. Front Neurol. 2019;10:171. doi: 10.3389/fneur.2019.00171
18.
Rücker V, Heuschmann PU, O’Flaherty M, Weingärtner M, Hess M, Sedlak C, Schwab S, Kolominsky-Rabas PL. Twenty-year time trends in long-term case-fatality and recurrence rates after ischemic stroke stratified by etiology. Stroke. 2020;51:2778–2785. doi: 10.1161/STROKEAHA.120.029972
19.
Kelleher JD, MacNamee B, Darcy A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. Cambridge, MA: The MIT Press; 2015.
20.
Heuschmann PU, Biegler MK, Busse O, Elsner S, Grau A, Hasenbein U, Hermanek P, Janzen RW, Kolominsky-Rabas PL, Kraywinkel K, et al. Development and implementation of evidence-based indicators for measuring quality of acute stroke care: the Quality Indicator Board of the German Stroke Registers Study Group (ADSR). Stroke. 2006;37:2573–2578. doi: 10.1161/01.STR.0000241086.92084.c0
21.
Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5:42. doi: 10.1186/s40537-018-0151-6
22.
Smolyakov D, Sviridenko N, Ishimtsev V, Burikov E, Burnaev E. Learning ensembles of anomaly detectors on synthetic data. In: Lu H, Tang H, Wang Z, eds. Advances in Neural Networks – ISNN 2019. Lecture Notes in Computer Science. 2019:292–306. doi: 10.48550/arXiv.1905.07892
23.
Holt JM, Wilk B, Birch CL, Brown DM, Gajapathy M, Moss AC, Sosonkina N, Wilk MA, Anderson JA, Harris JM, et al; Undiagnosed Diseases Network. VarSight: prioritizing clinically reported variants with binary classification algorithms. BMC Bioinformatics. 2019;20:496. doi: 10.1186/s12859-019-3026-8
24.
van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67. doi: 10.18637/jss.v045.i03
25.
Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997;1:67–82. doi: 10.1109/4235.585893
26.
Chakravarti N. Isotonic median regression: a linear programming approach. Math Oper Res. 1989;14:303–308. doi: 10.1287/moor.14.2.303
27.
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4768–4777. doi: 10.5555/3295222.3295230
28.
Cawley GC, Talbot NLC. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010;11:2079–2107.
29.
Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32–35. doi: 10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3
30.
Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26:1320–1324. doi: 10.1038/s41591-020-1041-y


Information & Authors

Published In

Stroke
Pages: 2299–2306
PubMed: 35360927


History

Received: 6 July 2021
Revision received: 12 January 2022
Accepted: 17 February 2022
Published online: 1 April 2022
Published in print: July 2022


Keywords

  1. ischemic stroke
  2. machine learning
  3. probability
  4. recurrence
  5. registries

Authors

Affiliations

Asmir Vodencarevic, PhD [email protected]
Digital Health, Siemens Healthcare GmbH, Erlangen, Germany (A.V.).
Michael Weingärtner
Interdisciplinary Center for Health Technology Assessment (HTA) and Public Health, Friedrich-Alexander University Erlangen-Nürnberg, Germany (M.W.).
Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada (J.J.C.).
Health Policy, London School of Economics, United Kingdom (J.J.C.).
Computed Tomography, Siemens Healthcare GmbH, Forchheim, Germany (D.U.).
Marcus Zimmermann-Rittereiser, Dipl-Ing, MBM
Digital Health, Siemens Healthcare GmbH, Erlangen, Germany (M.Z.-R.).
Department of Neurology, University Hospital Erlangen, Germany (S.S.).
Peter Kolominsky-Rabas, MD, PhD, MBA https://orcid.org/0000-0002-7168-058X
Interdisciplinary Center for Health Technology Assessment and Public Health, Friedrich-Alexander University Erlangen-Nürnberg, Germany (P.K.-R.).

Notes

Supplemental Material is available online.
For Sources of Funding and Disclosures, see page 2305.
Correspondence to: Asmir Vodencarevic, PhD, Novartis Pharma GmbH, Roonstr. 25, 90429 Nuremberg, Germany. Email [email protected]

Disclosures

M. Zimmermann-Rittereiser is an employee and a shareholder of Siemens Healthcare GmbH. D. Ukalovic is an employee of Siemens Healthcare GmbH and a shareholder of Mind Medicine and BioNTech. Dr Vodencarevic is an employee of Novartis Pharma GmbH. J.J. Caro is an employee of Evidera. The other authors report no conflicts.

Sources of Funding

The data collection in the Erlangen Stroke Registry is supported (Grant number ZMV I 1-2520KEU305) by the German Federal Ministry of Health (BMG) as part of the National Information System of the Federal Health Monitoring (Gesundheitsberichterstattung des Bundes—GBE). Siemens Healthcare GmbH funded this modeling study and contributed to its design, literature search, data analysis, and writing of the report.


Prediction of Recurrent Ischemic Stroke Using Registry Data and Machine Learning Methods: The Erlangen Stroke Registry
Stroke, Vol. 53, No. 7
