Predicting Falls in the Nursing Homes via Recurrent Neural Network Models

Olga V Kravchenko; Peng Jin; Xiaoyue Shan; Eddie Perez; Paul Munro; Richard D Boyce

doi:10.2217/001c.122435

Plain Language Summary

This study aimed to develop and evaluate recurrent neural network (RNN)-based algorithms for predicting fall probability in patients at long-term skilled nursing facilities, using the Long-Term Care Minimum Data Set (MDS) 3.0 and prescription drug exposure records from the five facilities in Western Pennsylvania. Models were trained to predict falls within 90 days from the day of a completed MDS assessment using the RNN, long-short-term model (LSTM) and gated recurrent units (GRU) architectures. Results were contrasted against previously evaluated hybrid classification and regression trees-logistic regression (CART-logit) model. A ϕK correlation coefficient analysis identified variables that correlated with accurate fall predictions in neural network models but not in the CART-logit model.

RNN, LSTM and GRU showed similar performance (AUROC ≈ 0.74±0.1), slight variations are due to different imputation techniques, outperforming CART-logit model (AUROC = 0.67). Feature analysis identified significant correlation for the following variables: delirium scale (ϕK = 0.63), use of the antipsychotic medication (ϕK = 0.54), exposure to psychotropic medication (ϕK = 0.56) and the cumulative number of days spent in the facility (ϕK = 0.54).

All three models performed significantly better than CART-logit model. This sets up RNN models as the state of the art for the nursing home fall prediction. However, there was only a negligible difference in performance between RNN, GRU and LSTM which we think is due to the fact that our experiment used the MDS events as sequence steps.

Tweetable Abstract

New study: RNN models outperform CART-logit in predicting nursing home falls, highlighting the importance of temporal data.

Introduction

Falls are the leading cause of injury among older adults, making them a significant concern in healthcare. Studies show that nearly a third of adults 65 years and older living in the community experience a fall in any given year.^1,2 This risk is even higher for residents in long-term care and skilled nursing facilities, where an estimated 1.7 falls occur per bed per person-year. This means that in a 100-bed facility, a fall may happen nearly every other day.^3,4

The risk factors for falls encountered by nursing home residents include environmental hazards, underlying health conditions and adverse drug effects.⁵ Many falls may be prevented if the individual risk factors for each nursing home patient can be identified and addressed promptly.⁶ For instance, falls, associated with adverse drug events can be prevented in up to 51% of cases, including up to 72% of the fatal, serious or life-threatening falls.⁷

To implement a practical preventative approach, a multicomponent intervention strategy is crucial. This strategy should involve comprehensive chart reviews, medication adjustments, enhanced staff training, and other appropriate measures. It is essential to tailor this approach to individual patients and their specific risk factors. Identifying patients at an increased risk of a fall and understanding the contributing factors is vital for timely interventions and developing best practices in monitoring these patients. A validated model that utilizes nursing home data to predict the probability of falls in the near future can be a valuable component of a safety monitoring system, providing clinicians with informative, patient-specific and actionable alerts.

In our recently published work, we developed and validated a novel approach for predicting falls within the 90-day window in the nursing homes. This approach utilizes the long-term care minimum data set (MDS) and drug dispensing and administration data, typically available in the electronic form in the nursing homes across the USA.⁸ Our approach is based on a classification and regression trees-logistic regression (CART-logit) model, which predicts fall within 90 days of completing an MDS report for a patient who has been in the facility for at least 7 days. The model achieved an AUROC of 0.668 (95% CI: 0.643–0.693) on the validation subsample and offers a straightforward and transparent representation of the decision pathway. It represents an advancement over the 22 fall risk assessment tools previously evaluated in the nursing home setting as it demonstrates better performance characteristics for the fall prediction window of ≤ 90 days and is the only model designed to utilize features easily obtainable at nearly every facility in the USA.

One limitation of the CART-logit model is that it relies only on data available at the time of an MDS report and does not account for changes in a patient’s condition that may develop over time. In this work, we explored using neural network-based algorithms that have the potential to consider the complex temporal dependencies present in a sequence of MDS data available for a patient to potentially improve fall prediction compared with the CART-logit model. One type of network particularly suited for handling sequential or temporal data is a recurrent neural network (RNN). RNNs have proven successful in leveraging longitudinal electronic health records data for multi-label diagnostic predictions⁹ and predicting 30-day rehospitalization.¹⁰ To our knowledge, our study is the first known attempt to apply RNNs to predict falls using longitudinal data collected in the nursing home setting.

Objectives

This study aimed to develop and evaluate several approaches using RNNs for predicting falls by leveraging MDS data in combination with drug dispensing and administration data. As the MDS data comprises multiple assessments conducted over time, we hypothesized that the RNN models would generally outperform the previously tested CART-logit model by effectively incorporating temporal aspects of the data set. To achieve this, we selected three RNN architectures, namely the basic RNN, long short-term memory (LSTM) and gated recurrent unit (GRU), to compare their performance to CART-logit model and to determine the most effective approach. As an exploratory goal, we sought to gain deeper insights into the features utilized by RNN models that are not accounted for by the CART-logit model.

Methods

Data Sources

The long-term care MDS is a comprehensive health survey that is completed by trained staff for every skilled nursing and long-term care patient in any Medicare-certified nursing home (NH) in the USA.^11,12 MDS data are collected to satisfy government regulatory requirements, calculate facility quality measures and accurately bill for health services. The data set comprises more than 200 variables, including information on assessment type, facility details, patient demographics (such as age, race and marital status), cognitive status (assessed using the Brief Interview of Mental Status¹³), functional status (measured using Katz Activities of Daily Living instrument¹⁴), depression rating (using the Patient Health Questionnaire¹⁵), delirium status (evaluated with the Confusion Assessment Method¹⁶), behavioral status, wandering status, pain status, chronic condition diagnoses, acute condition diagnoses, history of falls, history of injurious falls and exposure to various drugs, such as antidepressants, antipsychotics, anticoagulants and diuretics.

The data for this study were obtained from two time-separated extracts of the MDS 3.0 for Nursing Homes and Swing Bed Providers,¹⁷ both sourced from the University of Pittsburgh Medical Center Senior Communities nursing homes.⁸ Briefly, the first data set included all residents from 2011 to 2013, while the second data set comprised all residents from 2016 to 2018. The gap between the data sets was due to changes in funding for the project and the systems used for MDS data extraction for research. The data set was linked to drug dispensing data for the earlier period and drug administration data for the latter period. The dispensing data included information such as order start and stop dates, and drug identifiers from RxNorm and the Anatomical Therapeutic Chemical Classification System.

The original data set comprised MDS records collected upon patient admission to the NH (entry-tracking records), at the time of discharge (discharge records) and at specified intervals in between (not entry/discharge records). Each MDS record was linked to several variables indicating the antidepressant, antipsychotic and sedative hypnotic drugs dispensed to the patient on the day of the MDS assessment. These variables included the Anatomical Therapeutic Chemical code for each drug, the standardized daily dose and the total count of psychotropic drugs. Note that MDS is designed so that entry-tracking and discharge MDS record types do not include the complete set of variables mentioned above. All data were de-identified to HIPAA limited data set requirements prior to analysis. The University of Pittsburgh, Institutional Review Board, approved the study.

Outcome Measure

The outcome measure used in this study was the MDS data field J1800 (fall since the admission or prior MDS assessment). Previous evidence suggested that J1800 provides more complete data and captured a higher percentage of injurious falls, compared with J1900C (falls with major injury since admission or the prior MDS assessment).¹⁸ The study found that J1800 captured 67.8% (White patients) and 62.6% (non-White patients) of major injury falls for CMS short-stay residents (i.e., those with a stay of ≤ 100 days without a gap in residence > 30 days). Additionally, for CMS long-stay residents, J800 captured 82.8% of major injury falls in White patients and 76.1% of major injury falls in non-White patients.

Data Preprocessing and Study Design

Only the ‘not entry/discharge’ records were utilized in this study since these had the most complete data about a given patient at the time the records were completed by nurse coordinators. The data set comprised 208 features, with 144 of them being categorical. Ordinal features were encoded based on their order in the MDS survey, while categorical features were encoded using one-hot encoding. Date-related features, such as NH entry start and end dates, were converted to the range of days between those dates.

Among the features, 126 had at least one missing value, with 41 features having more than 30% missing values. Two types of missing values were identified. The first type was attributed to the ‘skip-logic’ in the MDS survey, which caused certain MDS questions to be skipped depending on the answers provided to other questions. The second type of missing values resulted from random errors, such as human factors, or unknown reasons. Both types of missing values were handled consistently throughout the data set. Three approaches were employed to address missing values, leading to three experimental designs, which are described as follows.

The first approach involved excluding all features with at least one missing value due to random error, resulting in a remaining total of 99 features.

In the second approach, we transformed the non-missing value $\upsilon$ of the numeric features to a two-dimensional vector ${(\upsilon}_{x},\ \upsilon_{y})$ on a unit circle, where $\upsilon_{x} = \cos{\alpha,\ \ }{\ and\ \upsilon}_{y} = \sin\alpha$ . The angle α was computed as

$\alpha = \frac{\pi}{2}\frac{\upsilon - l}{u - l}\tag{1}$

Here, u and l are the maximum and minimum possible values of the feature, respectively.

$\upsilon_{x} = \cos{\alpha,\ \ }{\ and\ \upsilon}_{y} = \sin\alpha$ represent the mapped values. The origin (0,0) was used to represent missing values so that every non-missing value maintained the same Euclidean distance from missing values. Label encoding was applied before the imputation technique described above for ordinal features. Categorical features were handled by using all zeros to represent missing values during one-hot encoding. The resulting data set encompassed various behavioral features, acute and chronic conditions, and psychotropic drugs, encoded using a one-hot representation.

In the third approach, feature selection was conducted by combining the Gini impurity measure of feature importance with the fall risk factors previously reported in the epidemiology literature. The Gini impurity measure is a statistical metric used to quantify the extent of impurity or disorder within a data set. It assesses the predictive power of each feature by evaluating the purity of the classes associated with that feature. Features with a higher Gini impurity value indicate a greater level of disorder and lower predictive power. Utilizing the Gini impurity measure combined with the literature-derived risk factors, the feature selection algorithm identifies the most informative features for predicting falls while ensuring that the important clinical factors associated with falls are not disregarded.

For each machine learning model, three experiments were carried out, each utilizing one of the three imputation techniques described earlier to account for the missing data. Figure 1 provides a visual representation of the experimental setup.

Machine Learning Model Development and Evaluation

Three models utilized in this work (RNN, LSTM and GRU) belong to a class of artificial neural networks designed to incorporate memory over time steps, enabling them to retain and utilize input information effectively. As a result, they have proven to be particularly suitable for tasks involving sequential data and time series analysis.^19–21

At the core of an RNN is a hidden vector h, which undergoes updates at each time step t, according to:

$h_{t} = \tanh\left( \mathcal{W}\left\lbrack \mathcal{x}_{t},h_{t - 1} \right\rbrack + b \right)]\tag{2}$

Here, h_t₋₁ and h_t respectively represent the hidden state at time steps t − 1 and t, $\mathcal{W}$ is the weight matrix and b is the bias vector. As shown in Equation 2, the hidden state is decided by the current input of the prior state. Consequently, the network gains the ability to accumulate knowledge about the sequential data it processes.

LSTM and GRU networks were developed to enhance the basic RNN approach to overcome the vanishing gradient problem. This problem arises when the impact of the previous layer on the subsequent layer diminishes significantly, hampering effective learning.²² To tackle the vanishing gradient problem, these neural networks use ‘gates’ that regulate the flow of information in the network. These gates assess the importance of inputs in the sequence and store relevant information in the memory unit. GRU employs two gates: the reset gate and the update gate.²³ LSTM adds two additional gates: the forget gate and the output gate.²⁴

At each time step $t$ , the network receives an input vector $x_{t}$ representing the patient’s record. As the number of features is relatively high, a dropout technique was applied in the input layer to reduce overfitting. Then, the input data was passed through the neural network layer with N hidden units, allowing for the stacking of multiple layers. The output of the layer(s), represented by the vector h_t of length N, was then passed through a sigmoid function to generate the prediction $\widehat{y_{t}}$ ranging between 0 and 1, indicating the probability of a patient falling. The output layer is described as

$\widehat{y_{t}} = \sigma\left( \omega_{h}h_{t} + b_{h} \right)\tag{3}$

Here, $\omega_{h}$ is the weight vector and $b_{h}$ is the bias. The parameters updated during the training phase are $\mathcal{W}$ (including $\mathcal{W}_{f}$ , $\mathcal{W}_{i},\ \mathcal{W}_{o}$ and $\mathcal{W}_{c}$ ), $b$ (including $b_{f},\ b_{i},\ b_{o},\ b_{c}$ ), $\omega_{h}$ and $b_{h}$ . Since it is a binary classification problem, binary cross-entropy is employed as the loss function:

$L\left( \mathcal{W,}b,\ \omega_{h},b_{h},X,y \right) = \sum_{t = 1}^{n}{- \left( y_{t}\log\widehat{y_{t}} + \left( 1 - y_{t} \right)\log\left( 1 - \widehat{y_{t}} \right) \right)}\tag{4}$
Here, $X$ is the input matrix consisting of input vectors $x_{t}\mathbf{\ }$ at each time step $t$ , and $y$ is the vector of actual labels $y_{t}$ at every time step $t$ .

For each of the three models, when training the network, the data set was split into batches, with each batch comprising all the records for a single patient. Within each batch, X consisted of the records of one patient, sorted by the acquisition date of the MDS data (predictor-date) and the time step t corresponded to the order number of the record. The records were fed into the network as a sequence and made predictions at every time step.

After training each batch, cross-entropy loss was calculated to update the parameters. The cell state and hidden states were then reset to receive the following sequence. Since our task involved supervised learning, the input at each time step included the current features and the label from the previous time step. This allows the model to capture temporal dependencies and patterns in the data.

We represent the label at time step t as ${\widetilde{y}}_{t}$ . Combining the current input x_t with the label ${\widetilde{y}}_{t}$ , we obtained the input vector $\left\lbrack \mathbf{x}_{\mathbf{t}},{\widetilde{y}}_{t} \right\rbrack$ . Given that no label is available before the first time step, we split ${\widetilde{y}}_{t}$ into two features using one-hot encoding. As a result, the input vector for every first time step became $\left\lbrack \mathbf{x}_{\mathbf{1}},(0,0) \right\rbrack$ , indicating the current input with no previous label.

Performance Metrics

We split the data set 70%/30% for all experiments based on the de-identified patient identifier, for training and testing, respectively. To optimize the performance of each model, we conducted five-fold cross-validation on the training set to fine tune the hyperparameters, aiming to maximize the F₁ score on the test set. The F₁ score was selected as the target metric because it balances recall and precision in classifying true cases.

To evaluate the performance of each model, we employed several metrics on the test data, including AUROC, F₁ score, precision, recall and specificity. The area under the receiver operating characteristic curve (AUROC) was utilized to assess how well each model distinguished between the two prediction classes across all probability thresholds, providing a comprehensive understanding of the model’s discriminatory power.

Feature Attention Comparison Between the Prior CART-Logit and RNNs

A feature analysis was conducted using ϕK to identify variables correlated with fall predictions that were correct in the best-performing recurrent neural model but incorrect in the CART-logit model. This coefficient is an extension of the ϕ coefficient that is commonly used to measure the association between two binary variables. The ϕK coefficient allows for the calculation of correlation between variables with more than two categories. Since both projects utilized the same data set from the same patients over the same time period, we extracted data from two subsets: an overlapping subset of patients where both neural and CART-logit fall predictions were correct and, a subset of patients where neural fall predictions were correct, but the CART-logit predictions were incorrect. To facilitate the comparison, we added columns that identified whether the prediction was correct, with values 1 (fall) and 0 (no fall), and a column that indicated whether a prediction was true or false. We then employed the ϕK correlation coefficient to test for correlation between accurate predictions and various model features. This test generates a pairwise correlation report, indicating the significance of the correlation scores. We considered features with a score above 0.8 to be strongly correlated, while features with a correlation score above 0.5 were considered more relaxed but still indicative of correlation with the correct predictions made by the recurrent neural model. This analysis provided some insights into the features that potentially played a role in more accurate fall prediction by the neural network model.

Experiment Environment

All machine learning models were implemented using Python version 3.7.3, Scikit-learn version 0.21.2, and Keras version 2.2.4 with TensorFlow version 1.14.0 as the backend engine. These packages were installed from Anaconda 4.10.0 running on Ubuntu 20.04.5 LTS. Jupyter notebooks of the analysis are available at https://github.com/dbmi-pitt/Geri-DL

Results

Patient Characteristics

The statistical summary of the data used in this study is presented in Table 1. There were 10,898 ‘not entry/discharge’ records for 3985 patients. Of these, 18.35% of records indicate that the patient had a fall since prior MDS assessment, which corresponds to 22.63% of patients who experienced at least one fall since prior MDS assessment. The maximum number of records per patient was 19 and the average was 2.42.

Table 1.The data set used to build and test the model. The data set combines extracts from the years 2011–2013 and 2016–2018.

Total number of records	10,898
Total number of patients	3985
Average number of records per patient	2.74
Maximum number of records per patient	19
Total number (%) of records with fall incidents	1392 (18.35%)
Total number (%) of patients with fall incidents	709 (22.63%)
Average (median) age of patients	77 (80)
Percentage of female patients	64.41%
Percentage of Whites	82.13%
Percentage of African American	15.42%
Percentage of other races	2.46%
Percentage of patients with cognitive impairments	13.72%
Percentage of patients with impaired functional status	91.54%
Percentage of patients with Alzheimer’s	11.71%
Percentage of patients with impaired transfer	88.13%

Machine Learning Model Performance

Table 2 shows complete performance characteristics for each of the three experiments conducted for RNN, LSTM and GRU models. Performance characteristics for CART-logit model used on the same data set in the prior work were added at the bottom of the table, for comparison. Overall, the RNN, LSTM and GRU models performed better than the CART-logit model, achieving greater AUROC values and demonstrating competitive PPV, sensitivity, specificity and F-measure. RNN, LSTM and GRU models demonstrated similar AUROC values, indicating their ability to distinguish between the two prediction classes. The three neural network-based models also exhibited similar values for PPV, sensitivity, specificity and F-measure, suggesting comparable performance in terms of classification accuracy, precision, recall and balance between the two metrics. However, there are slight variations in performance among the experiments within each model type. Table 2 also shows the threshold utilized for the CART-logit and for neural network-based models. For CART-logit, the threshold was calculated using Youden’s index, which optimizes the balance between sensitivity and specificity.⁸ In all experiments of RNN, LSTM and GRU, the optimal thresholds were computed by maximizing F1 measure which was concordant with how we identified the hyperparameter settings for the three RNN models.

Table 2.Key performance metrics of the neural network-based models RNN, LSTM and GRU, with the previously studied CART-logit model.

Model	AUROC	PPV	Sensitivity	Specificity	F-measure	Threshold
Simple RNN Exp 1	0.73	0.38	0.51	0.83	0.43	0.16
Simple RNN Exp 2	0.74	0.36	0.56	0.80	0.43	0.17
Simple RNN Exp 3	0.74	0.37	0.53	0.82	0.43	0.17
LSTM Exp 1	0.74	0.39	0.53	0.84	0.44	0.20
LSTM Exp 2	0.72	0.36	0.51	0.82	0.42	0.11
LSTM Exp 3	0.74	0.37	0.54	0.82	0.43	0.16
GRU Exp 1	0.75	0.39	0.53	0.84	0.44	0.22
GRU Exp 2	0.74	0.36	0.54	0.80	0.42	0.13
GRU Exp 3	0.74	0.36	0.54	0.81	0.43	0.14
CART-logit	0.668	0.25	0.57	0.69	0.35	0.18

For each model, three experiments were conducted with the data sets that differ by the approach taken to account for the missing values.

Feature Attention Comparison Between the CART-Logit and RNNs

Result of the ϕK test showed that true predictions exhibited a strong positive correlation with a feature indicating the sudden stop of psychotropic drugs (ϕK = 1.0), and a weaker but noticeable correlation with a feature representing cognitive scale (ϕK = 0.6).

To explore whether the MDS cognitive scale affects the predictions of the CART-logit model we conducted a post-hoc change to the model that added this feature in to the logistic regression part of the CART-logit. Table 3 shows comparison of the CART-logit performance model before and after addition of the cognitive scale to the logistic regression analysis. The inclusion of the cognitive scale in the logistic regression model resulted in an improvement in sensitivity; however, it led to a decrease in all other evaluation metrics.

Table 3.Comparison of the CART-logit performance model after adding features identified during feature analysis.

Model	AUROC	PPV	Sensitivity	Specificity	F-measure	Threshold
CART-logit	0.668	0.25	0.57	0.69	0.35	0.18
+ Cognitive scale	0.639	0.18	0.68	0.54	0.28	0.14

Discussion

Principal Findings

The principle finding of this study was that all three RNN models demonstrated notable improvement compared with the CART-logit model in each of the nine experiments across all evaluation metrics. The evaluation results clearly indicate that RNN, LSTM and GRU, exhibited better performance across multiple evaluation metrics, apart from sensitivity, that was slightly higher in the CART-logit model. This suggests that by leveraging their ability to capture sequential patterns, these recurrent network models achieved better performance predicting fall risks among nursing home patients.

The performance of the RNN, LSTM and GRU models was found to be quite similar, with some variations in metrics among different experiments of the same model, representing the difference in treatment of the missing data. This outcome supports our initial hypothesis, indicating that RNN classifiers can be effectively utilized with MDS data, exhibiting solid performance characteristics.

Variations in performance characteristics and thresholds among three experiments for each model can be explained by different approaches to handling missing data. This can lead to variations in the distribution and characteristics of the data used for training the models and result in different performance metrics and optimized thresholds. Since the differences between the three approaches are really small, no inferences about advantages of each approach can be made.

The RNN models exhibiting a higher PPV compared with CART-logit models means that RNN models improved correct prediction rate from approximately 1 in 4, as in CART-logit, to 1 in 3, thus reducing fatigue associated with false positives. The greater specificity of the RNN models shows a better ability to distinguish between patients who will fall and those who will not, reflecting a reduction of false negatives from approximately 31 to 17%. Clinically, it means that less patients who are at risk of falling are incorrectly identified as not being at risk, and thus more patients will receive timely intervention measures. Although sensitivity of the best-performing RNN model is lower than that of the CART-logit model, the overall F-measure indicates that the RNN is a better model when considering both PPV and sensitivity equally, thus making it possible to provide mode reliable predictions in a clinical setting. Overall, these improvements suggest a more efficient use of clinical resources and potentially less strain on healthcare providers.

It is helpful to compare the results of this study, which focuses on machine learning methods for fall prediction, with traditional paper-based fall risk scores because the latter might be simpler and less expensive to implement in many facilities than the former. There are a few studies evaluating low tech/low-cost fall risk assessment methods.²⁵ The most recent study we can find is from 2022 and it evaluated four approaches for predicting falls in nursing facilities–simple tracking of fall history, staff clinical judgment, the Care Home Falls Screen (CaHFRiS) and the Fall Risk Classification Algorithm (FRiCA).²⁶ They evaluated the PPV, sensitivity and specificity at the 1, 3 and 6-month prediction windows. Focusing on the 3-month window, which is the maximum window of our algorithms, PPV ranged from 48 to 52%, sensitivity from 53 to 76%, and specificity from 57 to 74%. In comparison, the best PPV, sensitivity and specificity of the algorithms we tested were 39, 56 and 84%, respectively. This means that the low-tech/low-cost methods had a better PPV, comparable sensitivity and worse specificity than the machine learning algorithms. A more robust comparison is with AUROC metric. The CaHFRiS was the only fall risk score for which an AUROC was calculated, and the best performance for the 3-month prediction window was 0.68. This is comparable to the CART-logit (0.67) from our prior work but lower than the worst performing RNN (0.72) in the current study. In summary, the machine learning algorithms were as good or better than low-tech/low-cost methods on most measures, including AUROC. However, machine learning algorithms have the advantage of being more automated which could lead to monitoring of a larger population more quickly and consistency than paper-based methods.

Translation of a fall prediction algorithm to clinical practice requires consideration of the appropriate placement in the clinical workflow, proper integration into the electronic health records system and further evaluation. As a result, we can only speculate about the potential impact of the models examined in this study on falls prevention. The rate of falls in the nursing home setting is estimated to range from 1.5 to 5 falls per bed per year.^27–30 This means that a nursing home with 50 beds will have between 75 and 250 fall events per year. A recent clinical practice guideline for fall prevention in nursing homes reported that interventions comprising more than two fall prevention measures were shown to reduce falls by an average of 14%.⁵ Assuming that the best-performing RNN (based on AUROC and F-measure) from our study is implemented as a trigger for an evidence-based multifactorial intervention with average performance, we can multiply the sensitivity of the algorithm (53%) times the average intervention prevention performance (14%) to estimate that about 7% (5–18 fall events per year) of the falls would be predicted in advance and avoided. This estimate, while very hypothetical, illustrates the importance of continuing research on fall prediction algorithms for nursing home patients to achieve high sensitivity, PPV and specificity.

Explainability

In exploring model explainability, we initially considered using Explainable AI (XAI) methods implemented in the SHAP python library, such as DeepSHAP, to compare the features estimated to be most important across all our predictive models.^31,32 However, we found that these XAI methods cannot be applied without altering original study design, as they do not accommodate the temporal dynamics of our data and would therefore have a high chance of outputting incorrect feature importance values.³³ Further on, we attempted to use two XAI methods designed for time-series models, ‘Feature Importance In Time’ (FIT) and ‘Temporal Importance Model Explanation’ (TIME).^34,35 However, these XAI methods only work for models that predict a single-time step at a time and were thus unsuitable for this study.^36,37

As an alternative way of looking at feature attention comparison between the CART-logit and RNNs, we explored the relationship between the LSTM model’s features and its accurate predictions using the ϕK coefficient. Our findings revealed a strong positive correlation between true predictions of the LSTM model and a feature indicating the sudden stop of psychotropic drugs, as well as a noticeable correlation between true predictions and a feature representing cognitive scale. However, since the feature indicating the sudden stop of drugs has many missing values and is only available for patients who experienced the transition from a different facility, we believe that this correlation is an anomaly; therefore we have excluded this feature from further analysis. Furthermore, we have added the cognitive scale in the logistic regression part of the CART-logit model, which resulted in improved sensitivity though led to a decrease in all other evaluation metrics. This suggests that the RNNs learn a more nuanced model of fall risks that accounts for sequences of events.

Our primary focus in this study was to improve predictive power of the CART-logit model by adding features that potentially contributed to more accurate RNN predictions, rather than systematically exploring all features that affected RNN results; thus, we analyzed the subset where RNN predictions were correct while CART-logit were incorrect, and not the other way around, because CART-logit predictions were already explainable. A more comprehensive exploration of the interpretability of neural network-based models may reveal valuable insights into the relative importance of features within MDS data sets, particularly regarding changes occurring over time.

Limitations

The study has several limitations that should be taken into consideration, especially regarding the potential application of the models to clinical care. First, although the models demonstrated promising performance in predicting falls, the data set utilized in this study came from the same health care system and prior to the COVID pandemic. Therefore, the performance of these models must be tested before deployment in other healthcare systems.

Another limitation relates to the quality and completeness of the input data. The accuracy of the predictions relies on the accuracy and reliability of the data collected, which can be influenced by variations in how data is recorded across different facilities. Inconsistent or missing data can impact the performance of the models. Furthermore, the models are dependent on the availability of relevant and comprehensive features, and the selection of features used in this study may not capture all the important factors contributing to fall risk in clinical practice. For example, the MDS data set does not contain data from wearable devices, which showed promising results for predicting falls in recent years.^38,39 However, it is important to emphasize, that the MDS data set was chosen for this study due to its availability in the electronic form in nearly every nursing home across the USA.

In terms of model explainability, a significant challenge arises from the black-box nature of neural network models. These models often lack transparency, creating a barrier to their adoption in clinical care, where transparency and understanding of the decision-making process are crucial for gaining trust and acceptance from healthcare providers.

Future Perspective

There are several aspects that we plan to address in our future work. One of them is the limited number of records per patient due to the scheduling of MDS assessments. We hypothesize that incorporating information on drug prescribing changes that occur between consecutive records will expand the data set and enable the models to learn from longer sequences of patient information, potentially enhancing its predictive capabilities. As well, research is needed on how to incorporate entry-tracking records, which only contain partial patient status information. There are two approaches to consider: the first approach involves utilizing an auto-encoder to impute missing values in entry-tracking records, allowing us to derive a more complete representation of the patient’s status. The second approach is to use an embedding layer to capture the essential information while minimizing the impact of missing values.

Another aspect is explainability of the neural network-based models for fall prediction. One promising method, as demonstrated by Marijn Valk²³ involves implementing the neuron attention mechanism, which highlights the relevant features in the model’s decision-making process. Additionally, visualization techniques can be explored to provide a clearer understanding of the hidden representations within the neural network-based models.^36,37

Conclusion

Multiple RNN models were employed to predict which patients are likely to fall within 90 days of their last MDS assessment. The results showed that the RNN, LSTM and GRU models outperformed the previously evaluated CART-logit model, emphasizing the significance of incorporating temporal aspects in fall prediction.

Summary Points

Falls can be prevented if individual risk factors are identified and intervened upon promptly.
The MDS data set, available in the electronic form in Medicare-certified nursing homes across the USA, enables the deployment of fall prediction models at the point of care.
The CART-logit model, previously developed by our research group, showed robust performance (AUROC = 0.67) but did not account for the temporal aspects of patient data.
We developed neural-network-based models that outperform CART-logit model with AUROC ≈ 0.74 ± 0.1, while accounting for changes in the patient’s health that occur over time.
Feature analysis showed a correlation between accurate predictions and various model features, identifying delirium scale (ϕK = 0.63), use of the antipsychotic medication (ϕK = 0.54), exposure to psychotropic medication (ϕK = 0.56) and the cumulative number of days spent in the facility (ϕK = 0.54).
Further research is needed to improve the explainability of the neural-network-based models for fall prediction to make decision-making processes transparent and accessible.

Author Contributions

All co-authors of this work meet criteria for authorship and made significant contribution to this work. OK worked on later versions of coding, performed computations and data analysis and wrote the manuscript. P Jin and X Shan worked on the original version of coding, data quality and experiment design. E Perez worked on the Explainable AI (XAI) methods, feature analysis and results interpretation. P Munro contributed to the theoretical framework and experiment design. RD Boyce conceived the research idea, coordinated and participated in all research activities. All authors have edited and critically evaluated the manuscript.

Acknowledgements

The authors thank R Saka for contributing to feature attendance analysis during a summer high school research experience.

Financial Disclosure

This research was funded in part by the Jewish Healthcare Foundation Regional Autonomous Patient Safety Initiative, the US National Institute on Aging (K01AG044433), NIMH P30 MH90333, Pittsburgh Claude D. Pepper Older Americans Independence Center (NIA P30 AG024827), the UPMC Endowment in Geriatric Psychiatry, the Pharmacy Quality Alliance-CVS Health Foundation Scholars Program, the Pittsburgh Health Data Alliance through the Center for Commercializable Applications and the Jewish Healthcare Foundation, NLM T15 5T15LM007059-37.

Ethical Disclosure

The University of Pittsburgh, Institutional Review Board, approved the study.

Predicting Falls in the Nursing Homes via Recurrent Neural Network Models

Abstract

Background

Methods

Results

Conclusions

Plain Language Summary

Tweetable Abstract

Introduction

Objectives

Methods

Data Sources

Outcome Measure

Data Preprocessing and Study Design

Machine Learning Model Development and Evaluation

Performance Metrics

Feature Attention Comparison Between the Prior CART-Logit and RNNs

Experiment Environment

Results

Patient Characteristics

Machine Learning Model Performance

Feature Attention Comparison Between the CART-Logit and RNNs

Discussion

Principal Findings

Explainability

Limitations

Future Perspective

Conclusion

Summary Points

Author Contributions

Acknowledgements

Financial Disclosure

Ethical Disclosure

References

Predicting Falls in the Nursing Homes via Recurrent Neural Network Models

Abstract

Background

Methods

Results

Conclusions

Plain Language Summary

Tweetable Abstract

Introduction

Objectives

Methods

Data Sources

Outcome Measure

Data Preprocessing and Study Design

Machine Learning Model Development and Evaluation

Performance Metrics

Feature Attention Comparison Between the Prior CART-Logit and RNNs

Experiment Environment

Results

Patient Characteristics

Machine Learning Model Performance

Feature Attention Comparison Between the CART-Logit and RNNs

Discussion

Principal Findings

Explainability

Limitations

Future Perspective

Conclusion

Summary Points

Author Contributions

Acknowledgements

Financial Disclosure

Ethical Disclosure

References

This website uses cookies