Project on Alzheimer’s Disease
By Ken Ogihara
In this project, I develop a model that predicts whether someone has Alzheimer’s Disease with 94% accuracy. The dataset, taken from Kaggle, contains extensive information on 2,149 patients, including demographic details, lifestyle factors, medical history, clinical history, symptoms, cognitive and functional assessments, and a diagnosis of Alzheimer’s Disease.
A model that can predict a diagnosis of Alzheimer’s Disease is particularly helpful because it provides insight into the features and lifestyle factors that have a significant influence on the development of the disease. Moreover, early detection allows for timely intervention, which can slow the progression of the disease. Early detection also has immense benefits for healthcare allocation because it allows doctors to provide the necessary care and support at the right time, thereby preventing the need for intensive treatment later. This dataset contains 35 columns:
| Column | Description |
|---|---|
| PatientID | A unique identifier assigned to each patient (4751 to 6900). |
| Age | The age of the patients ranges from 60 to 90 years. |
| Gender | Gender of the patients, where 0 represents Male and 1 represents Female. |
| Ethnicity | The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other. |
| EducationLevel | The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor’s, 3: Higher. |
| BMI | Body Mass Index of the patients, ranging from 15 to 40. |
| Smoking | Smoking status, where 0 indicates No and 1 indicates Yes. |
| AlcoholConsumption | Weekly alcohol consumption in units, ranging from 0 to 20. |
| PhysicalActivity | Weekly physical activity in hours, ranging from 0 to 10. |
| DietQuality | Diet quality score, ranging from 0 to 10. |
| SleepQuality | Sleep quality score, ranging from 4 to 10. |
| FamilyHistoryAlzheimers | Family history of Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes. |
| CardiovascularDisease | Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes. |
| Diabetes | Presence of diabetes, where 0 indicates No and 1 indicates Yes. |
| Depression | Presence of depression, where 0 indicates No and 1 indicates Yes. |
| HeadInjury | History of head injury, where 0 indicates No and 1 indicates Yes. |
| Hypertension | Presence of hypertension, where 0 indicates No and 1 indicates Yes. |
| SystolicBP | Systolic blood pressure, ranging from 90 to 180 mmHg. |
| DiastolicBP | Diastolic blood pressure, ranging from 60 to 120 mmHg. |
| CholesterolTotal | Total cholesterol levels, ranging from 150 to 300 mg/dL. |
| CholesterolLDL | Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL. |
| CholesterolHDL | High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL. |
| CholesterolTriglycerides | Triglycerides levels, ranging from 50 to 400 mg/dL. |
| MMSE | Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment. |
| FunctionalAssessment | Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| MemoryComplaints | Presence of memory complaints, where 0 indicates No and 1 indicates Yes. |
| BehavioralProblems | Presence of behavioral problems, where 0 indicates No and 1 indicates Yes. |
| ADL | Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| Confusion | Presence of confusion, where 0 indicates No and 1 indicates Yes. |
| Disorientation | Presence of disorientation, where 0 indicates No and 1 indicates Yes. |
| PersonalityChanges | Presence of personality changes, where 0 indicates No and 1 indicates Yes. |
| DifficultyCompletingTasks | Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes. |
| Forgetfulness | Presence of forgetfulness, where 0 indicates No and 1 indicates Yes. |
| Diagnosis | Diagnosis status for Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes. |
| DoctorInCharge | This column contains confidential information about the doctor in charge, with “XXXConfid” as the value for all patients. |
The first five rows of the DataFrame, restricted to a subset of the columns, are shown below:
print(alz.head()[["PatientID", "Age", "Gender", "Ethnicity", "BMI", "Smoking", "PhysicalActivity", "Diagnosis"]].to_markdown(index = False))
| PatientID | Age | Gender | Ethnicity | BMI | Smoking | PhysicalActivity | Diagnosis |
|---|---|---|---|---|---|---|---|
| 4751 | 73 | 0 | 0 | 22.93 | 0 | 6.33 | 0 |
| 4752 | 89 | 0 | 0 | 26.83 | 0 | 7.62 | 0 |
| 4753 | 73 | 0 | 3 | 17.80 | 0 | 7.84 | 0 |
| 4754 | 74 | 1 | 0 | 33.80 | 1 | 8.43 | 0 |
| 4755 | 89 | 0 | 0 | 20.72 | 0 | 6.31 | 0 |
I first created a univariate plot that shows the distribution of patients according to their diagnosis:
My second univariate plot shows the prevalence of Alzheimer’s Disease across ethnic groups:
My third univariate plot shows the prevalence of Alzheimer’s Disease across education levels:
In our dataset, the “MemoryComplaints” variable refers to subjective experiences where individuals express dissatisfaction or concern about their memory function. These concerns are self-reported.
The “Forgetfulness” variable refers to the actual experience of forgetting information or events. It is an observable behavior in which individuals fail to recall something that they had previously learned or experienced.
These bivariate plots along with their respective tables show the distribution of diagnosis based on these two variables:
| Forgetfulness | Proportion diagnosed |
|---|---|
| Does not Forget | 0.353764 |
| Forgets | 0.353395 |
According to the first mosaic plot, the majority of patients are not considered forgetful. In both categories, however, the proportion of patients who are diagnosed is approximately the same (about 35%). This suggests that forgetfulness on its own is not a good indicator of Alzheimer’s Disease.
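The proportions in the table above are group means of the binary `Diagnosis` column. A minimal sketch of that calculation, using a small made-up frame named `toy` in place of the real `alz` DataFrame (the values are illustrative only and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real `alz` DataFrame (illustrative values only).
toy = pd.DataFrame({
    "Forgetfulness": [0, 0, 0, 0, 1, 1, 1, 1],
    "Diagnosis":     [1, 0, 0, 0, 0, 1, 0, 0],
})

# Because Diagnosis is coded 0/1, the mean within each group is the
# proportion of diagnosed patients in that group.
diagnosis_rate = toy.groupby("Forgetfulness")["Diagnosis"].mean()
print(diagnosis_rate)
```

Swapping `"Forgetfulness"` for `"MemoryComplaints"` would give the corresponding breakdown for memory complaints.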
| Memory Complaints | Proportion diagnosed |
|---|---|
| Memory Complaints | 0.639821 |
| No Memory Complaints | 0.278496 |
According to the second mosaic plot, the majority of patients do not report any memory complaints. Among these patients, approximately 28% are diagnosed with Alzheimer’s Disease. Among those who do report memory complaints, on the other hand, 64% are diagnosed. But could a difference this large arise by chance alone? In the next section, I perform a permutation test to find out.
Null hypothesis: The differences are due to random chance. There isn’t actually a relationship between diagnosis and memory complaints.
Alternative hypothesis: The differences are not due to random chance. People who report memory complaints are more likely to be diagnosed with Alzheimer’s Disease.
1. Compute the Test Statistic: I chose a test statistic, computed from the sample, that quantifies the strength of association between memory complaints and Alzheimer’s Disease diagnosis. In this case, I used the difference in the proportion diagnosed between the two groups (patients with memory complaints vs. those without).
2. Create Shuffled Data: I randomly shuffled the “MemoryComplaints” column 1,000 times and recomputed the test statistic on each shuffled dataset.
3. Compute the p-value: The p-value is the proportion of shuffled test statistics that are at least as extreme as the one computed from the actual data. It estimates the probability of observing such a statistic if the null hypothesis is true.
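The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `diff_in_proportions` and `permutation_p_value` are hypothetical, the data is synthetic rather than the real patient records, and the notebook's actual implementation may differ.

```python
import numpy as np

def diff_in_proportions(complaints, diagnosis):
    """Diagnosis rate with complaints minus diagnosis rate without."""
    complaints = np.asarray(complaints).astype(bool)
    diagnosis = np.asarray(diagnosis)
    return diagnosis[complaints].mean() - diagnosis[~complaints].mean()

def permutation_p_value(complaints, diagnosis, n_permutations=1000, seed=0):
    """One-sided p-value: share of shuffles with a statistic >= the observed one."""
    rng = np.random.default_rng(seed)
    observed = diff_in_proportions(complaints, diagnosis)
    shuffled_stats = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling the group labels breaks any real link to the diagnosis.
        shuffled = rng.permutation(complaints)
        shuffled_stats[i] = diff_in_proportions(shuffled, diagnosis)
    return observed, float(np.mean(shuffled_stats >= observed))

# Synthetic example (not the real data): diagnosis is more common with complaints.
rng = np.random.default_rng(42)
complaints = rng.integers(0, 2, size=500)
diagnosis = (rng.random(500) < np.where(complaints == 1, 0.64, 0.28)).astype(int)

observed, p_value = permutation_p_value(complaints, diagnosis)
print(f"observed difference: {observed:.3f}, p_value: {p_value}")
```

The test is one-sided because the alternative hypothesis states a direction (more diagnoses among patients with memory complaints).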
I chose a significance level of 0.05. Here are the results:
print(f"p_value: {p_value}")
p_value: 0.0
None of the 1,000 shuffled datasets produced a difference as large as the observed one, so the estimated p-value is 0.0 (that is, below 1/1,000). Since this is below the 0.05 significance level, we reject the null hypothesis. We conclude that there is enough evidence to suggest that the difference is not due to random chance and that those who report memory complaints are indeed more likely to be diagnosed with Alzheimer’s Disease.
I will predict whether a patient has Alzheimer’s Disease using four different models: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier. Because the classes are imbalanced (only about 35% of patients are diagnosed), plain accuracy can be misleading, so I will evaluate the models with the F1 score. I will also compare the models using recall.
The model should be trained only on features that are relevant to the prediction problem, so I filtered out all variables that have little to no association with diagnosis. I used a correlation matrix to visualize the association between the numerical variables and diagnosis, and a chi-squared test to find the best categorical variables.
We see that “MMSE”, “FunctionalAssessment”, and “ADL” are the top 3 numerical features that are most associated with diagnosis.
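One way to rank numerical features by their association with diagnosis is to take the `Diagnosis` column of the correlation matrix and sort it by absolute value. The sketch below assumes Pearson correlation and uses a synthetic frame named `toy`, in which "MMSE" is deliberately constructed to depend on the diagnosis and "BMI" is pure noise. It does not reproduce the real correlation values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
diagnosis = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    # MMSE is built to be lower for diagnosed patients; BMI is pure noise.
    "MMSE": 25 - 8 * diagnosis + rng.normal(0, 3, size=n),
    "BMI": rng.normal(27, 5, size=n),
    "Diagnosis": diagnosis,
})

# Pearson correlation of every numeric column with the 0/1 diagnosis,
# ranked by absolute value so strong negative links are not overlooked.
corr_with_diagnosis = (
    toy.corr()["Diagnosis"].drop("Diagnosis").abs().sort_values(ascending=False)
)
print(corr_with_diagnosis)
```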
I used a chi-squared test to determine the best categorical features:
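import pandas as pd
from scipy.stats import chi2_contingency

def is_correlated(x, y):
    # Chi-squared test of independence between two categorical columns of alz.
    cross_table = pd.crosstab(index=alz[x], columns=alz[y])
    chi_sq_result = chi2_contingency(cross_table)
    p = chi_sq_result[1]  # element 1 of the result is the p-value
    c = "correlated" if p < 0.05 else "not correlated"
    return (x, p, c)

chi_sq_results = []
for column in categorical_columns:
    result = is_correlated(column, "Diagnosis")
    chi_sq_results.append(result)
# Sort by p-value so the most strongly associated features come first.
chi_sq_results.sort(key=lambda x: x[1])
print(chi_sq_results)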
“MemoryComplaints” and “BehavioralProblems” are the features most strongly associated with diagnosis. For this project, however, I used the top 7 categorical features along with the 3 numerical features identified previously.
Splitting Data: The data is split into training and test sets using train_test_split.
Standardizing Data: The features are standardized to have a mean of 0 and a standard deviation of 1 using StandardScaler.
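The two preprocessing steps above might look like the following sketch. The data is synthetic, and `test_size`, `random_state`, and `stratify` are assumed values chosen for illustration, not settings confirmed by the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)                 # synthetic 0/1 labels

# test_size, random_state and stratify are assumptions, not the author's values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse its mean and
# standard deviation on the test split to avoid leaking test information.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Only the training split is guaranteed to have mean 0 and standard deviation 1 after scaling. The test split will be close to, but not exactly at, those values.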
A function train_and_evaluate is defined to:
Four different classifiers are trained and evaluated:
| | Logistic Regression | Decision Tree Classifier | Random Forest Classifier | Gradient Boosting Classifier |
|---|---|---|---|---|
| Test score | 0.8253 | 0.8996 | 0.9349 | 0.9498 |
| Precision score | 0.7865 | 0.8859 | 0.9399 | 0.9378 |
| Recall score | 0.7143 | 0.8316 | 0.8776 | 0.9235 |
| F1 score | 0.7487 | 0.8579 | 0.9077 | 0.9306 |
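As a consistency check on the table, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the precision and recall values reported above. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the results table.
reported = {
    "Logistic Regression": (0.7865, 0.7143, 0.7487),
    "Decision Tree Classifier": (0.8859, 0.8316, 0.8579),
    "Random Forest Classifier": (0.9399, 0.8776, 0.9077),
    "Gradient Boosting Classifier": (0.9378, 0.9235, 0.9306),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed F1 = {value:.4f}, reported F1 = {reported[name][2]}")
```

The recomputed values agree with the reported F1 scores to within rounding.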
Based on our results, the Gradient Boosting Classifier is the most promising model. It has the highest ROC AUC at 0.94. This means that, given a randomly chosen patient with Alzheimer’s Disease and a randomly chosen patient without it, the model assigns the higher risk score to the patient with the disease about 94% of the time, which indicates strong discriminative power across threshold levels. Further analysis shows that the true positive rate, also known as the recall score, is 92%. This means that the model misses about 8% of the patients who actually have the disease (false negatives). Optimizing recall is especially important in medicine, where a false negative is usually more costly than a false positive.
For the full code and more details, please refer to the Jupyter Notebook.
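To make these metrics concrete, here is a hedged sketch of fitting one classifier and computing them with scikit-learn. The helper `evaluate_classifier` is a hypothetical stand-in, not the project's `train_and_evaluate` function, and the data is generated with `make_classification`, so the printed numbers will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split

def evaluate_classifier(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a model and return common test-set metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC AUC needs a score per patient, so use the predicted probability of class 1.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

# Synthetic data with roughly 35% positives, standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

metrics = evaluate_classifier(
    GradientBoostingClassifier(random_state=0), X_train, X_test, y_train, y_test
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall is computed from the hard 0/1 predictions, while ROC AUC is computed from the predicted probabilities, which is why the two can tell different stories about the same model.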