alzheimers_project

Project on Alzheimer’s Disease

View the Project on GitHub kenogihara/alzheimers_project

Alzheimer’s Disease Prediction

By Ken Ogihara

Introduction

In this project, I develop a model that can predict whether someone has Alzheimer’s Disease with 94% accuracy. The dataset, taken from Kaggle, contains extensive information on 2,149 patients, including demographic details, lifestyle factors, medical history, clinical history, symptoms, cognitive and functional assessments, and a diagnosis of Alzheimer’s Disease.

A model that can predict a diagnosis of Alzheimer’s Disease is particularly helpful because it provides insight into the features and lifestyle factors that have a significant influence on the development of the disease. Moreover, early detection allows for timely intervention, which can slow the progression of the disease. Early detection also has immense benefits for healthcare allocation because it allows doctors to provide the necessary care and support at the right time, thereby reducing the need for intensive treatment later. This dataset contains 35 columns:

| Column | Description |
|---|---|
| PatientID | A unique identifier assigned to each patient (4751 to 6900). |
| Age | The age of the patients ranges from 60 to 90 years. |
| Gender | Gender of the patients, where 0 represents Male and 1 represents Female. |
| Ethnicity | The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other. |
| EducationLevel | The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor’s, 3: Higher. |
| BMI | Body Mass Index of the patients, ranging from 15 to 40. |
| Smoking | Smoking status, where 0 indicates No and 1 indicates Yes. |
| AlcoholConsumption | Weekly alcohol consumption in units, ranging from 0 to 20. |
| PhysicalActivity | Weekly physical activity in hours, ranging from 0 to 10. |
| DietQuality | Diet quality score, ranging from 0 to 10. |
| SleepQuality | Sleep quality score, ranging from 4 to 10. |
| FamilyHistoryAlzheimers | Family history of Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes. |
| CardiovascularDisease | Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes. |
| Diabetes | Presence of diabetes, where 0 indicates No and 1 indicates Yes. |
| Depression | Presence of depression, where 0 indicates No and 1 indicates Yes. |
| HeadInjury | History of head injury, where 0 indicates No and 1 indicates Yes. |
| Hypertension | Presence of hypertension, where 0 indicates No and 1 indicates Yes. |
| SystolicBP | Systolic blood pressure, ranging from 90 to 180 mmHg. |
| DiastolicBP | Diastolic blood pressure, ranging from 60 to 120 mmHg. |
| CholesterolTotal | Total cholesterol levels, ranging from 150 to 300 mg/dL. |
| CholesterolLDL | Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL. |
| CholesterolHDL | High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL. |
| CholesterolTriglycerides | Triglycerides levels, ranging from 50 to 400 mg/dL. |
| MMSE | Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment. |
| FunctionalAssessment | Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| MemoryComplaints | Presence of memory complaints, where 0 indicates No and 1 indicates Yes. |
| BehavioralProblems | Presence of behavioral problems, where 0 indicates No and 1 indicates Yes. |
| ADL | Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| Confusion | Presence of confusion, where 0 indicates No and 1 indicates Yes. |
| Disorientation | Presence of disorientation, where 0 indicates No and 1 indicates Yes. |
| PersonalityChanges | Presence of personality changes, where 0 indicates No and 1 indicates Yes. |
| DifficultyCompletingTasks | Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes. |
| Forgetfulness | Presence of forgetfulness, where 0 indicates No and 1 indicates Yes. |
| Diagnosis | Diagnosis status for Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes. |
| DoctorInCharge | This column contains confidential information about the doctor in charge, with “XXXConfid” as the value for all patients. |

Data Cleaning and Exploratory Data Analysis

  1. Removing irrelevant columns: I dropped DoctorInCharge because its only value, “XXXConfid”, is not helpful in our analysis.
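As a minimal sketch of this step, assuming the data lives in a pandas DataFrame named alz (the name used in the snippets in this write-up), the column can be checked and dropped as follows. The three-row frame here is only a stand-in for the real dataset.

```python
import pandas as pd

# Small stand-in for the real dataset (illustrative rows only).
alz = pd.DataFrame({
    "PatientID": [4751, 4752, 4753],
    "Age": [73, 89, 73],
    "Diagnosis": [0, 0, 0],
    "DoctorInCharge": ["XXXConfid"] * 3,
})

# Confirm the column really holds a single constant value before discarding it.
assert alz["DoctorInCharge"].nunique() == 1

# Drop the uninformative column.
alz = alz.drop(columns=["DoctorInCharge"])
print(alz.columns.tolist())
```

Checking nunique() first guards against accidentally dropping a column that turns out to carry information.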

The first five rows of the DataFrame, restricted to a subset of the columns, are shown below:

print(alz.head()[["PatientID", "Age", "Gender", "Ethnicity", "BMI", "Smoking", "PhysicalActivity", "Diagnosis"]].to_markdown(index = False))
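| PatientID | Age | Gender | Ethnicity | BMI | Smoking | PhysicalActivity | Diagnosis |
|---|---|---|---|---|---|---|---|
| 4751 | 73 | 0 | 0 | 22.93 | 0 | 6.33 | 0 |
| 4752 | 89 | 0 | 0 | 26.83 | 0 | 7.62 | 0 |
| 4753 | 73 | 0 | 3 | 17.80 | 0 | 7.84 | 0 |
| 4754 | 74 | 1 | 0 | 33.80 | 1 | 8.43 | 0 |
| 4755 | 89 | 0 | 0 | 20.72 | 0 | 6.31 | 0 |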

EDA: Univariate Analysis

I first created a univariate plot that shows the distribution of patients according to their diagnosis:

My second univariate plot shows the prevalence of Alzheimer’s Disease across ethnic groups:

My third univariate plot shows the prevalence of Alzheimer’s Disease across education level:

EDA: Bivariate Analysis

In our dataset, the “MemoryComplaints” variable refers to subjective experiences where individuals express dissatisfaction or concern about their memory function. These concerns are self-reported.

The “Forgetfulness” variable refers to the actual experience of forgetting information or events. It is an observable behavior where individuals fail to recall something that they had previously learned or experienced.

These bivariate plots along with their respective tables show the distribution of diagnosis based on these two variables:

Mosaic Plot #1: Relationship Between Forgetfulness and Diagnosis

| Forgetfulness | Diagnosis (proportion diagnosed) |
|---|---|
| Does not Forget | 0.353764 |
| Forgets | 0.353395 |

According to the first mosaic plot, the majority of patients are not considered forgetful. In both categories, however, the proportion of patients who are diagnosed is approximately the same (about 35%). This suggests that the presence of forgetfulness is not a good indicator of Alzheimer’s Disease.
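Because Diagnosis is coded as 0 or 1, the proportion diagnosed within each group is simply the group mean. The following is a minimal sketch on made-up toy data (not the project's dataset), assuming pandas; the group labels mirror the ones in the table above.

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "Forgetfulness": [0, 0, 0, 0, 1, 1, 1, 1],
    "Diagnosis":     [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 column within each group = proportion diagnosed in that group.
proportions = toy.groupby("Forgetfulness")["Diagnosis"].mean()

# Replace the numeric codes with readable labels.
proportions = proportions.rename(index={0: "Does not Forget", 1: "Forgets"})
print(proportions)
```

The same pattern applies to MemoryComplaints or any other binary feature.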

Mosaic Plot #2: Relationship Between MemoryComplaints and Diagnosis

| Memory Complaints | Diagnosis (proportion diagnosed) |
|---|---|
| Memory Complaints | 0.639821 |
| No Memory Complaints | 0.278496 |

According to the second mosaic plot, the majority of patients do not report any memory complaints. Among these patients, approximately 28% are diagnosed with Alzheimer’s Disease. Among those who do report memory complaints, 64% are diagnosed. But is this difference real, or could it have arisen by chance? In the next section, I perform a permutation test to find out.

Hypothesis Testing

Null hypothesis: The differences are due to random chance. There isn’t actually a relationship between diagnosis and memory complaints.

Alternative hypothesis: The differences are not due to random chance. People who report memory complaints are more likely to be diagnosed with Alzheimer’s Disease.

Steps for Permutation Testing:

1. Compute the Test Statistic: I chose a test statistic, computed from the sample, that quantifies the strength of association between memory complaints and Alzheimer’s Disease diagnosis. In this case, I used the difference in the proportion diagnosed between the two groups (patients with memory complaints vs. those without).

2. Create Shuffled Data: I randomly shuffled the “MemoryComplaints” column 1,000 times and recomputed the test statistic for each shuffled dataset.

3. Compute the p-value: This determines the probability of observing a test statistic as extreme as the one computed from the actual data, assuming the null hypothesis is true.
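The three steps above can be sketched as follows. This is an illustration on synthetic data, not the notebook's exact code: the function names, the random seeds, and the choice of a one-sided p-value (matching the alternative hypothesis) are assumptions.

```python
import numpy as np

def diff_in_proportions(complaints, diagnosis):
    """Proportion diagnosed with complaints minus proportion diagnosed without."""
    return diagnosis[complaints == 1].mean() - diagnosis[complaints == 0].mean()

def permutation_test(complaints, diagnosis, n_permutations=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: test statistic on the observed data.
    observed = diff_in_proportions(complaints, diagnosis)
    # Step 2: shuffle the group labels and recompute the statistic each time.
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(complaints)
        simulated[i] = diff_in_proportions(shuffled, diagnosis)
    # Step 3: share of shuffled statistics at least as large as the observed one.
    p_value = np.mean(simulated >= observed)
    return observed, p_value

# Synthetic data whose diagnosis rates loosely mimic the table above (64% vs. 28%).
rng = np.random.default_rng(42)
complaints = rng.integers(0, 2, size=500)
diagnosis = (rng.random(500) < np.where(complaints == 1, 0.64, 0.28)).astype(int)

observed, p_value = permutation_test(complaints, diagnosis)
print(f"observed difference: {observed:.3f}, p_value: {p_value}")
```

Shuffling the labels breaks any real link between the two columns, so the shuffled statistics show what differences look like when the null hypothesis holds.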

I chose a significance level of 0.05. Here are the results:

print(f"p_value: {p_value}")

p_value: 0.0

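None of the 1,000 shuffled datasets produced a difference as large as the one observed, which is why the estimated p-value is 0.0 (more precisely, it is below 0.001, the resolution of 1,000 permutations). Since this is below the significance level of 0.05, we reject the null hypothesis. We conclude that there is enough evidence to suggest that the differences are not due to random chance and that those who report memory complaints are indeed more likely to be diagnosed with Alzheimer’s Disease.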

Framing a Prediction Problem

I will predict whether a patient has Alzheimer’s Disease using four different models: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier. Because the classes are imbalanced (roughly 35% of patients are diagnosed), plain accuracy can be misleading, so I will use the F1-score as the main evaluation metric. I will also compare the models using recall.

The model should be trained only on features that are relevant to the prediction problem, so I filtered out variables that have little to no association with diagnosis. I used a correlation matrix to visualize the association between the numerical variables and diagnosis, and a chi-squared test to find the best categorical variables.

We see that “MMSE”, “FunctionalAssessment”, and “ADL” are the three numerical features most strongly associated with diagnosis.
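One way to obtain such a ranking is to take the Diagnosis column of the correlation matrix and sort by absolute value. The sketch below uses synthetic data in which a column named MMSE is tied to the diagnosis by construction and BMI is not; the numbers are made up and only the technique carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
diagnosis = rng.integers(0, 2, size=n)

# Synthetic stand-in: MMSE is lower for diagnosed patients, BMI is unrelated.
toy = pd.DataFrame({
    "MMSE": 20 - 5 * diagnosis + rng.normal(0, 3, n),
    "BMI": rng.normal(27, 4, n),
    "Diagnosis": diagnosis,
})

# Correlation of every numerical feature with the target, strongest first.
correlations = toy.corr()["Diagnosis"].drop("Diagnosis")
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation (lower score, higher risk) is just as informative as a strong positive one.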

I used a chi-squared test to determine the best categorical features:

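import pandas as pd
from scipy.stats import chi2_contingency

# `alz` (the cleaned DataFrame) and `categorical_columns` (a list of column
# names) are assumed to be defined earlier in the notebook.

def is_correlated(x, y):
    """Run a chi-squared test of independence between columns x and y of alz."""
    cross_table = pd.crosstab(index=alz[x], columns=alz[y])
    p = chi2_contingency(cross_table)[1]  # index 1 of the result is the p-value
    label = "correlated" if p < 0.05 else "not correlated"
    return (x, p, label)

# Test every categorical column against the target, smallest p-value first.
chi_sq_results = [is_correlated(column, "Diagnosis") for column in categorical_columns]
chi_sq_results.sort(key=lambda result: result[1])

print(chi_sq_results)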

Top Categorical Features Based On Chi-Squared Test Results:

  1. (‘MemoryComplaints’, 1.5266050985264054e-45, ‘correlated’)
  2. (‘BehavioralProblems’, 4.731446795211873e-25, ‘correlated’)
  3. (‘Ethnicity’, 0.09780307184026778, ‘not correlated’)
  4. (‘Hypertension’, 0.11808887156379336, ‘not correlated’)
  5. (‘FamilyHistoryAlzheimers’, 0.14069795394928386, ‘not correlated’)
  6. (‘Diabetes’, 0.16224495200138433, ‘not correlated’)
  7. (‘CardiovascularDisease’, 0.1628367346921118, ‘not correlated’)
  8. (‘EducationLevel’, 0.21650771973324673, ‘not correlated’)
  9. (‘Disorientation’, 0.27978377696750084, ‘not correlated’)
  10. (‘Gender’, 0.35381831348465786, ‘not correlated’)
  11. (‘HeadInjury’, 0.3603226855585838, ‘not correlated’)
  12. (‘PersonalityChanges’, 0.37175710638032144, ‘not correlated’)
  13. (‘Confusion’, 0.4045413830124688, ‘not correlated’)
  14. (‘DifficultyCompletingTasks’, 0.7198556855473033, ‘not correlated’)
  15. (‘Depression’, 0.8283335436917469, ‘not correlated’)
  16. (‘Smoking’, 0.860493227376371, ‘not correlated’)
  17. (‘Forgetfulness’, 1.0, ‘not correlated’)

“MemoryComplaints” and “BehavioralProblems” are the only categorical features significantly associated with diagnosis at the 0.05 level. For the sake of this project, however, I used the top 7 categorical features along with the 3 numerical features established previously.
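A minimal sketch of this selection, using the first nine entries of the chi-squared output above. The variable names top_categorical, numerical_features, and selected_features are illustrative, not taken from the notebook.

```python
# First nine (feature, p-value, label) tuples from the chi-squared output.
chi_sq_results = [
    ("MemoryComplaints", 1.5266050985264054e-45, "correlated"),
    ("BehavioralProblems", 4.731446795211873e-25, "correlated"),
    ("Ethnicity", 0.09780307184026778, "not correlated"),
    ("Hypertension", 0.11808887156379336, "not correlated"),
    ("FamilyHistoryAlzheimers", 0.14069795394928386, "not correlated"),
    ("Diabetes", 0.16224495200138433, "not correlated"),
    ("CardiovascularDisease", 0.1628367346921118, "not correlated"),
    ("EducationLevel", 0.21650771973324673, "not correlated"),
    ("Disorientation", 0.27978377696750084, "not correlated"),
]

# Keep the seven categorical features with the smallest p-values.
ranked = sorted(chi_sq_results, key=lambda result: result[1])
top_categorical = [name for name, p, label in ranked[:7]]

# Add the three numerical features identified from the correlation matrix.
numerical_features = ["MMSE", "FunctionalAssessment", "ADL"]
selected_features = top_categorical + numerical_features

# Count how many features pass the 0.05 threshold.
n_significant = sum(1 for name, p, label in chi_sq_results if p < 0.05)
print(selected_features)
print(n_significant)
```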

Final Model

Data Preparation

Model Training and Evaluation

A function train_and_evaluate is defined to fit a given model on the training data and evaluate it on the test data.

Model Comparisons

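Four different classifiers are trained and evaluated: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier.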

Model Performance Metrics

| Metric | Logistic Regression | Decision Tree Classifier | Random Forest Classifier | Gradient Boosting Classifier |
|---|---|---|---|---|
| Test score | 0.8253 | 0.8996 | 0.9349 | 0.9498 |
| Precision score | 0.7865 | 0.8859 | 0.9399 | 0.9378 |
| Recall score | 0.7143 | 0.8316 | 0.8776 | 0.9235 |
| F1 score | 0.7487 | 0.8579 | 0.9077 | 0.9306 |
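Metrics like the ones in the table above could be produced by a helper along the following lines. This is a hypothetical sketch using scikit-learn on synthetic data: the signature of train_and_evaluate, the split parameters, and the reading of "Test score" as mean accuracy (what model.score returns for scikit-learn classifiers) are all assumptions, not the notebook's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return the four metrics shown in the table."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "Test score": model.score(X_test, y_test),  # mean accuracy (assumed meaning)
        "Precision score": precision_score(y_test, y_pred),
        "Recall score": recall_score(y_test, y_pred),
        "F1 score": f1_score(y_test, y_pred),
    }

# Synthetic, imbalanced stand-in for the selected features and the Diagnosis target.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.65, 0.35], random_state=0)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

metrics = train_and_evaluate(LogisticRegression(max_iter=1000),
                             X_train, X_test, y_train, y_test)
print(metrics)
```

Passing the model in as an argument lets the same helper be reused for all four classifiers.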

Based on our results, the Gradient Boosting Classifier is the most promising model. It has the highest ROC AUC at 0.94. This means that if we pick one patient with Alzheimer’s Disease and one without at random, there is a 94% chance that the model assigns the higher risk score to the patient with the disease. Because ROC AUC summarizes performance across all classification thresholds, it indicates strong overall discriminative power. Further analysis shows that the true positive rate, also known as recall, is 92%. This means that the model misses about 8% of the patients who actually have the disease, incorrectly labeling them as negative. Optimizing recall is especially important in medicine, where a false negative (a missed diagnosis) is usually more costly than a false positive.
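ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, and the miss rate (false negative rate) is one minus recall. The sketch below illustrates both in plain Python. The scores and labels are made up, and roc_auc_by_ranking is an illustrative helper, not a library function.

```python
def roc_auc_by_ranking(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc_by_ranking(scores, labels)
print(f"ROC AUC: {auc:.3f}")

# Miss rate for the Gradient Boosting recall reported in the table.
recall = 0.9235
miss_rate = 1 - recall
print(f"miss rate: {miss_rate:.4f}")
```

In the toy example, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.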

Full Code

For the full code and more details, please refer to the Jupyter Notebook.