Alzheimer’s Disease Prediction

Project on Alzheimer’s Disease

By Ken Ogihara

Introduction

In this project, I develop a model that predicts whether a patient has Alzheimer’s Disease with 94% accuracy. The dataset, taken from Kaggle, contains extensive information on 2,149 patients, including demographic details, lifestyle factors, medical history, clinical history, symptoms, cognitive and functional assessments, and a diagnosis of Alzheimer’s Disease.

A model that can predict a diagnosis of Alzheimer’s Disease is particularly helpful because it provides insight into the features and lifestyle factors that have a significant influence on the development of the disease. Moreover, early detection allows for timely intervention, which can slow the progression of the disease. Early detection also has immense benefits for healthcare allocation because it allows doctors to provide the necessary care and support at the right time, thereby preventing the need for intensive treatment later.

The dataset contains 35 columns:

Column	Description
PatientID	A unique identifier assigned to each patient (4751 to 6900).
Age	The age of the patients ranges from 60 to 90 years.
Gender	Gender of the patients, where 0 represents Male and 1 represents Female.
Ethnicity	The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other.
EducationLevel	The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor’s, 3: Higher.
BMI	Body Mass Index of the patients, ranging from 15 to 40.
Smoking	Smoking status, where 0 indicates No and 1 indicates Yes.
AlcoholConsumption	Weekly alcohol consumption in units, ranging from 0 to 20.
PhysicalActivity	Weekly physical activity in hours, ranging from 0 to 10.
DietQuality	Diet quality score, ranging from 0 to 10.
SleepQuality	Sleep quality score, ranging from 4 to 10.
FamilyHistoryAlzheimers	Family history of Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes.
CardiovascularDisease	Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes.
Diabetes	Presence of diabetes, where 0 indicates No and 1 indicates Yes.
Depression	Presence of depression, where 0 indicates No and 1 indicates Yes.
HeadInjury	History of head injury, where 0 indicates No and 1 indicates Yes.
Hypertension	Presence of hypertension, where 0 indicates No and 1 indicates Yes.
SystolicBP	Systolic blood pressure, ranging from 90 to 180 mmHg.
DiastolicBP	Diastolic blood pressure, ranging from 60 to 120 mmHg.
CholesterolTotal	Total cholesterol levels, ranging from 150 to 300 mg/dL.
CholesterolLDL	Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.
CholesterolHDL	High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.
CholesterolTriglycerides	Triglycerides levels, ranging from 50 to 400 mg/dL.
MMSE	Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment.
FunctionalAssessment	Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment.
MemoryComplaints	Presence of memory complaints, where 0 indicates No and 1 indicates Yes.
BehavioralProblems	Presence of behavioral problems, where 0 indicates No and 1 indicates Yes.
ADL	Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment.
Confusion	Presence of confusion, where 0 indicates No and 1 indicates Yes.
Disorientation	Presence of disorientation, where 0 indicates No and 1 indicates Yes.
PersonalityChanges	Presence of personality changes, where 0 indicates No and 1 indicates Yes.
DifficultyCompletingTasks	Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes.
Forgetfulness	Presence of forgetfulness, where 0 indicates No and 1 indicates Yes.
Diagnosis	Diagnosis status for Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes.
DoctorInCharge	This column contains confidential information about the doctor in charge, with “XXXConfid” as the value for all patients.

Data Cleaning and Exploratory Data Analysis

Removing irrelevant columns: I dropped DoctorInCharge because its only value, “XXXConfid”, carries no information for the analysis.
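As a minimal sketch of this step, the snippet below finds columns that hold a single value and drops them. The tiny `alz_demo` DataFrame and its values are made up for illustration and are not the real data:

```python
import pandas as pd

# Tiny stand-in for the real dataset; these rows are invented for illustration.
alz_demo = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Age": [70, 82, 65],
    "Diagnosis": [0, 1, 0],
    "DoctorInCharge": ["XXXConfid"] * 3,
})

# A column with only one unique value cannot help distinguish patients.
constant_columns = [col for col in alz_demo.columns if alz_demo[col].nunique() == 1]

alz_demo = alz_demo.drop(columns=constant_columns)
print(constant_columns)
print(list(alz_demo.columns))
```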

The first five rows of the DataFrame, restricted to a subset of the columns, are shown below:

print(alz.head()[["PatientID", "Age", "Gender", "Ethnicity", "BMI", "Smoking", "PhysicalActivity", "Diagnosis"]].to_markdown(index = False))

PatientID	Age	Gender	Ethnicity	BMI	Smoking	PhysicalActivity
4751	73	0	0	22.93	0	6.33
4752	89	0	0	26.83	0	7.62
4753	73	0	3	17.80	0	7.84
4754	74	1	0	33.80	1	8.43
4755	89	0	0	20.72	0	6.31

EDA: Univariate Analysis

I first created a univariate plot that shows the distribution of patients according to their diagnosis:

My second univariate plot shows the prevalence of Alzheimer’s Disease across ethnicities:

My third univariate plot shows the prevalence of Alzheimer’s Disease across education levels:

EDA: Bivariate Analysis

In our dataset, the “MemoryComplaints” variable refers to subjective experiences where individuals express dissatisfaction or concern about their memory function. These concerns are self-reported.

The “Forgetfulness” variable refers to the actual experience of forgetting information or events. It is an observable behavior where individuals fail to recall something that they had previously learned or experienced.

These bivariate plots along with their respective tables show the distribution of diagnosis based on these two variables:

Mosaic Plot #1: Relationship Between Forgetfulness and Diagnosis

Forgetfulness	Proportion Diagnosed
Does not Forget	0.353764
Forgets	0.353395

According to the first mosaic plot, the majority of patients are not considered forgetful. In both categories, however, the proportion of patients who are diagnosed is approximately the same (about 35%). This suggests that the presence of forgetfulness alone is not a good indicator of Alzheimer’s Disease.
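A table like the one above can be computed as the mean of the 0/1 diagnosis column within each group. This is only a sketch on made-up rows; the `demo` DataFrame and the `labels` mapping are illustrative assumptions:

```python
import pandas as pd

# Invented example rows; 1 means "Yes" for both columns.
demo = pd.DataFrame({
    "Forgetfulness": [0, 0, 0, 0, 1, 1, 1, 1],
    "Diagnosis":     [0, 1, 0, 0, 1, 0, 0, 0],
})

# Readable group names (assumed labels, mirroring the table above).
labels = {0: "Does not Forget", 1: "Forgets"}

# The mean of a 0/1 column is the proportion of 1s, i.e. the share diagnosed.
rates = demo.groupby(demo["Forgetfulness"].map(labels))["Diagnosis"].mean()
print(rates)
```

The same call with “MemoryComplaints” in place of “Forgetfulness” would give the corresponding table for memory complaints.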

Mosaic Plot #2: Relationship Between MemoryComplaints and Diagnosis

Memory Complaints	Proportion Diagnosed
Memory Complaints	0.639821
No Memory Complaints	0.278496

According to the second mosaic plot, the majority of patients do not report any memory complaints. Among these patients, approximately 28% are diagnosed with Alzheimer’s Disease. Among those who do report memory complaints, 64% are diagnosed with Alzheimer’s Disease. But could a gap this large arise by chance alone? In the next section, I perform a permutation test to find out.

Hypothesis Testing

Null hypothesis: The differences are due to random chance. There isn’t actually a relationship between diagnosis and memory complaints.

Alternative hypothesis: The differences are not due to random chance. People who report memory complaints are more likely to develop Alzheimer’s Disease.

Steps for Permutation Testing:

1. Compute the Test Statistic: I chose a test statistic, computed from the sample, that quantifies the strength of association between memory complaints and Alzheimer’s Disease diagnosis. In this case, I used the difference in the proportion of diagnosed patients between the two groups (patients with memory complaints vs. those without).

2. Create Shuffled Data: I randomly shuffled the “MemoryComplaints” column 1,000 times and recomputed the test statistic for each shuffled dataset.

3. Compute the p-value: This determines the probability of observing a test statistic as extreme as the one computed from the actual data, assuming the null hypothesis is true.
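The three steps above can be sketched as follows. This is an illustrative version on synthetic data, not the notebook's exact code: the function names, the random seeds, the synthetic group rates, and the choice of a one-sided p-value (to match the alternative hypothesis) are all assumptions.

```python
import numpy as np

def diff_in_proportions(diagnosis, complaints):
    """Diagnosis rate with memory complaints minus the rate without."""
    diagnosis = np.asarray(diagnosis)
    complaints = np.asarray(complaints)
    return diagnosis[complaints == 1].mean() - diagnosis[complaints == 0].mean()

def permutation_test(diagnosis, complaints, n_permutations=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: test statistic on the observed data.
    observed = diff_in_proportions(diagnosis, complaints)
    # Step 2: shuffle the group labels and recompute the statistic each time.
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(complaints)
        simulated[i] = diff_in_proportions(diagnosis, shuffled)
    # Step 3: share of shuffles at least as extreme as the observed value
    # (one-sided, an assumption that matches the alternative hypothesis).
    p_value = np.mean(simulated >= observed)
    return observed, p_value

# Synthetic data whose group rates loosely mimic the table above.
rng = np.random.default_rng(42)
complaints = rng.binomial(1, 0.2, size=500)
diagnosis = rng.binomial(1, np.where(complaints == 1, 0.64, 0.28))

observed, p_value = permutation_test(diagnosis, complaints)
print(f"observed difference: {observed:.3f}")
print(f"p_value: {p_value}")
```

Note that with 1,000 shuffles the smallest non-zero p-value this procedure can report is 0.001, so a printed value of 0.0 should be read as "below 0.001".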

I chose a significance level of 0.05. Here are the results:

print(f"p_value: {p_value}")

p_value: 0.0

With 1,000 shuffles, a reported p-value of 0.0 means that none of the shuffled datasets produced a difference as large as the one we observed, so the true p-value is below 0.001. Because this is well below our significance level of 0.05, we reject the null hypothesis. We conclude that there is enough evidence to suggest that the differences are not due to random chance and that those who report memory complaints are indeed more likely to be diagnosed with Alzheimer’s Disease.

Framing a Prediction Problem

I will predict whether a patient has Alzheimer’s Disease using four different models: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier. Because the classes are imbalanced (roughly 35% of patients are diagnosed), I will evaluate the models with the F1-score rather than relying on accuracy alone. I will also compare the models using recall.

The model should be trained only on features that are relevant to the prediction problem, so I filtered out variables that have little to no association with diagnosis. I used a correlation matrix to visualize the association between the numerical variables and diagnosis, and a chi-squared test to find the best categorical variables.

We see that “MMSE”, “FunctionalAssessment”, and “ADL” are the top 3 numerical features that are most associated with diagnosis.
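One way to read such a ranking off a correlation matrix is to take the “Diagnosis” column and sort by absolute value. The sketch below uses synthetic data in which a made-up “MMSE” column is built to depend on diagnosis and a made-up “BMI” column is pure noise; none of the numbers come from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
diagnosis = rng.integers(0, 2, size=n)

# Synthetic stand-ins: MMSE is constructed to drop for diagnosed patients,
# BMI is unrelated to diagnosis.
demo = pd.DataFrame({
    "Diagnosis": diagnosis,
    "MMSE": 25 - 8 * diagnosis + rng.normal(0, 3, size=n),
    "BMI": rng.uniform(15, 40, size=n),
})

# Pearson correlation of every numerical feature with the diagnosis.
correlations = demo.corr()["Diagnosis"].drop("Diagnosis")

# Rank by strength of association, ignoring the sign.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```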

I used a chi-squared test to determine the best categorical features:

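import pandas as pd
from scipy.stats import chi2_contingency

def is_correlated(x, y):
    # Cross-tabulate the two categorical columns of the alz DataFrame.
    cross_table = pd.crosstab(index=alz[x], columns=alz[y])
    # chi2_contingency returns (statistic, p-value, dof, expected frequencies).
    p = chi2_contingency(cross_table)[1]
    c = "correlated" if p < 0.05 else "not correlated"
    return (x, p, c)

# Test every categorical column against the diagnosis.
chi_sq_results = []
for column in categorical_columns:
    result = is_correlated(column, "Diagnosis")
    chi_sq_results.append(result)

# Sort by p-value, smallest (strongest evidence of association) first.
chi_sq_results.sort(key=lambda x: x[1])

print(chi_sq_results)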

Top Categorical Features Based On Chi-Squared Test Results:

('MemoryComplaints', 1.5266050985264054e-45, 'correlated')
('BehavioralProblems', 4.731446795211873e-25, 'correlated')
('Ethnicity', 0.09780307184026778, 'not correlated')
('Hypertension', 0.11808887156379336, 'not correlated')
('FamilyHistoryAlzheimers', 0.14069795394928386, 'not correlated')
('Diabetes', 0.16224495200138433, 'not correlated')
('CardiovascularDisease', 0.1628367346921118, 'not correlated')
('EducationLevel', 0.21650771973324673, 'not correlated')
('Disorientation', 0.27978377696750084, 'not correlated')
('Gender', 0.35381831348465786, 'not correlated')
('HeadInjury', 0.3603226855585838, 'not correlated')
('PersonalityChanges', 0.37175710638032144, 'not correlated')
('Confusion', 0.4045413830124688, 'not correlated')
('DifficultyCompletingTasks', 0.7198556855473033, 'not correlated')
('Depression', 0.8283335436917469, 'not correlated')
('Smoking', 0.860493227376371, 'not correlated')
('Forgetfulness', 1.0, 'not correlated')

“MemoryComplaints” and “BehavioralProblems” are the only categorical features with a statistically significant association with diagnosis at the 0.05 level. For the sake of this project, however, I used the top 7 categorical features along with the 3 numerical features established previously.

Final Model

Data Preparation

Splitting Data: The data is split into training and test sets using train_test_split.
Standardizing Data: The features are standardized to have a mean of 0 and a standard deviation of 1 using StandardScaler.
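
A minimal sketch of these two steps is shown below on random stand-in data. The test size, stratification, and random seed are assumptions rather than the notebook's actual settings. The scaler is fitted on the training set only, so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in features and labels (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training data, then apply the same transform to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```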

Model Training and Evaluation

A function train_and_evaluate is defined to:

Train the classifier.
Make predictions.
Evaluate the model performance using metrics such as Test Score, Precision, Recall, and F1 Score.
Plot the ROC curve to visualize the performance of the classifier.
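
The sketch below shows one possible shape for such a function. The signature, the returned dictionary, and the reading of “Test Score” as plain accuracy are assumptions; the notebook's version may differ. To keep the example free of plotting dependencies, it returns the ROC curve points instead of drawing them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

def train_and_evaluate(classifier, X_train, y_train, X_test, y_test):
    """Fit the classifier and return its test-set metrics (hypothetical signature)."""
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # Probability of the positive class, needed for the ROC curve.
    y_score = classifier.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "test_score": accuracy_score(y_test, y_pred),  # assumed to mean accuracy
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "roc_curve": (fpr, tpr),  # pass to a plotting library to draw the curve
    }

# Synthetic, mildly imbalanced data standing in for the real features.
X, y = make_classification(n_samples=400, n_features=6, weights=[0.65, 0.35],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

metrics = train_and_evaluate(LogisticRegression(max_iter=1000),
                             X_train, y_train, X_test, y_test)
print({k: round(v, 4) for k, v in metrics.items() if k != "roc_curve"})
```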

Model Comparisons

Four different classifiers are trained and evaluated:

Logistic Regression
Decision Tree
Random Forest
Gradient Boosting
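
As a rough illustration of the comparison, the loop below fits the same four scikit-learn classifier types on synthetic data and collects recall and F1. The hyperparameters, seeds, and data are assumptions, so the numbers it prints are unrelated to the results reported below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 35% positives.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.65, 0.35],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the same split so the scores are comparable.
results = {}
for name, clf in classifiers.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(f"{name}: recall={scores['recall']:.4f}, f1={scores['f1']:.4f}")
```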

Model Performance Metrics

	Logistic Regression	Decision Tree Classifier	Random Forest Classifier	Gradient Boosting Classifier
Test score	0.8253	0.8996	0.9349	0.9498
Precision score	0.7865	0.8859	0.9399	0.9378
Recall score	0.7143	0.8316	0.8776	0.9235
F1 score	0.7487	0.8579	0.9077	0.9306

Based on these results, the Gradient Boosting Classifier is the most promising model. It has the highest ROC AUC at 0.94, which means that if we pick one patient with Alzheimer’s Disease and one without at random, the model assigns the higher predicted risk to the patient with the disease about 94% of the time. Because the ROC AUC summarizes performance across all classification thresholds, it indicates strong overall discriminative power. The true positive rate, also known as recall, is 92%, so the model misses about 8% of the patients who actually have the disease. Optimizing recall is especially important in medicine, where a false negative (a missed diagnosis) is usually more costly than a false positive.
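
To make these two quantities concrete, here is a small worked example on eight made-up patients. The helper names `recall` and `pairwise_auc` are illustrative, and the scores are invented. Recall is the share of actual positives that are caught, and ROC AUC equals the fraction of (positive, negative) pairs in which the positive patient receives the higher score.

```python
import numpy as np

def recall(y_true, y_pred):
    """True positives divided by all actual positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Invented labels and predicted risks for eight patients.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])
y_pred = (scores >= 0.5).astype(int)  # classify at a 0.5 threshold

r = recall(y_true, y_pred)
auc = pairwise_auc(y_true, scores)
print(f"recall: {r}")                  # 3 of 4 positives caught -> 0.75
print(f"false negative rate: {1 - r}") # 1 of 4 positives missed -> 0.25
print(f"ROC AUC: {auc}")               # 13 of 16 pairs ranked correctly -> 0.8125
```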

Full Code

For the full code and more details, please refer to the Jupyter Notebook.