Diabetes affects over 38 million Americans and remains one of the most preventable yet fastest-growing chronic conditions globally. While clinical treatments continue to advance, a deeper understanding of the behavioural and physiological factors that drive diabetes risk is essential for early intervention and population-level prevention. This project applies an end-to-end data science pipeline to the CDC Behavioral Risk Factor Surveillance System (BRFSS), one of the largest ongoing population health surveys in the United States, covering 70,000+ respondents across all 50 states. Using Python, six machine learning algorithms, and a suite of statistical and visual analysis techniques, this project identifies the strongest predictors of diabetes risk, quantifies population-level exposure, and builds interpretable models that can support clinical and policy decision-making.
What are we going to analyze?
This project investigates three interconnected questions: Which behavioural and clinical factors most strongly predict diabetes risk? How do these factors interact with each other across a large, representative US population? And which machine learning approach produces the most reliable predictions when evaluated rigorously on held-out data? The dataset includes 20 variables spanning clinical measurements (BMI, Blood Pressure, General Health Rating), behavioural indicators (Physical Activity, Smoking, Alcohol Consumption, Fruit and Vegetable intake), socioeconomic factors (Education, Income), and healthcare access indicators (Health Insurance, Cost Barriers to Care). The target variable, Diabetes_binary, is a binary outcome indicating whether a respondent has been diagnosed with diabetes or prediabetes.
We analyze how diabetes risk varies across the population using scatter plots with OLS trendlines, age group distribution charts, and sex distribution breakdowns. Key clinical variables like BMI, Age group, General Health Rating, and Physical Health Days are plotted against the diabetes outcome to reveal the direction and strength of each relationship.
The project investigates 20 behavioural and clinical variables as potential risk factors. A correlation heatmap using Seaborn visualises the pairwise relationships between all key clinical variables and the diabetes outcome.
Six classification algorithms are trained, evaluated, and compared side by side: Logistic Regression, Random Forest, XGBoost via Gradient Boosting, Decision Tree, K-Nearest Neighbours, and Support Vector Machine.
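A side-by-side comparison of this kind can be sketched as follows. This is a minimal illustration, not the project's actual training code: it assumes scikit-learn is available, uses a synthetic dataset in place of BRFSS, substitutes scikit-learn's GradientBoostingClassifier for XGBoost, and picks ROC AUC on a held-out split as the comparison metric. The hyperparameters are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 20 BRFSS features (assumption: not the real data).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Distance- and margin-based models get a scaler; tree-based models do not need one.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "K-Nearest Neighbours": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = roc_auc_score(y_test, proba)

for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} ROC AUC = {auc:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair. On an imbalanced health outcome, a threshold-free metric such as ROC AUC is often more informative than raw accuracy.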
The pair plot provides a comprehensive view of how BMI, Age, General Health Rating, and Physical Health Days interact with each other and with diabetes status simultaneously. Each scatter plot in the off-diagonal panels shows the joint distribution of two variables, coloured by diabetes outcome: blue for non-diabetic, red for diabetic. KDE (Kernel Density Estimate) curves along the diagonal show how each variable distributes differently between the two populations. Where the red and blue distributions separate clearly in a diagonal panel, that variable carries strong discriminative signal. Where they overlap heavily, the variable alone is insufficient to distinguish between groups. A tight clustering or clear separation of coloured points in any off-diagonal scatter panel reveals an interaction effect between two variables that is not visible from univariate analysis alone. The pair plot directly informs which feature combinations are most informative for the classification models and serves as an intuitive visual explainer of why ensemble methods that capture feature interactions outperform simple linear models on this dataset.
Both the horizontal and vertical axes list the same set of variables. The diagonal panels show the distribution of each variable on its own, and the off-diagonal scatter plots show the relationship between each pair of variables.
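The idea that separated density curves signal a useful feature can be put into numbers with an overlap coefficient, the area shared by the two class densities. The sketch below is an illustration only: it uses a hand-written Gaussian KDE with a normal-reference bandwidth, and the two "features" are synthetic values, not BRFSS columns.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

def overlap_coefficient(a, b, n_grid=512):
    """Area shared by the two class densities: near 1 = identical, near 0 = disjoint."""
    pooled = np.concatenate([a, b])
    # Normal-reference rule of thumb on the pooled sample, shared by both classes.
    bandwidth = 1.06 * pooled.std() * len(pooled) ** (-1 / 5)
    grid = np.linspace(pooled.min() - 3 * bandwidth, pooled.max() + 3 * bandwidth, n_grid)
    dens_a = gaussian_kde(a, grid, bandwidth)
    dens_b = gaussian_kde(b, grid, bandwidth)
    step = grid[1] - grid[0]
    # Riemann-sum approximation of the integral of min(density_a, density_b).
    return float(np.minimum(dens_a, dens_b).sum() * step)

rng = np.random.default_rng(0)
# Synthetic values (assumption): a feature whose mean shifts with diabetes status...
bmi_no, bmi_yes = rng.normal(27, 5, 2000), rng.normal(32, 6, 2000)
# ...and a feature distributed identically in both groups.
noise_no, noise_yes = rng.normal(0, 1, 2000), rng.normal(0, 1, 2000)

bmi_overlap = overlap_coefficient(bmi_no, bmi_yes)
noise_overlap = overlap_coefficient(noise_no, noise_yes)
print(f"BMI-like overlap:   {bmi_overlap:.2f}")
print(f"Noise-like overlap: {noise_overlap:.2f}")
```

A lower overlap means the diagonal curves for the two groups sit further apart, so the feature is more useful on its own.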
Key Insights: Our analysis reveals distinct patterns within the data, offering valuable insights into how various factors contribute to diabetes progression. By examining these relationships, we can identify individuals who may be at higher risk for developing or experiencing more severe complications from the disease. This information empowers healthcare professionals to create targeted prevention and treatment plans, potentially mitigating the impact of diabetes on individual patients and healthcare systems as a whole.
BMI and General Health Rating together produce the clearest visual separation between diabetic and non-diabetic populations. Age interacts strongly with Physical Health Days: older respondents with more poor physical health days show markedly higher diabetes prevalence. These interaction patterns are precisely what Random Forest and XGBoost are designed to capture through recursive feature splitting.
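An interaction of this kind can be checked by tabulating prevalence across the four combinations of two binary risk indicators. The sketch below uses synthetic data with an interaction deliberately built in. The variable names, group definitions, and risk values are assumptions for illustration, not figures from the BRFSS analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
older = rng.random(n) < 0.5        # True = older age group (synthetic)
poor_health = rng.random(n) < 0.3  # True = many poor physical health days (synthetic)

# Assumed risk model: two main effects plus extra risk when both factors are present.
risk = 0.08 + 0.07 * older + 0.07 * poor_health + 0.18 * (older & poor_health)
diabetes = rng.random(n) < risk

# prevalence[i, j]: share of diabetic respondents with older == i and poor_health == j.
prevalence = np.zeros((2, 2))
for i in (0, 1):
    for j in (0, 1):
        mask = (older == bool(i)) & (poor_health == bool(j))
        prevalence[i, j] = diabetes[mask].mean()

# Difference-in-differences: zero if the two factors simply add up.
interaction = prevalence[1, 1] - prevalence[1, 0] - prevalence[0, 1] + prevalence[0, 0]
print(np.round(prevalence, 3))
print(f"Interaction contrast: {interaction:.3f}")
```

A contrast well above zero means the combined group carries more risk than the two factors would predict separately. Tree-based models can represent this with successive splits, while a linear model needs an explicit interaction term.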
Visualizing the Web of Connections:
Each cell within the correlation matrix represents the correlation coefficient between two specific variables. These coefficients, ranging from -1 to +1, quantify the strength and direction of the linear relationship between the variables.
A coefficient close to +1 suggests a strong positive correlation, indicating that as one variable increases, the other tends to increase as well.
A coefficient close to -1 signifies a strong negative correlation, implying that an increase in one variable is often accompanied by a decrease in the other.
Values closer to 0 indicate a weaker or even nonexistent linear relationship between the variables.
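The three cases above can be reproduced on a small synthetic example. The column names and the way the columns are generated are assumptions chosen to produce one positive, one negative, and one near-zero relationship. They are not drawn from the BRFSS data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Synthetic columns (assumption): activity falls and diabetes risk rises with BMI.
bmi = rng.normal(28, 6, n)
phys_activity = (rng.random(n) < 1 / (1 + np.exp((bmi - 28) / 6))).astype(float)
diabetes = (rng.random(n) < 1 / (1 + np.exp(-(bmi - 32) / 4))).astype(float)
unrelated = rng.normal(0, 1, n)

names = ["BMI", "PhysActivity", "Unrelated", "Diabetes_binary"]
data = np.column_stack([bmi, phys_activity, unrelated, diabetes])

# Pearson correlation matrix; rowvar=False treats each column as one variable.
corr = np.corrcoef(data, rowvar=False)

print(" " * 16 + "".join(f"{name:>16}" for name in names))
for name, row in zip(names, corr):
    print(f"{name:>16}" + "".join(f"{value:>16.2f}" for value in row))
```

Passing a matrix like this to Seaborn's heatmap function draws the coloured grid, and each cell is read exactly as described above.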
By deciphering the patterns within the correlation matrix, we can glean valuable insights into how these factors are interrelated. For instance, a strong positive correlation between age and blood sugar levels might suggest an increased risk of diabetes with advancing age. Conversely, a negative correlation between physical activity and blood pressure could indicate the potential benefits of exercise in managing blood pressure and potentially mitigating diabetes risk.
By employing the correlation matrix as a roadmap, we can navigate the complexities of diabetes, paving the way for a more comprehensive understanding and ultimately, more effective strategies for managing and preventing this chronic condition.
A Multifaceted Exploration
The analysis explores the multifaceted landscape of diabetes risk, using several data visualizations to show the interplay between key factors. A few critical aspects are examined below.
vs. Blood Pressure (BP)
This scatter plot investigates the potential relationship between blood pressure (BP) and diabetes progression. Each data point represents an individual, with their blood pressure reading on the horizontal axis and their diabetes progression score on the vertical axis. Analyzing the distribution of points can reveal potential trends. A positive correlation would suggest that higher blood pressure readings might be associated with a greater risk of diabetes progression.
vs. Age
This scatter plot depicts the potential association between age and the development or severity of diabetes. Each data point represents an individual, with their age plotted on the horizontal axis and their diabetes progression score on the vertical axis. By examining the distribution of these points, we can identify trends or patterns. A positive correlation would suggest that as age increases, so too does the risk of diabetes progression.
vs. Body Mass Index (BMI)
This visualization explores the potential link between body mass index (BMI) and diabetes progression. Similar to the age scatter plot, each data point represents an individual. Here, BMI is plotted on the horizontal axis, and diabetes progression is on the vertical axis. Observing any patterns or trends in the distribution of points can provide insights into how BMI might influence diabetes risk.
vs. Blood Sugar (Glucose)
This visualization focuses on the potential link between blood sugar levels (glucose) and the development or severity of diabetes. Each data point represents an individual, with their blood sugar level on the horizontal axis and their diabetes progression score on the vertical axis. By examining the spread of data points, we can identify trends or patterns. A positive correlation would suggest that higher blood sugar levels might be associated with a greater risk of diabetes progression.
In our quest to understand and potentially predict the course of diabetes, we leverage the power of linear regression models. This graph depicts the model's prediction for diabetes progression, offering insights into the relationship between one or more independent variables and the dependent variable, which is likely a measure of diabetes severity.
Decoding the Landscape:
The horizontal axis (X-axis): This axis typically represents the independent variable or factors that are believed to influence diabetes progression. In this case, it might be a single factor like age, BMI, or blood sugar level, or it could represent a combination of these factors combined into a single score.
The vertical axis (Y-axis): This axis represents the dependent variable, which is the predicted value of diabetes progression. The model essentially calculates a best-fit line through the data points, and this line represents the predicted progression based on the values of the independent variable(s).
Data Points: The scattered data points represent individual participants within the dataset. The position of each point reflects the measured value of the independent variable on the X-axis and the corresponding level of diabetes progression on the Y-axis.
The Regression Line: The diagonal line superimposed on the scatter plot represents the model's predicted fit. Ideally, this line should trend closely with the distribution of the data points. The closer the data points cluster around the line, the stronger the correlation between the independent variable(s) and the predicted diabetes progression.
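The elements described above (independent variable, dependent variable, data points, and best-fit line) can be sketched with an ordinary least-squares fit. The data here are synthetic and the coefficients are arbitrary assumptions. R-squared is used as one common way to quantify how closely the points cluster around the line.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumption): a BMI-like predictor and a noisy linear outcome.
x = rng.uniform(18, 45, 300)
y = 3.0 * x + 20.0 + rng.normal(0, 15, 300)

# Ordinary least squares for a straight line; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R-squared: the share of outcome variance explained by the fitted line.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.2f}")
```

For a single predictor, R-squared equals the square of the Pearson correlation coefficient, which ties the regression line back to the correlation matrix.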
This model serves as a foundational tool, providing insights into the potential relationship between the chosen variable(s) and diabetes progression. Further analysis with more complex models might be necessary to capture the nuances of diabetes, a condition influenced by multiple factors.
This project's data-driven approach to diabetes progression analysis holds the potential to revolutionize patient care. By uncovering key factors and patterns, we can empower healthcare professionals to:
Develop individualized treatment plans: Tailored approaches based on patient-specific risk factors and progression patterns.
Predict and prevent complications: Early identification of high-risk patients allows for proactive measures to minimize complications.
Improve patient education and self-management: Insights from data visualizations can be used to create targeted educational materials for patients.
This comprehensive analysis, coupled with compelling data visualizations, aims to illuminate the complexities of diabetes progression and pave the way for a more personalized and effective approach to managing this chronic condition, ultimately improving patient outcomes and reducing the overall burden of diabetes on healthcare systems.
This project demonstrates a reproducible, production-ready clinical analytics pipeline that adapts to any labeled population health dataset. Column detection, model training, and output generation all run automatically, regardless of whether the input is the CDC BRFSS dataset, the Pima Indians dataset, or any equivalent clinical CSV.
The pipeline is structured to support the analytical workflows found in healthcare data science teams (exploratory validation, model benchmarking, stakeholder-ready visualisation, and audit-ready documentation), making it directly transferable to clinical program evaluation, population health analytics, and value-based care environments.
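One plausible way to implement dataset-agnostic target detection is to match column names against a short list of known label names. This is a hypothetical sketch, not the project's actual logic: the function name, the candidate list, and the example headers are assumptions. The Pima header assumes the commonly distributed CSV in which the label column is called Outcome.

```python
# Candidate label names, in priority order (assumed; extend for other datasets).
TARGET_CANDIDATES = ("diabetes_binary", "outcome", "diabetes", "target")

def detect_target_column(columns):
    """Return the first column whose lower-cased name matches a known target name."""
    lookup = {col.strip().lower(): col for col in columns}
    for candidate in TARGET_CANDIDATES:
        if candidate in lookup:
            return lookup[candidate]
    raise ValueError(f"No target column found among: {list(columns)}")

# Example headers (abbreviated, for illustration only).
brfss_cols = ["Diabetes_binary", "HighBP", "BMI", "Age"]
pima_cols = ["Pregnancies", "Glucose", "BMI", "Age", "Outcome"]

print(detect_target_column(brfss_cols))  # Diabetes_binary
print(detect_target_column(pima_cols))   # Outcome
```

Raising an error when no candidate matches is a deliberate choice. A pipeline that silently guesses the label column could train on the wrong target.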