Using Machine Learning to Predict the Risk of Type 2 Diabetes

classification, logistic regression, LASSO regression, random forest, decision tree, naive Bayes, KNN, XGBoost

Type 2 diabetes is one of the most prevalent chronic diseases in the United States. It affects the health of millions of people and places an enormous financial burden on the US economy.

Mark Y


This “assignment” was inspired by the work of Xie Z, Nikolayeva O, Luo J, and Li D, "Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques." Their paper can be accessed via this link.

My objective is to practice and learn how to build predictive models using machine learning techniques, in the spirit of the original study, but using the most recent survey data (2022). It would be a bonus if my models came close to the performance of Dr Xie’s.

To recap, the original definition of an individual with Type 2 diabetes is:
- an individual aged 30 years or older (respondents younger than 30 years old were excluded as they most likely had Type 1 diabetes),
- an individual who had been told by a healthcare professional that he/she had Type 2 diabetes,
- respondents who had pre-diabetes, or respondents who had diabetes while pregnant, were excluded from the study.
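The case definition above can be expressed as a simple filter. The following is a minimal sketch on toy data, assuming the DIABETE4 coding of the BRFSS 2022 codebook (1 = yes, 2 = only during pregnancy, 3 = no, 4 = pre-diabetes) and a numeric age column. The helper name `apply_t2d_definition` and the toy values are illustrative and do not come from the original study.

```r
# Hypothetical helper: apply the original study's case definition to a data frame
# with columns `age` (years) and `diabete4` (BRFSS code).
apply_t2d_definition <- function(d) {
  # keep respondents aged 30+; drop gestational (2), pre-diabetes (4),
  # and don't know / refused (7, 9) by keeping only codes 1 and 3
  keep <- !is.na(d$age) & d$age >= 30 & d$diabete4 %in% c(1, 3)
  out <- d[keep, , drop = FALSE]
  out$t2d <- factor(ifelse(out$diabete4 == 1, "yes", "no"),
                    levels = c("no", "yes"))
  out
}

# toy respondents (made-up values)
toy <- data.frame(
  age      = c(25, 45, 52, 61, 38, 70),
  diabete4 = c(1, 1, 3, 4, 2, 9)
)

result <- apply_t2d_definition(toy)
print(result)
```

Only the 45-year-old with code 1 and the 52-year-old with code 3 survive the filter in this toy example.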

rm(list = ls())
# Set packages and dependencies
pacman::p_load("tidyverse", #for tidy data science practice
               "tidymodels", "workflows",# for tidy machine learning
               "pacman", #package manager
               "devtools", #developer tools
               "Hmisc", "skimr", "broom", "modelr",#for EDA
               "jtools", "huxtable", "interactions", # for EDA
               "ggthemes", "ggstatsplot", "GGally",
               "scales", "gridExtra", "patchwork", "ggalt", "vip",
               "ggstance", "ggfortify", # for ggplot
               "DT", "plotly", #interactive Data Viz
               # ML-related packages that support tidymodels
               "usemodels", "poissonreg", "agua", "sparklyr", "dials", # computational engines and tuning helpers
               "doParallel", # for parallel processing (speedy computation)
               "ranger", "xgboost", "glmnet", "kknn", "earth", "klaR", "discrim", "naivebayes", # model engines
               "janitor", "lubridate", "haven")

Data Source

I obtained the latest available Behavioral Risk Factor Surveillance System data (BRFSS 2022) from the Centers for Disease Control and Prevention.

The Behavioral Risk Factor Surveillance System (BRFSS) is the US’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The BRFSS 2022 data from the CDC is stored as a SAS transport (.XPT) file. I imported it into R using read_xpt from the haven package. It has 445,132 rows, each representing an individual survey response, and 328 columns, each representing a variable.

df <- read_xpt("LLCP2022.XPT")

I included most of the independent variables from the original study, as well as several new variables of interest. Below is a summary of dependent and independent variables used:

| Variable | Description | Values |
|---|---|---|
| diabete4 | (Ever told) (you had) diabetes? | yes, no |
| bmi5cat | Four categories of BMI (body mass index) | 1. underweight, 2. normal weight, 3. overweight, 4. obese |
| smoker3 | Four levels of smoker status | 1. everyday smoker, 2. someday smoker, 3. former smoker, 4. non-smoker |
| cvdstrk3 | (Ever told) (you had) a stroke? | 1. yes, 2. no |
| cvdcrhd4 | (Ever told) (you had) angina or coronary heart disease? | 1. yes, 2. no |

GENHLTH
Question: Would you say that in general your health is:

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Excellent | 71,878 | 16.15 | 17.40 |
| 2 | Very good | 148,444 | 33.35 | 31.84 |
| 3 | Good | 143,598 | 32.26 | 32.48 |
| 4 | Fair | 60,273 | 13.54 | 13.69 |
| 5 | Poor | 19,741 | 4.43 | 4.29 |
| 7 | Don’t know/Not Sure | 810 | 0.18 | 0.19 |
| 9 | Refused | 385 | 0.09 | 0.10 |
| BLANK | Not asked or Missing | 3 | . | . |

_AGEG5YR

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Age 18 to 24. Notes: 18 <= AGE <= 24 | 26,941 | 6.05 | 11.90 |
| 2 | Age 25 to 29. Notes: 25 <= AGE <= 29 | 21,990 | 4.94 | 7.72 |
| 3 | Age 30 to 34. Notes: 30 <= AGE <= 34 | 25,807 | 5.80 | 9.38 |
| 4 | Age 35 to 39. Notes: 35 <= AGE <= 39 | 28,526 | 6.41 | 7.63 |
| 5 | Age 40 to 44. Notes: 40 <= AGE <= 44 | 29,942 | 6.73 | 8.41 |
| 6 | Age 45 to 49. Notes: 45 <= AGE <= 49 | 28,531 | 6.41 | 6.49 |
| 7 | Age 50 to 54. Notes: 50 <= AGE <= 54 | 33,644 | 7.56 | 7.72 |
| 8 | Age 55 to 59. Notes: 55 <= AGE <= 59 | 36,821 | 8.27 | 7.31 |
| 9 | Age 60 to 64. Notes: 60 <= AGE <= 64 | 44,511 | 10.00 | 8.67 |
| 10 | Age 65 to 69. Notes: 65 <= AGE <= 69 | 47,099 | 10.58 | 6.98 |
| 11 | Age 70 to 74. Notes: 70 <= AGE <= 74 | 43,472 | 9.77 | 6.32 |
| 12 | Age 75 to 79. Notes: 75 <= AGE <= 79 | 32,518 | 7.31 | 4.37 |
| 13 | Age 80 or older. Notes: 80 <= AGE <= 99 | 36,251 | 8.14 | 4.94 |
| 14 | Don’t know/Refused/Missing. Notes: 7 <= AGE <= 9 | 9,079 | 2.04 | 2.15 |

_BMI5CAT
Question: Four-categories of Body Mass Index (BMI)

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Underweight. Notes: _BMI5 < 1850 (_BMI5 has 2 implied decimal places) | 6,778 | 1.71 | 2.03 |
| 2 | Normal Weight. Notes: 1850 <= _BMI5 < 2500 | 116,976 | 29.52 | 30.50 |
| 3 | Overweight. Notes: 2500 <= _BMI5 < 3000 | 139,995 | 35.32 | 34.14 |
| 4 | Obese. Notes: 3000 <= _BMI5 < 9999 | 132,577 | 33.45 | 33.32 |
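The `_BMI5CAT` cut-offs above operate on `_BMI5`, which stores BMI with 2 implied decimal places (so 2750 means 27.50). As a minimal sketch of how those thresholds map raw values to categories, the hypothetical helper `bmi_category` below reproduces the codebook rule on made-up values; it is an illustration, not how the CDC computes the variable.

```r
# Hypothetical helper: map raw _BMI5 values (2 implied decimals) to the four
# _BMI5CAT categories using the codebook thresholds.
bmi_category <- function(bmi5_raw) {
  cut(bmi5_raw,
      breaks = c(-Inf, 1850, 2500, 3000, 9999),
      labels = c("underweight", "normal weight", "overweight", "obese"),
      right = FALSE)  # intervals are [lower, upper), as in the codebook
}

raw <- c(1720, 1850, 2499, 2500, 3150)  # made-up raw values
bmi_table <- data.frame(raw = raw,
                        bmi = raw / 100,
                        category = bmi_category(raw))
print(bmi_table)
```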

CHECKUP1
Question: About how long has it been since you last visited a doctor for a routine checkup?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Within past year (anytime less than 12 months ago) | 350,944 | 78.84 | 74.97 |
| 2 | Within past 2 years (1 year but less than 2 years ago) | 41,919 | 9.42 | 10.74 |
| 3 | Within past 5 years (2 years but less than 5 years ago) | 24,882 | 5.59 | 6.75 |
| 4 | 5 or more years ago | 19,079 | 4.29 | 5.13 |
| 7 | Don’t know/Not sure | 5,063 | 1.14 | 1.39 |
| 8 | Never | 2,509 | 0.56 | 0.83 |
| 9 | Refused | 733 | 0.16 | 0.20 |

INCOME3
Question: Is your annual household income from all sources: (If respondent refuses at any income level, code ‘Refused.’)

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Less than $10,000 | 10,341 | 2.39 | 2.95 |
| 2 | Less than $15,000 ($10,000 to < $15,000) | 11,031 | 2.55 | 2.43 |
| 3 | Less than $20,000 ($15,000 to < $20,000) | 14,300 | 3.31 | 3.44 |
| 4 | Less than $25,000 ($20,000 to < $25,000) | 20,343 | 4.71 | 4.71 |
| 5 | Less than $35,000 ($25,000 to < $35,000) | 42,294 | 9.79 | 9.92 |
| 6 | Less than $50,000 ($35,000 to < $50,000) | 46,831 | 10.84 | 10.20 |
| 7 | Less than $75,000 ($50,000 to < $75,000) | 59,148 | 13.69 | 12.42 |
| 8 | Less than $100,000 ($75,000 to < $100,000) | 48,436 | 11.21 | 10.42 |
| 9 | Less than $150,000 ($100,000 to < $150,000) | 50,330 | 11.65 | 11.19 |
| 10 | Less than $200,000 ($150,000 to < $200,000) | 22,553 | 5.22 | 5.39 |
| 11 | $200,000 or more | 23,478 | 5.43 | 6.13 |
| 77 | Don’t know/Not sure | 36,114 | 8.36 | 10.44 |
| 99 | Refused | 47,001 | 10.87 | 10.37 |
| BLANK | Not asked or Missing | 12,932 | . | . |

FLUSHOT7
Question: During the past 12 months, have you had either flu vaccine that was sprayed in your nose or flu shot injected into your arm?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 209,256 | 52.11 | 44.53 |
| 2 | No (Go to Section 15.03 PNEUVAC4) | 188,755 | 47.01 | 54.46 |
| 7 | Don’t know/Not Sure (Go to Section 15.03 PNEUVAC4) | 2,455 | 0.61 | 0.69 |
| 9 | Refused (Go to Section 15.03 PNEUVAC4) | 1,073 | 0.27 | 0.32 |
| BLANK | | 43,593 | . | . |

EMPLOY1
Question: Are you currently…?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Employed for wages | 186,004 | 42.38 | 47.34 |
| 2 | Self-employed | 38,768 | 8.83 | 9.46 |
| 3 | Out of work for 1 year or more | 8,668 | 1.97 | 2.54 |
| 4 | Out of work for less than 1 year | 8,044 | 1.83 | 2.56 |
| 5 | A homemaker | 17,477 | 3.98 | 4.94 |
| 6 | A student | 11,111 | 2.53 | 4.80 |
| 7 | Retired | 137,083 | 31.23 | 20.46 |
| 8 | Unable to work | 26,737 | 6.09 | 6.41 |
| 9 | Refused | 5,044 | 1.15 | 1.48 |
| BLANK | Not asked or Missing | 6,196 | . | . |

SEXVAR
Question: Sex of Respondent

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Male. Code=1 if LANDSEX1=1 or CELLSEX1=1 or COLGSEX1=1 | 209,239 | 47.01 | 48.69 |
| 2 | Female. Code=2 if LANDSEX1=2 or CELLSEX1=2 or COLGSEX1=2 | 235,893 | 52.99 | 51.31 |

MARITAL
Question: Are you: (marital status)

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Married | 227,424 | 51.09 | 49.33 |
| 2 | Divorced | 57,516 | 12.92 | 10.20 |
| 3 | Widowed | 48,019 | 10.79 | 7.03 |
| 4 | Separated | 8,702 | 1.95 | 2.36 |
| 5 | Never married | 80,001 | 17.97 | 24.71 |
| 6 | A member of an unmarried couple | 18,668 | 4.19 | 5.20 |
| 9 | Refused | 4,794 | 1.08 | 1.18 |
| BLANK | Not asked or Missing | 8 | . | . |

_EDUCAG
Question: Level of education completed

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Did not graduate High School. Notes: EDUCA = 1 or 2 or 3 | 26,011 | 5.84 | 11.63 |
| 2 | Graduated High School. Notes: EDUCA = 4 | 108,990 | 24.48 | 27.39 |
| 3 | Attended College or Technical School. Notes: EDUCA = 5 | 120,252 | 27.01 | 30.04 |
| 4 | Graduated from College or Technical School. Notes: EDUCA = 6 | 187,496 | 42.12 | 30.34 |
| 9 | Don’t know/Not sure/Missing. Notes: EDUCA = 9 or Missing | 2,383 | 0.54 | 0.60 |

SLEPTIM1
Question: On average, how many hours of sleep do you get in a 24-hour period?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 - 24 | Number of hours [1-24] | 439,679 | 98.78 | 98.57 |
| 77 | Don’t know/Not Sure | 4,792 | 1.08 | 1.23 |
| 99 | Refused | 658 | 0.15 | 0.21 |
| BLANK | Missing | 3 | . | . |

CVDCRHD4
Question: (Ever told) (you had) angina or coronary heart disease?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 26,551 | 5.96 | 4.40 |
| 2 | No | 414,176 | 93.05 | 94.67 |
| 7 | Don’t know/Not sure | 4,044 | 0.91 | 0.84 |
| 9 | Refused | 359 | 0.08 | 0.10 |
| BLANK | Not asked or Missing | 2 | . | . |

PRIMINSR
Question: What is the current primary source of your health insurance?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | A plan purchased through an employer or union (including plans purchased through another person’s employer) | 161,388 | 36.26 | 39.07 |
| 2 | A private nongovernmental plan that you or another family member buys on your own | 36,931 | 8.30 | 9.28 |
| 3 | Medicare | 135,848 | 30.52 | 20.78 |
| 4 | Medigap | 536 | 0.12 | 0.15 |
| 5 | Medicaid | 29,072 | 6.53 | 8.51 |
| 6 | Children’s Health Insurance Program (CHIP) | 188 | 0.04 | 0.06 |
| 7 | Military related health care: TRICARE (CHAMPUS) / VA health care / CHAMP-VA | 15,373 | 3.45 | 3.28 |
| 8 | Indian Health Service | 1,385 | 0.31 | 0.17 |
| 9 | State sponsored health plan | 12,878 | 2.89 | 2.76 |
| 10 | Other government program | 10,630 | 2.39 | 2.70 |
| 88 | No coverage of any type | 23,018 | 5.17 | 8.07 |
| 77 | Don’t know/Not Sure | 9,890 | 2.22 | 3.22 |
| 99 | Refused | 7,991 | 1.80 | 1.95 |
| BLANK | Not asked or Missing | 4 | . | . |

MENTHLTH
Question: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 - 30 | Number of days | 170,836 | 38.38 | 41.49 |
| 88 | None | 265,229 | 59.58 | 56.10 |
| 77 | Don’t know/Not sure | 6,589 | 1.48 | 1.76 |
| 99 | Refused | 2,475 | 0.56 | 0.65 |
| BLANK | Not asked or Missing | 3 | . | . |

CHCKDNY2
Question: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 20,315 | 4.56 | 3.68 |
| 2 | No | 422,891 | 95.00 | 95.87 |
| 7 | Don’t know / Not sure | 1,581 | 0.36 | 0.35 |
| 9 | Refused | 343 | 0.08 | 0.10 |
| BLANK | Not asked or Missing | 2 | . | . |

_TOTINDA
Question: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Had physical activity or exercise. Notes: EXERANY2 = 1 | 337,559 | 75.83 | 75.85 |
| 2 | No physical activity or exercise in last 30 days. Notes: EXERANY2 = 2 | 106,480 | 23.92 | 23.85 |
| 9 | Don’t know/Refused/Missing. Notes: EXERANY2 = 7 or 9 or Missing | 1,093 | 0.25 | 0.29 |

ADDEPEV3
Question: (Ever told) (you had) a depressive disorder (including depression, major depression, dysthymia, or minor depression)?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 91,410 | 20.54 | 20.47 |
| 2 | No | 350,910 | 78.83 | 78.74 |
| 7 | Don’t know/Not sure | 2,140 | 0.48 | 0.62 |
| 9 | Refused | 665 | 0.15 | 0.17 |
| BLANK | Not asked or Missing | 7 | . | . |

RENTHOM1
Question: Do you own or rent your home?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Own | 310,708 | 69.80 | 66.63 |
| 2 | Rent | 108,332 | 24.34 | 25.81 |
| 3 | Other arrangement | 21,463 | 4.82 | 6.11 |
| 7 | Don’t know/Not Sure | 1,099 | 0.25 | 0.49 |
| 9 | Refused | 3,521 | 0.79 | 0.96 |
| BLANK | Not asked or Missing. Notes: Due to the nature of the data or the size of the table for display, this information is not printed for this report | 9 | . | . |

EXERANY2
Question: During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 337,559 | 75.83 | 75.85 |
| 2 | No | 106,480 | 23.92 | 23.85 |
| 7 | Don’t know/Not Sure | 724 | 0.16 | 0.18 |
| 9 | Refused | 367 | 0.08 | 0.11 |
| BLANK | Not asked or Missing | 2 | . | . |

BLIND
Question: Are you blind or do you have serious difficulty seeing, even when wearing glasses?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 23,658 | 5.56 | 5.78 |
| 2 | No | 399,910 | 94.04 | 93.75 |
| 7 | Don’t know/Not Sure | 1,042 | 0.25 | 0.27 |
| 9 | Refused | 667 | 0.16 | 0.20 |
| BLANK | Not asked or Missing | 19,855 | . | . |

DECIDE
Question: Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 50,100 | 11.81 | 13.34 |
| 2 | No | 370,792 | 87.42 | 85.81 |
| 7 | Don’t know/Not Sure | 2,266 | 0.53 | 0.56 |
| 9 | Refused | 988 | 0.23 | 0.29 |
| BLANK | Not asked or Missing | 20,986 | . | . |

_HLTHPLN
Question: Adults who had some form of health insurance

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Have some form of insurance. Notes: PRIMINSR=1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | 404,229 | 90.81 | 86.77 |
| 2 | Do not have some form of health insurance. Notes: PRIMINSR=88 | 23,018 | 5.17 | 8.07 |
| 9 | Don’t know, refused or missing insurance response. Notes: PRIMINSR=77, 99, or missing | 17,885 | 4.02 | 5.16 |

DIABETE4
Question: (Ever told) (you had) diabetes? (If ‘Yes’ and respondent is female, ask ‘Was this only when you were pregnant?’. If Respondent says pre-diabetes or borderline diabetes, use response code 4.)

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 61,158 | 13.74 | 12.04 |
| 2 | Yes, but female told only during pregnancy (Go to Section 08.01 AGE) | 3,836 | 0.86 | 1.01 |
| 3 | No (Go to Section 08.01 AGE) | 368,722 | 82.83 | 84.34 |
| 4 | No, pre-diabetes or borderline diabetes (Go to Section 08.01 AGE) | 10,329 | 2.32 | 2.27 |
| 7 | Don’t know/Not Sure (Go to Section 08.01 AGE) | 763 | 0.17 | 0.23 |
| 9 | Refused (Go to Section 08.01 AGE) | 321 | 0.07 | 0.11 |
| BLANK | Not asked or Missing | 3 | . | . |

_SMOKER3
Question: Four-level smoker status: Everyday smoker, Someday smoker, Former smoker, Non-smoker

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Current smoker - now smokes every day. Notes: SMOKE100 = 1 and SMOKEDAY = 1 | 36,003 | 8.09 | 8.09 |
| 2 | Current smoker - now smokes some days. Notes: SMOKE100 = 1 and SMOKEDAY = 2 | 13,938 | 3.13 | 3.54 |
| 3 | Former smoker. Notes: SMOKE100 = 1 and SMOKEDAY = 3 | 113,774 | 25.56 | 21.87 |
| 4 | Never smoked. Notes: SMOKE100 = 2 | 245,955 | 55.25 | 57.07 |
| 9 | Don’t know/Refused/Missing. Notes: SMOKE100 = 1 and SMOKEDAY = 9 or SMOKE100 = 7 or 9 or Missing | 35,462 | 7.97 | 9.44 |

_DRNKWK2
Question: Calculated total number of alcoholic beverages consumed per week

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 0 | Did not drink. Notes: DROCDY4_=0 or AVEDRNK3=88 | 188,832 | 42.42 | 41.91 |
| 1 - 98999 | Number of drinks per week. Notes: 0 < DROCDY4_ < 990 | 206,595 | 46.41 | 44.78 |
| 99900 | Don’t know/Not sure/Refused/Missing. Notes: AVEDRNK3=.,77,99 or DROCDY4_=900 | 49,705 | 11.17 | 13.32 |

DRNKANY6
Question: Adults who reported having had at least one drink of alcohol in the past 30 days.

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes. Notes: 1 <= ALCDAY4 <= 231 | 210,891 | 47.38 | 46.04 |
| 2 | No. Notes: ALCDAY4=888 | 187,667 | 42.16 | 41.60 |
| 7 | Don’t know/Not Sure. Notes: ALCDAY4=777 | 3,447 | 0.77 | 0.94 |
| 9 | Refused/Missing. Notes: ALCDAY4=999, Missing | 43,127 | 9.69 | 11.43 |

_CURECI2
Question: Adults who are current e-cigarette users

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Not currently using E-cigarettes. Notes: ECIGNOW2=1, 4 | 387,356 | 87.02 | 83.59 |
| 2 | Current E-cigarette user. Notes: ECIGNOW2=2,3 | 22,116 | 4.97 | 6.76 |
| 9 | Don’t know/Refused/Missing. Notes: ECIGNOW2=7,9, or missing | 35,660 | 8.01 | 9.64 |

_RFSMOK3
Question: Adults who are current smokers

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | No. Notes: _SMOKER3 = 3 or 4 | 359,729 | 80.81 | 78.93 |
| 2 | Yes. Notes: _SMOKER3 = 1 or 2 | 49,941 | 11.22 | 11.62 |
| 9 | Don’t know/Refused/Missing. Notes: _SMOKER3 = 9 | 35,462 | 7.97 | 9.44 |

_HADSIGM
Question: Colonoscopy and sigmoidoscopy are exams to check for colon cancer. Have you ever had either of these exams?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 213,158 | 72.82 | 68.17 |
| 2 | No (Go to Section 11.06 COLNCNCR) | 76,372 | 26.09 | 30.53 |
| 7 | Don’t know/Not Sure (Go to Section 11.06 COLNCNCR) | 1,811 | 0.62 | 0.74 |
| 9 | Refused (Go to Section 11.06 COLNCNCR) | 1,378 | 0.47 | 0.55 |
| BLANK | Not asked or Missing. Notes: Section 08.01, AGE, is less than 45 | 152,413 | . | . |

_INCOMG1
Question: Income categories

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Less than $15,000. Notes: INCOME3=1,2 | 21,372 | 4.80 | 5.17 |
| 2 | $15,000 to < $25,000. Notes: INCOME3=3,4 | 34,643 | 7.78 | 7.83 |
| 3 | $25,000 to < $35,000. Notes: INCOME3=5 | 42,294 | 9.50 | 9.54 |
| 4 | $35,000 to < $50,000. Notes: INCOME3=6 | 46,831 | 10.52 | 9.81 |
| 5 | $50,000 to < $100,000. Notes: INCOME3=7,8 | 107,584 | 24.17 | 21.96 |
| 6 | $100,000 to < $200,000. Notes: INCOME3=9,10 | 72,883 | 16.37 | 15.95 |
| 7 | $200,000 or more. Notes: INCOME3=11 | 23,478 | 5.27 | 5.89 |
| 9 | Don’t know/Not sure/Missing. Notes: INCOME3=77, 99, or missing | 96,047 | 21.58 | 23.84 |

_CHLDCNT
Question: Number of children in household

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | No children in household. Notes: CHILDREN = 88 | 321,907 | 72.32 | 64.10 |
| 2 | One child in household. Notes: CHILDREN = 01 | 46,241 | 10.39 | 13.23 |
| 3 | Two children in household. Notes: CHILDREN = 02 | 37,923 | 8.52 | 10.83 |
| 4 | Three children in household. Notes: CHILDREN = 03 | 15,975 | 3.59 | 4.78 |
| 5 | Four children in household. Notes: CHILDREN = 04 | 5,521 | 1.24 | 1.66 |
| 6 | Five or more children in household. Notes: 05 <= CHILDREN < 88 | 3,100 | 0.70 | 0.97 |
| 9 | Don’t know/Not sure/Missing. Notes: CHILDREN = 99 | 14,464 | 3.25 | 4.43 |
| BLANK | | 1 | . | . |

_BMI5 Question: Body Mass Index (BMI)

WTKG3 Question: Reported weight in kilograms

HTM4 Question: Reported height in meters

_AGE80
Question: Imputed Age value collapsed above 80

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 18 - 24 | Imputed Age 18 to 24 | 26,943 | 6.05 | 11.90 |
| 25 - 29 | Imputed Age 25 to 29 | 22,000 | 4.94 | 7.73 |
| 30 - 34 | Imputed Age 30 to 34 | 25,840 | 5.81 | 9.41 |
| 35 - 39 | Imputed Age 35 to 39 | 28,771 | 6.46 | 7.79 |
| 40 - 44 | Imputed Age 40 to 44 | 30,403 | 6.83 | 8.68 |
| 45 - 49 | Imputed Age 45 to 49 | 29,580 | 6.65 | 6.86 |
| 50 - 54 | Imputed Age 50 to 54 | 37,404 | 8.40 | 8.54 |
| 55 - 59 | Imputed Age 55 to 59 | 38,059 | 8.55 | 7.44 |
| 60 - 64 | Imputed Age 60 to 64 | 44,681 | 10.04 | 8.71 |
| 65 - 69 | Imputed Age 65 to 69 | 47,642 | 10.70 | 7.07 |
| 70 - 74 | Imputed Age 70 to 74 | 44,940 | 10.10 | 6.53 |
| 75 - 79 | Imputed Age 75 to 79 | 32,616 | 7.33 | 4.40 |
| 80 - 99 | Imputed Age 80 or older | 36,253 | 8.14 | 4.94 |

_RACEPR1
Question: Computed race groups used for internet prevalence tables

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | White only, non-Hispanic. Notes: _RACE=1 or _RACE=9 and _IMPRACE=1 | 333,514 | 74.92 | 59.20 |
| 2 | Black only, non-Hispanic. Notes: _RACE=2 or _RACE=9 and _IMPRACE=2 | 35,876 | 8.06 | 11.62 |
| 3 | American Indian or Alaskan Native only, Non-Hispanic. Notes: _RACE=3 or _RACE=9 and _IMPRACE=4 | 7,120 | 1.60 | 1.21 |
| 4 | Asian only, non-Hispanic. Notes: _RACE=4 or _RACE=9 and _IMPRACE=3 | 13,487 | 3.03 | 6.11 |
| 5 | Native Hawaiian or other Pacific Islander only, Non-Hispanic. Notes: _RACE=5 | 2,414 | 0.54 | 0.48 |
| 6 | Multiracial, non-Hispanic. Notes: _RACE=6 | 9,744 | 2.19 | 3.12 |
| 7 | Hispanic. Notes: _RACE=7 or _RACE=9 and _IMPRACE==5 | 42,977 | 9.65 | 18.25 |

_DRDXAR2
Question: Respondents who have had a doctor diagnose them as having some form of arthritis

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Diagnosed with arthritis. Notes: HAVARTH4 = 1 | 151,148 | 34.16 | 26.64 |
| 2 | Not diagnosed with arthritis. Notes: HAVARTH4 = 2 | 291,351 | 65.84 | 73.36 |
| BLANK | Don’t know/Not Sure/Refused/Missing. Notes: HAVARTH4 = 7 or 9 or Missing | 2,633 | . | . |

ASTHMA3
Question: (Ever told) (you had) asthma?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 66,694 | 14.98 | 15.17 |
| 2 | No (Go to Section 07.06 CHCSCNC1) | 376,665 | 84.62 | 84.34 |
| 7 | Don’t know/Not Sure (Go to Section 07.06 CHCSCNC1) | 1,494 | 0.34 | 0.42 |
| 9 | Refused (Go to Section 07.06 CHCSCNC1) | 277 | 0.06 | 0.08 |
| BLANK | Not asked or Missing | 2 | . | . |

_DENVST3
Question: Adults who have visited a dentist, dental hygienist or dental clinic within the past year

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes. Notes: LASTDEN4=1 | 292,408 | 65.69 | 62.66 |
| 2 | No. Notes: LASTDEN4=2 or 3 or 4 | 145,703 | 32.73 | 35.42 |
| 9 | Don’t know/Not Sure Or Refused/Missing. Notes: LASTDEN4=7 or 9 or Missing | 7,017 | 1.58 | 1.93 |
| BLANK | Missing | 4 | . | . |

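SDHISOLT
Question: How often do you feel socially isolated from others? Is it…

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Always | 8,098 | 3.19 | 4.06 |
| 2 | Usually | 13,178 | 5.19 | 5.63 |
| 3 | Sometimes | 53,072 | 20.91 | 21.62 |
| 4 | Rarely | 70,617 | 27.82 | 26.18 |
| 5 | Never | 106,160 | 41.83 | 41.21 |
| 7 | Don’t know/Not Sure | 1,696 | 0.67 | 0.79 |
| 9 | Refused | 969 | 0.38 | 0.50 |
| BLANK | Not asked or Missing | 191,342 | . | . |

LSATISFY
Question: In general, how satisfied are you with your life?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Very satisfied | 114,252 | 44.89 | 42.07 |
| 2 | Satisfied | 123,445 | 48.51 | 50.46 |
| 3 | Dissatisfied | 10,758 | 4.23 | 4.67 |
| 4 | Very dissatisfied | 3,062 | 1.20 | 1.38 |
| 7 | Don’t know/Not sure | 1,864 | 0.73 | 0.90 |
| 9 | Refused | 1,107 | 0.43 | 0.51 |
| BLANK | Not asked or Missing | 190,644 | . | . |

DIFFWALK
Question: Do you have serious difficulty walking or climbing stairs?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 68,081 | 16.10 | 13.75 |
| 2 | No | 353,039 | 83.47 | 85.78 |
| 7 | Don’t know/Not Sure | 1,221 | 0.29 | 0.28 |
| 9 | Refused | 636 | 0.15 | 0.19 |
| BLANK | Not asked or Missing | 22,155 | . | . |

DIFFDRES
Question: Do you have difficulty dressing or bathing?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 16,813 | 3.98 | 3.85 |
| 2 | No | 404,404 | 95.77 | 95.81 |
| 7 | Don’t know/Not Sure | 488 | 0.12 | 0.15 |
| 9 | Refused | 548 | 0.13 | 0.19 |
| BLANK | Not asked or Missing | 22,879 | . | . |

DEAF
Question: Are you deaf or do you have serious difficulty hearing?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 38,946 | 9.13 | 7.06 |
| 2 | No | 385,539 | 90.40 | 92.44 |
| 7 | Don’t know/Not Sure | 1,246 | 0.29 | 0.27 |
| 9 | Refused | 757 | 0.18 | 0.23 |
| BLANK | Not asked or Missing | 18,644 | . | . |

PHYSHLTH
Question: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 - 30 | Number of days | 166,386 | 37.38 | 36.75 |
| 88 | None | 267,819 | 60.17 | 60.54 |
| 77 | Don’t know/Not sure | 8,875 | 1.99 | 2.21 |
| 99 | Refused | 2,047 | 0.46 | 0.50 |
| BLANK | Not asked or Missing | 5 | . | . |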
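MENTHLTH and PHYSHLTH share the same special codes: 88 means "None" (zero days), while 77 and 99 mean "Don’t know" and "Refused". A minimal sketch of one way to recode such a day-count column is shown below. The helper name `recode_days` is hypothetical, and setting 77/99 to NA is one option among several (filtering those rows out is another).

```r
# Hypothetical helper: recode a BRFSS "number of days" variable.
recode_days <- function(x) {
  x <- ifelse(x == 88, 0, x)    # 88 = "None", i.e. zero days
  x[x %in% c(77, 99)] <- NA     # 77 = don't know, 99 = refused
  x
}

days_raw <- c(5, 88, 77, 30, 99, NA)  # made-up responses
days_clean <- recode_days(days_raw)
print(days_clean)
```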

CDASSIST
Question: As a result of confusion or memory loss, how often do you need assistance with these day-to-day activities?

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Always | 304 | 4.09 | 4.57 |
| 2 | Usually | 281 | 3.78 | 4.90 |
| 3 | Sometimes | 1,354 | 18.22 | 21.00 |
| 4 | Rarely (Go to Module 13.05 CDSOCIAL) | 1,447 | 19.47 | 19.25 |
| 5 | Never (Go to Module 13.05 CDSOCIAL) | 3,954 | 53.21 | 49.12 |
| 7 | Don’t know/Not sure (Go to Module 13.05 CDSOCIAL) | 78 | 1.05 | 1.04 |
| 9 | Refused (Go to Module 13.05 CDSOCIAL) | 13 | 0.17 | 0.13 |
| BLANK | Not asked or Missing. Notes: Section 08.01, AGE, is less than 45; or Module 13.01, CIMEMLOS, is coded 2 or 9 | 437,701 | . | . |

CVDSTRK3
Question: (Ever told) (you had) a stroke.

| Value | Label | Frequency | Percentage | Weighted percentage |
|---|---|---|---|---|
| 1 | Yes | 19,239 | 4.32 | 3.56 |
| 2 | No | 424,336 | 95.33 | 96.01 |
| 7 | Don’t know/Not sure | 1,274 | 0.29 | 0.35 |
| 9 | Refused | 281 | 0.06 | 0.08 |
| BLANK | Not asked or Missing | 2 | . | . |

# Helper: set the listed "don't know"/"refused" codes to NA, then convert to a factor
na_factor <- function(x, na_codes = c(7, 9)) {
  factor(replace(x, x %in% na_codes, NA))
}

data <-
  df %>% 
  dplyr::select("DIABETE4", # response variable
                # personal health
                "_BMI5CAT", "_BMI5", #bmi cat, bmi numeric
                "_SMOKER3", "CVDSTRK3", "CVDCRHD4", #smoke, stroke, heart disease
                "_CURECI2", # e-cig
                # age, income cat, employ, gender, marital, education, home (rent/own)
                "_AGEG5YR", "INCOME3", "EMPLOY1", "SEXVAR", "MARITAL", "_EDUCAG", "RENTHOM1",
                # number children, age numeric, race
                "_CHLDCNT", "_AGE80", "_RACEPR1",
                #self assessment
                "GENHLTH", "PRIMINSR", "MENTHLTH", "BLIND", "DECIDE", "_HLTHPLN", "WTKG3", "HTM4", 
                "DIFFWALK", "DIFFDRES", "DEAF", "PHYSHLTH",
                "SLEPTIM1", "_TOTINDA", "EXERANY2",  "_DRNKWK2", "DRNKANY6", 
                "CHECKUP1", "FLUSHOT7", "CHCKDNY2", "ADDEPEV3",
                "_DRDXAR2", "ASTHMA3", "_DENVST3") %>% 
  janitor::clean_names() %>% 
  mutate(# codes 2 (pregnancy only) and 4 (pre-diabetes) are coded "no" here;
         # 7, 9 and blanks fall through to NA
         diabete4 = as.factor(case_when(diabete4 == 1 ~ "yes",
                                        diabete4 %in% c(2, 3, 4) ~ "no")),
         bmi5cat = factor(bmi5cat),
         bmi5 = as.numeric(bmi5 / 100), # 2 implied decimal places
         smoker3 = as.factor(case_when(smoker3 %in% c(1, 2) ~ "smoker",
                                       smoker3 == 3 ~ "former smoker",
                                       smoker3 == 4 ~ "non-smoker")),
         cvdstrk3 = na_factor(cvdstrk3),
         cvdcrhd4 = na_factor(cvdcrhd4),
         cureci2 = na_factor(cureci2, 9),
         ageg5yr = as.numeric(replace(ageg5yr, ageg5yr %in% 14, NA)), # 14 = don't know/refused/missing
         income3 = na_factor(income3, c(77, 99)),
         employ1 = na_factor(employ1, 9),
         sexvar = as.factor(sexvar),
         marital = na_factor(marital, 9),
         educag = na_factor(educag, 9),
         renthom1 = na_factor(renthom1),
         chldcnt = as.factor(case_when(chldcnt == 1 ~ "0",
                                       chldcnt == 2 ~ "1",
                                       chldcnt == 3 ~ "2",
                                       chldcnt == 4 ~ "3",
                                       chldcnt == 5 ~ "4",
                                       chldcnt == 6 ~ "5 or more")), # 9 falls through to NA
         age80 = as.numeric(age80),
         racepr1 = as.factor(racepr1),
         genhlth = na_factor(genhlth), # 7 is also "don't know" in the codebook
         priminsr = na_factor(replace(priminsr, priminsr %in% 88, 11), # 88 = no coverage, recoded as 11
                              c(77, 99)),
         menthlth = as.numeric(ifelse(menthlth == 88, 0, menthlth)), # filter out 77 and 99 later
         blind = na_factor(blind),
         decide = na_factor(decide),
         hlthpln = na_factor(hlthpln, 9),
         wtkg3 = as.numeric(wtkg3 / 100),
         htm4 = as.numeric(htm4 / 100),
         diffwalk = na_factor(diffwalk),
         diffdres = na_factor(diffdres),
         deaf = na_factor(deaf),
         physhlth = as.numeric(ifelse(physhlth == 88, 0, physhlth)), # filter out 77 and 99 later
         sleptim1 = as.numeric(sleptim1), # filter out 77 and 99 later
         totinda = na_factor(totinda, 9),
         exerany2 = na_factor(exerany2), # 7 is also "don't know" in the codebook
         drnkwk2 = as.numeric(replace(drnkwk2, drnkwk2 %in% 99900, NA)),
         drnkany6 = na_factor(drnkany6),
         checkup1 = na_factor(checkup1, c(7, 8, 9)), # 8 = never
         flushot7 = na_factor(flushot7),
         chckdny2 = na_factor(chckdny2),
         addepev3 = na_factor(addepev3),
         drdxar2 = as.factor(drdxar2),
         asthma3 = na_factor(asthma3),
         denvst3 = na_factor(denvst3, 9))

# keep respondents aged 30 or older (type 2 diabetes definition) and
# drop the "don't know"/"refused" codes (77, 99) of the numeric variables
data <-
  data %>% 
  filter(ageg5yr > 2 & age80 >= 30 & menthlth < 77 & physhlth < 77 & sleptim1 < 77) %>% 
  mutate(ageg5yr = as.factor(ageg5yr))

#write_csv(data, "diabetes_cleaned_data.csv")
# check pairwise correlations between the numeric variables
data <- read_csv("diabetes_cleaned_data.csv")
data %>% 
  select(where(is.numeric)) %>% 
  as.matrix() %>% 
  Hmisc::rcorr() %>% 
  broom::tidy()
# Column names: column1, column2, estimate, n, p.value
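To show what a tidy pairwise-correlation table with columns such as column1, column2 and estimate looks like, here is a minimal base R sketch on simulated data. The toy data frame and the object names (`toy`, `cm`, `pairs`) are made up for illustration. It uses `cor()` rather than `Hmisc::rcorr()`, so it returns no `n` or `p.value` columns.

```r
set.seed(1)

# simulated numeric data: `c` is built to be strongly correlated with `a`
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$c <- toy$a * 2 + rnorm(50, sd = 0.1)

# correlation matrix, then one row per unique pair (lower triangle only)
cm <- cor(toy, use = "pairwise.complete.obs")
idx <- which(lower.tri(cm), arr.ind = TRUE)
pairs <- data.frame(column1  = rownames(cm)[idx[, "row"]],
                    column2  = colnames(cm)[idx[, "col"]],
                    estimate = cm[idx])

# strongest correlations first
pairs <- pairs[order(-abs(pairs$estimate)), ]
print(pairs)
```

Sorting by absolute correlation makes it easy to spot near-duplicate predictors (for example, a BMI value and the weight it is computed from).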

# split data
data_split <-
  data %>% 
  dplyr::sample_frac(size = 0.05, replace = FALSE) %>% # use 5% of the data due to lack of computing power
  initial_split(strata = diabete4) # stratify by diabete4
data_train <-
  data_split %>% 
  training()
data_test <-
  data_split %>% 
  testing()
data_fold <-
  data_train %>% 
  vfold_cv(v = 10, strata = diabete4)
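Stratifying by diabete4 keeps the share of "yes" cases roughly equal across the training set, the test set and every fold, which matters because positives are a small minority. The base R sketch below illustrates the idea on a made-up outcome vector. It is not how rsample implements `strata`, and the 14/86 class split and 5 folds are assumptions chosen only for the example.

```r
set.seed(2022)

# toy outcome: 14 "yes" and 86 "no" (made-up class balance)
y <- factor(c(rep("yes", 14), rep("no", 86)))

# assign fold ids separately within each class so that every fold
# receives a similar share of both classes
n_folds <- 5
fold_id <- integer(length(y))
for (cls in levels(y)) {
  idx <- which(y == cls)
  fold_id[idx] <- sample(rep_len(seq_len(n_folds), length(idx)))
}

fold_table <- table(fold = fold_id, outcome = y)
print(fold_table)
```

Each fold ends up with 2 or 3 "yes" cases, whereas purely random assignment could leave a fold with very few positives.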
# split data (full data set)
data_split_big <-
  data %>% 
  initial_split(strata = diabete4) # stratify by diabete4
data_train_big <-
  data_split_big %>% 
  training()
data_test_big <-
  data_split_big %>% 
  testing()
data_fold_big <-
  data_train_big %>% 
  vfold_cv(v = 10, strata = diabete4)
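
# recipes
# NOTE: the preprocessing steps of these recipes are not shown in the source;
# every step_*() call below is an ASSUMPTION inferred from the recipe names
base_rec <-
  recipes::recipe(formula = diabete4 ~ .,
                  data = data_train)

dummy_rec <-
  base_rec %>% 
  step_dummy(all_nominal_predictors()) # ASSUMPTION: dummy-encode nominal predictors

normal_rec <-
  dummy_rec %>% 
  step_normalize(all_numeric_predictors()) # ASSUMPTION: center and scale numeric predictors

log_rec <-
  base_rec %>% 
  step_log(all_numeric_predictors(), offset = 1) # ASSUMPTION: log-transform; offset keeps zeros finite
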
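The recipe names dummy_rec and normal_rec suggest dummy encoding of categorical predictors and normalization of numeric ones. As a minimal sketch of what those two transformations do, here is a base R illustration on made-up values. It uses `model.matrix()` and `scale()` rather than the recipes package, and the toy columns are assumptions for the example only.

```r
# toy predictors (made-up values)
toy <- data.frame(
  smoker3 = factor(c("smoker", "non-smoker", "former smoker", "non-smoker")),
  bmi5    = c(31.2, 24.8, 27.5, 22.1)
)

# dummy encoding: one 0/1 column per factor level except the reference level
dummies <- model.matrix(~ smoker3, data = toy)[, -1, drop = FALSE]

# normalization: subtract the mean and divide by the standard deviation
bmi_scaled <- as.numeric(scale(toy$bmi5))

print(cbind(dummies, bmi_scaled))
```

Distance-based and penalized models such as KNN and LASSO are sensitive to predictor scale, which is presumably why separate preprocessing variants are kept for different model families.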
# random forest
rf_spec <-
  rand_forest(trees = 1000L) %>% 
  set_engine("ranger",
             importance = "permutation") %>% 
  set_mode("classification")

rf_spec_for_tuning <-
  rf_spec %>% 
  set_args(mtry = tune(),
           min_n = tune())

# Classification Tree Model
ct_spec <- 
  decision_tree() %>%
  set_engine(engine = 'rpart') %>%
  set_mode("classification")

ct_spec_for_tuning <-
  ct_spec %>% 
  set_args(tree_depth = tune(),
           min_n = tune(), 
           cost_complexity = tune())

# knn
knn_spec <-
  nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_spec_for_tuning <-
  knn_spec %>% 
  set_args(neighbors = tune(),
           weight_func = tune(),
           dist_power = tune())

# xgboost
xgb_spec <-
  boost_tree(trees = 1000L) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

xgb_spec_for_tuning <-
  xgb_spec %>% 
  set_args(tree_depth = tune(),
           min_n = tune(),
           loss_reduction = tune(),
           sample_size = tune(),
           mtry = tune(),
           learn_rate = tune())

# naive bayes
naive_spec <-
  naive_Bayes() %>%
  set_engine("naivebayes",
             usepoisson = TRUE) %>%
  set_mode("classification")

naive_spec_for_tuning <-
  naive_spec %>% 
  set_args(smoothness = tune(),
           Laplace = tune())

# Logistic Regression Model
logistic_spec <- 
  logistic_reg() %>%
  set_engine(engine = 'glm') %>%
  set_mode("classification")

# Lasso Logistic Regression Model
logistic_lasso_spec <-
  logistic_reg(mixture = 1, penalty = 1) %>% 
  set_engine(engine = 'glmnet') %>%
  set_mode("classification")

logistic_lasso_spec_for_tuning <- 
  logistic_lasso_spec %>% 
  set_args(penalty = tune()) # tune the penalty instead of fixing it at 1

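base_set <- #works
  workflow_set(
    list(base_rec, dummy_rec, log_rec), #preprocessor
    list(rf_spec, ct_spec,
         rf_spec_for_tuning, ct_spec_for_tuning), #model
    cross = TRUE) #default is cross = TRUE

dummy_set <- #works
  workflow_set(
    list(dummy_rec), # ASSUMPTION: preprocessor list not shown in the source; inferred from the set name
    list(knn_spec, xgb_spec, logistic_spec,
         knn_spec_for_tuning, xgb_spec_for_tuning),
    cross = TRUE)

normal_set <-
  workflow_set(
    list(normal_rec), # ASSUMPTION: inferred from the set name
    list(logistic_lasso_spec, logistic_lasso_spec_for_tuning), # ASSUMPTION: the lasso specs are not used in any other set
    cross = TRUE)

naive_set <- #works
  workflow_set(
    list(base_rec, log_rec),
    list(naive_spec, naive_spec_for_tuning), # ASSUMPTION: model list not shown in the source; inferred from the set name
    cross = TRUE)

model_set <-
  bind_rows(base_set, dummy_set, normal_set, naive_set)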