25  Linear Regression

25.1 Acquiring the data

We begin by reading the data. Here we use the data from the Carotid Intima dataset

Code
dat <- 
  dget("C:/Dataset/cint_data_clean")%>%
  select(cca_0, sex, ageyrs, resid, hba1c, tobacco, alcohol, bmi, 
         whratio, totchol, ldl, hdl, trig, sbp, dbp)  %>%
  filter(!is.na(totchol), !is.na(ldl)) %>% 
  mutate(trig = if_else(trig > 40, trig/10, trig, missing = NULL))

Next, we summarize the data

Code
dat %>% 
    summarytools::dfSummary(graph.col = F)
Data Frame Summary  
dat  
Dimensions: 702 x 15  
Duplicates: 4  

-------------------------------------------------------------------------------------
No   Variable    Stats / Values            Freqs (% of Valid)    Valid      Missing  
---- ----------- ------------------------- --------------------- ---------- ---------
1    cca_0       Mean (sd) : 0.9 (0.2)     40 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 0.5 < 0.8 < 1.6                                                     
                 IQR (CV) : 0.2 (0.2)                                                

2    sex         1. Female                 545 (77.6%)           702        0        
     [factor]    2. Male                   157 (22.4%)           (100.0%)   (0.0%)   

3    ageyrs      Mean (sd) : 44.3 (9)      46 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 26 < 43 < 75                                                        
                 IQR (CV) : 12 (0.2)                                                 

4    resid       1. Rural                   24 ( 3.4%)           702        0        
     [factor]    2. Periurban              174 (24.8%)           (100.0%)   (0.0%)   
                 3. Urban                  504 (71.8%)                               

5    hba1c       Mean (sd) : 5.4 (1)       56 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 3.1 < 5.3 < 15.5                                                    
                 IQR (CV) : 0.8 (0.2)                                                

6    tobacco     1. No                     657 (93.6%)           702        0        
     [factor]    2. Yes                     45 ( 6.4%)           (100.0%)   (0.0%)   

7    alcohol     1. No                     458 (65.2%)           702        0        
     [factor]    2. Yes                    244 (34.8%)           (100.0%)   (0.0%)   

8    bmi         Mean (sd) : 26.6 (5.6)    582 distinct values   702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 13.3 < 26.1 < 45.3                                                  
                 IQR (CV) : 7.7 (0.2)                                                

9    whratio     Mean (sd) : 0.9 (0.1)     456 distinct values   702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 0.6 < 0.9 < 1.2                                                     
                 IQR (CV) : 0.1 (0.1)                                                

10   totchol     Mean (sd) : 5.1 (1.3)     73 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 1.1 < 5 < 11.1                                                      
                 IQR (CV) : 1.7 (0.3)                                                

11   ldl         Mean (sd) : 3.2 (1.1)     59 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 0 < 3.1 < 8.1                                                       
                 IQR (CV) : 1.4 (0.3)                                                

12   hdl         Mean (sd) : 1.4 (0.5)     31 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 0.3 < 1.3 < 4.4                                                     
                 IQR (CV) : 0.6 (0.4)                                                

13   trig        Mean (sd) : 1.4 (0.9)     227 distinct values   702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 0.3 < 1.2 < 10.7                                                    
                 IQR (CV) : 0.8 (0.7)                                                

14   sbp         Mean (sd) : 125 (23.9)    118 distinct values   702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 66 < 121 < 231                                                      
                 IQR (CV) : 28 (0.2)                                                 

15   dbp         Mean (sd) : 79.5 (14.1)   71 distinct values    702        0        
     [numeric]   min < med < max:                                (100.0%)   (0.0%)   
                 43 < 79 < 135                                                       
                 IQR (CV) : 19 (0.2)                                                 
-------------------------------------------------------------------------------------

25.2 Linear regression with a single continuous variable

We begin by looking at the relationship between the dependent and independent variables

Code
dat %>% 
  ggplot(aes(x = ageyrs, y = cca_0)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x) +
  labs(
      x = "Age in years", 
      y = "Common Carotid Intima thickness",
      title = "Relationship between CCA and Age of patient") +
  theme_classic()

Since the relationship between the two looks linear we will go on to fit the model

Code
lm.1 <- 
    lm(cca_0 ~ ageyrs, data = dat)

Next w,e summarise the model, extract the coefficients and confidence intervals with the help of the flextable package.

Code
lm.1 %>% flextable::as_flextable()

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

0.628

0.029

21.988

0.0000

***

ageyrs

0.005

0.001

8.381

0.0000

***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.1511 on 700 degrees of freedom

Multiple R-squared: 0.0912, Adjusted R-squared: 0.0899

F-statistic: 70.25 on 700 and 1 DF, p-value: 0.0000

Next, we extract some regression analysis stats required for the regression diagnostics

Code
tibble(
    resid = residuals(lm.1), 
    fits = fitted(lm.1), 
    st.resid = rstandard(lm.1),
    cookd = cooks.distance(lm.1),
    covr = covratio(lm.1),
    hatv = hatvalues(lm.1),
    dfit = dffits(lm.1),
    dfbeta1 = dfbeta(lm.1)[,2]) %>% 
  arrange(desc(cookd)) %>% 
  round(5) %>% 
  slice(1:10) %>% 
    kableExtra::kable()
resid fits st.resid cookd covr hatv dfit dfbeta1
0.40242 0.97258 2.67540 0.03210 0.99127 0.00889 0.25449 0.00015
0.52507 0.92493 3.48188 0.02314 0.97212 0.00380 0.21685 0.00011
-0.25435 1.00435 -1.69521 0.02018 1.00862 0.01385 -0.20119 -0.00012
0.71213 0.88787 4.71760 0.02012 0.94181 0.00180 0.20372 0.00006
0.53625 0.81375 3.55451 0.01868 0.96985 0.00295 0.19491 -0.00009
0.34330 0.95670 2.28006 0.01800 0.99487 0.00688 0.19033 0.00011
0.19948 1.02552 1.33222 0.01614 1.01593 0.01786 0.17975 0.00011
-0.24376 0.99376 -1.62316 0.01608 1.00748 0.01206 -0.17953 -0.00011
0.35389 0.94611 2.34901 0.01585 0.99279 0.00571 0.17864 0.00010
-0.25287 0.97787 -1.68180 0.01375 1.00445 0.00963 -0.16604 -0.00010

And then plot the regression diagnostic graphs

Code
opar <- par(mfrow = c(2,2))
plot(lm.1, pch = 18, cex=.5)

Code
par(opar)

Below we formally check the model assumption with the performance package. First the normality of the residuals

Code
performance::check_predictions(lm.1)
Warning: Maximum value of original data is not included in the
  replicated data.
  Model may not capture the variation of the data.

Code
performance::check_normality(lm.1)
Warning: Non-normality of residuals detected (p < .001).
Code
performance::check_normality(lm.1) %>% plot()
For confidence bands, please install `qqplotr`.

Next heteroscedasticity

Code
performance::check_heteroscedasticity(lm.1)
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.018).
Code
performance::check_heteroscedasticity(lm.1) %>% 
    plot()

Code
performance::check_distribution(lm.1)
# Distribution of Model Family

Predicted Distribution of Residuals

 Distribution Probability
       normal         34%
         beta         19%
       cauchy         16%

Predicted Distribution of Response

  Distribution Probability
 inverse-gamma         28%
         gamma         22%
          beta         16%
Code
performance::check_distribution(lm.1) %>% 
    plot()

25.2.1 Linear regression with one continuous and one categorical predictor

Code
lm.2 <- 
    lm(cca_0 ~ ageyrs + sex, data = dat)

lm.2 %>% 
    summary() 

Call:
lm(formula = cca_0 ~ ageyrs + sex, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35690 -0.11033 -0.01512  0.08711  0.71810 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.6301958  0.0285622  22.064  < 2e-16 ***
ageyrs      0.0051368  0.0006377   8.056 3.42e-15 ***
sexMale     0.0234025  0.0138148   1.694   0.0907 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1509 on 699 degrees of freedom
Multiple R-squared:  0.09491,   Adjusted R-squared:  0.09233 
F-statistic: 36.65 on 2 and 699 DF,  p-value: 7.294e-16

And then plot the diagnostics

Code
opar <- par(mfrow = c(2,2))
lm.2 %>% 
    plot(pch = 18, cex=.5)

Code
par(opar)

Further, we plot the marginal plot for the various sexes

Code
dat %>% 
  ggplot(aes(x = ageyrs, y = cca_0, col = sex)) +
  geom_smooth(formula = y~x, method = "lm") +
  geom_point() +
  labs(
      x = "Age in years", 
      y = "CCA", 
      title = "CCA vrs Age for each sex") +
  theme_bw()

Linear regression with one continuous and one categorical predictor with interaction

Code
lm.3 <- 
    lm(cca_0 ~ ageyrs*sex, data = dat)

lm.3 %>% 
    summary()

Call:
lm(formula = cca_0 ~ ageyrs * sex, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35711 -0.10997 -0.01482  0.08664  0.71789 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.6284770  0.0330494  19.016  < 2e-16 ***
ageyrs          0.0051762  0.0007429   6.968 7.45e-12 ***
sexMale         0.0303114  0.0681101   0.445    0.656    
ageyrs:sexMale -0.0001503  0.0014510  -0.104    0.918    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.151 on 698 degrees of freedom
Multiple R-squared:  0.09493,   Adjusted R-squared:  0.09104 
F-statistic:  24.4 on 3 and 698 DF,  p-value: 5.025e-15

And then plot the diagnostics

Code
opar <- par(mfrow = c(2,2))
lm.3 %>% 
    plot(pch = 18, cex=.5)

Code
par(opar)

Next, we make a plot of the marginal effects after

Code
pr.3 <- 
    ggeffects::ggpredict(lm.3, terms = c("ageyrs", "sex"))

pr.3 %>% 
  plot() +
  labs(
      x = "Age in years", 
      y = "CCA thickness", 
      title = "Marginal relationship")

Linear regression with one continuous and two categorical predictors with interaction

Code
lm.4 <- 
    lm(cca_0 ~ ageyrs*sex*resid, data = dat)

lm.4 %>% 
    summary()

Call:
lm(formula = cca_0 ~ ageyrs * sex * resid, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36621 -0.10171 -0.01644  0.08202  0.70879 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)   
(Intercept)                    0.434224   0.155773   2.788  0.00546 **
ageyrs                         0.010017   0.003277   3.057  0.00232 **
sexMale                       -1.476681   1.361101  -1.085  0.27834   
residPeriurban                 0.171485   0.167778   1.022  0.30710   
residUrban                     0.213439   0.160840   1.327  0.18494   
ageyrs:sexMale                 0.034164   0.031289   1.092  0.27527   
ageyrs:residPeriurban         -0.005066   0.003557  -1.424  0.15476   
ageyrs:residUrban             -0.005046   0.003400  -1.484  0.13821   
sexMale:residPeriurban         1.661961   1.367568   1.215  0.22468   
sexMale:residUrban             1.452145   1.363439   1.065  0.28722   
ageyrs:sexMale:residPeriurban -0.035862   0.031413  -1.142  0.25401   
ageyrs:sexMale:residUrban     -0.033653   0.031336  -1.074  0.28322   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1499 on 690 degrees of freedom
Multiple R-squared:  0.1184,    Adjusted R-squared:  0.1044 
F-statistic: 8.427 on 11 and 690 DF,  p-value: 5.047e-14

And then plot the diagnostics

Code
opar <- par(mfrow = c(2,2))
lm.4 %>% 
    plot(pch = 18, cex=.5)

Code
par(opar)

Next, we make a plot of the marginal effects after

Code
pr.4 <- ggeffects::ggpredict(lm.4, terms = c("ageyrs", "sex", "resid"))
pr.4 %>% 
  plot() +
  labs(x = "Age in years", y = "CCA thickness", title = "Marginal relationship")

There is a set of criteria for evaluating linear regression equations. These are:

  1. There must be a linear relationship between the predictor and dependent variables
  2. The residuals must be independent
  3. Homoscedaticity: The variance of the residuals must be uniform at every level of x
  4. The residual must be normally distributed
Code
corr.mat <- 
  dat %>% 
  select(cca_0, ageyrs, whratio, totchol, ldl, sbp, dbp) %>% 
  cor() %>% 
  round(4)

corr.mat %>% 
  kableExtra::kbl(align = c(rep("l", 7)), caption = "Correlation matrix") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "bordered", "condensed", "hover", "responsive"), 
                            full_width= FALSE, html_font = "san_serif", font_size = 16)
Correlation matrix
cca_0 ageyrs whratio totchol ldl sbp dbp
cca_0 1.0000 0.3020 0.1398 0.1387 0.1131 0.1963 0.1538
ageyrs 0.3020 1.0000 0.2962 0.2196 0.1563 0.2755 0.1782
whratio 0.1398 0.2962 1.0000 0.1521 0.1247 0.1681 0.1509
totchol 0.1387 0.2196 0.1521 1.0000 0.9086 0.3198 0.2647
ldl 0.1131 0.1563 0.1247 0.9086 1.0000 0.2553 0.2069
sbp 0.1963 0.2755 0.1681 0.3198 0.2553 1.0000 0.7936
dbp 0.1538 0.1782 0.1509 0.2647 0.2069 0.7936 1.0000

And then graphically present the correlation matrix

Code
dat %>% 
  select(cca_0, ageyrs, whratio, totchol, ldl, sbp, dbp) %>% 
  pairs(pch=".")

Code
corrplot::corrplot(corr.mat)

Both the correlation matrix and the diagram indicate a high correlation between

  1. sbp and dbp
  2. totchol, ldl and hdl

These could potentially be indicative of multicollinearity

25.3 Fitting the model

We then fit the linear model and summarize it

Code
lm1 <- lm(cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio, data = dat)
summary(lm1)

Call:
lm(formula = cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio, 
    data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33971 -0.10555 -0.01751  0.08800  0.68657 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.4697581  0.0744135   6.313 4.88e-10 ***
sbp         0.0006056  0.0004044   1.498    0.135    
dbp         0.0002161  0.0006659   0.325    0.746    
ldl         0.0021276  0.0129757   0.164    0.870    
ageyrs      0.0044292  0.0006872   6.446 2.15e-10 ***
totchol     0.0035070  0.0104519   0.336    0.737    
whratio     0.0895525  0.0839122   1.067    0.286    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1502 on 695 degrees of freedom
Multiple R-squared:  0.1086,    Adjusted R-squared:  0.1009 
F-statistic: 14.11 on 6 and 695 DF,  p-value: 3.37e-15

The broom package gives us the opportunity of extracting the coefficient, residual, etc into a dataframe. We next do this below

Code
lm1 %>% broom::tidy()
# A tibble: 7 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 0.470     0.0744       6.31  4.88e-10
2 sbp         0.000606  0.000404     1.50  1.35e- 1
3 dbp         0.000216  0.000666     0.325 7.46e- 1
4 ldl         0.00213   0.0130       0.164 8.70e- 1
5 ageyrs      0.00443   0.000687     6.45  2.15e-10
6 totchol     0.00351   0.0105       0.336 7.37e- 1
7 whratio     0.0896    0.0839       1.07  2.86e- 1
Code
lm1 %>% broom::augment()
# A tibble: 702 × 13
   cca_0   sbp   dbp   ldl ageyrs totchol whratio .fitted  .resid    .hat .sigma
   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 0.925   130    80   4.5     58     6.7   0.9     0.936 -0.0113 0.00677  0.150
 2 0.975   120    80   2.3     48     4.4   0.924   0.875  0.0996 0.00359  0.150
 3 0.85    110    80   1.3     40     2.7   0.877   0.822  0.0284 0.00730  0.150
 4 0.875   120    80   3.5     38     5.1   0.869   0.831  0.0438 0.00301  0.150
 5 1.6     150   100   3.1     49     4.9   1.01    0.913  0.687  0.00863  0.148
 6 1.02    130    70   4.2     60     6.3   0.970   0.947  0.0777 0.00986  0.150
 7 1       106    71   2.1     42     3.8   0.949   0.838  0.162  0.00503  0.150
 8 1.32    130   100   3.2     55     4.9   0.843   0.913  0.412  0.0130   0.149
 9 1.1     110    80   3.3     36     5.1   0.814   0.811  0.289  0.00503  0.150
10 1       110    70   0.7     55     3.3   0.912   0.890  0.110  0.0172   0.150
# ℹ 692 more rows
# ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
Code
lm1 %>% broom::glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.109         0.101 0.150      14.1 3.37e-15     6   338. -661. -624.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Code
lm1 %>% 
  broom::augment() %>% 
  ggplot(aes(x=.resid)) + 
  geom_histogram(fill = "blue", col = "black", alpha = .4) +
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Code
lm1 %>% 
  broom::augment() %>%
  ggpubr::ggqqplot(x = ".resid", col = "red", 
                   title = "QQ plot of the residuals of  the model") 

Code
lm1 %>% 
  plot(1, pch = 16, col = "skyblue")

Code
lm1 %>% 
  plot(2, pch = 16, col = "skyblue")

Code
lm1 %>% 
  plot(3, pch = 16, col = "skyblue")

Code
lm1 %>% 
  plot(4, pch = 16, col = "skyblue")

25.4 Checking multi-colinearity

Code
car::vif(lm1) %>% round(3)
    sbp     dbp     ldl  ageyrs totchol whratio 
  2.914   2.725   5.829   1.198   6.155   1.114 

25.5 Variable selection in multiple linear regression

Here we go through the drill of selecting the best model from a set of predictor variables. We do this with the help of the olsrr package.

25.5.1 Forward regression using the p-values for the selection of the best model

We begin by running the model as before

Code
lm1 <- lm(cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio, data = dat)
summary(lm1)

Call:
lm(formula = cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio, 
    data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33971 -0.10555 -0.01751  0.08800  0.68657 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.4697581  0.0744135   6.313 4.88e-10 ***
sbp         0.0006056  0.0004044   1.498    0.135    
dbp         0.0002161  0.0006659   0.325    0.746    
ldl         0.0021276  0.0129757   0.164    0.870    
ageyrs      0.0044292  0.0006872   6.446 2.15e-10 ***
totchol     0.0035070  0.0104519   0.336    0.737    
whratio     0.0895525  0.0839122   1.067    0.286    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1502 on 695 degrees of freedom
Multiple R-squared:  0.1086,    Adjusted R-squared:  0.1009 
F-statistic: 14.11 on 6 and 695 DF,  p-value: 3.37e-15

Next, we perform the forward stepwise selection using a cut-off p.value of 0.1. The table below gives a summary of the selection criteria

Code
fwd.fit.pval <- olsrr::ols_step_forward_p(lm1, penter=0.1)
fwd.fit.pval

                               Stepwise Summary                                
-----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------
 0      Base Model    -592.088    -582.980    -2584.497    0.00000    0.00000 
 1      ageyrs        -657.220    -643.558    -2649.447    0.09120    0.08990 
 2      sbp           -665.999    -647.783    -2658.153    0.10505    0.10249 
 3      totchol       -665.472    -642.703    -2657.597    0.10692    0.10308 
 4      whratio       -664.658    -637.334    -2656.749    0.10843    0.10331 
-----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.329       RMSE                  0.149 
R-Squared               0.108       MSE                   0.022 
Adj. R-Squared          0.103       Coef. Var            17.374 
Pred R-Squared          0.095       AIC                -664.658 
MAE                     0.117       SBC                -637.334 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression      1.907          4          0.477    21.192    0.0000 
Residual       15.676        697          0.022                     
Total          17.583        701                                    
--------------------------------------------------------------------

                                 Parameter Estimates                                  
-------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta      t       Sig      lower    upper 
-------------------------------------------------------------------------------------
(Intercept)    0.473         0.073                 6.461    0.000     0.330    0.617 
     ageyrs    0.004         0.001        0.251    6.461    0.000     0.003    0.006 
        sbp    0.001         0.000        0.106    2.742    0.006     0.000    0.001 
    totchol    0.005         0.004        0.043    1.132    0.258    -0.004    0.014 
    whratio    0.091         0.084        0.041    1.085    0.278    -0.073    0.255 
-------------------------------------------------------------------------------------

The final best-fitting selected model is as below. However, I will run it on the standardized version of our variables as the coefficients are too small

Code
dat2 <-  
  dat %>% 
  mutate_at(c("ageyrs", "sbp"), ~(scale(.) %>% as.vector))

lm.pval <- lm(cca_0 ~ sbp + ageyrs, data = dat2)
coefficients(lm.pval)
(Intercept)         sbp      ageyrs 
 0.86317664  0.01938752  0.04248649 
Code
lm.pval %>% flextable::as_flextable()

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

0.863

0.006

152.427

0.0000

***

sbp

0.019

0.006

3.289

0.0011

**

ageyrs

0.042

0.006

7.207

0.0000

***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.15 on 699 degrees of freedom

Multiple R-squared: 0.105, Adjusted R-squared: 0.1025

F-statistic: 41.02 on 699 and 2 DF, p-value: 0.0000

Code
lm.pval %>%
    broom::tidy(conf.int=T) %>% 
    mutate(across(estimate:conf.high,~round(.x, 3))) %>% 
    flextable::flextable() %>% 
    flextable::font(i=1:3, fontname = "Times New Roman") %>% 
    flextable::bold(i=1, j=1:7,part = "header") %>% 
    flextable::theme_zebra()

term

estimate

std.error

statistic

p.value

conf.low

conf.high

(Intercept)

0.863

0.006

152.427

0.000

0.852

0.874

sbp

0.019

0.006

3.289

0.001

0.008

0.031

ageyrs

0.042

0.006

7.207

0.000

0.031

0.054

Code
lm.pval %>% performance::check_collinearity()
# Check for Multicollinearity

Low Correlation

   Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
    sbp 1.08 [1.03, 1.23]         1.04      0.92     [0.81, 0.97]
 ageyrs 1.08 [1.03, 1.23]         1.04      0.92     [0.81, 0.97]
Code
lm.pval %>% performance::check_collinearity() %>% plot()

Code
lm.pval %>% performance::check_heteroscedasticity()
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.011).
Code
lm.pval %>% performance::check_heteroscedasticity() %>% plot()

Code
lm.pval %>% performance::check_normality()
Warning: Non-normality of residuals detected (p < .001).
Code
lm.pval %>% performance::check_normality() %>% plot() 
For confidence bands, please install `qqplotr`.

Code
lm.pval %>% performance::check_predictions()
Warning: Maximum value of original data is not included in the
  replicated data.
  Model may not capture the variation of the data.

Code
lm.pval %>% performance::check_predictions() %>% plot()

Code
lm.pval %>% performance::check_outliers()
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Code
lm.pval %>% performance::check_outliers() %>% plot()

Code
plot(lm.pval)

25.5.2 Forward regression using the AIC for the selection of the best model

We change next to perform the forward stepwise selection using the AIC. The table below gives a summary of the selection criteria

Code
fwd.fit.aic <- 
    olsrr::ols_step_forward_aic(lm1)
fwd.fit.aic

                               Stepwise Summary                                
-----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------
 0      Base Model    -592.088    -582.980    -2584.497    0.00000    0.00000 
 1      ageyrs        -657.220    -643.558    -2649.447    0.09120    0.08990 
 2      sbp           -665.999    -647.783    -2658.153    0.10505    0.10249 
-----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.324       RMSE                  0.150 
R-Squared               0.105       MSE                   0.022 
Adj. R-Squared          0.102       Coef. Var            17.382 
Pred R-Squared          0.097       AIC                -665.999 
MAE                     0.117       SBC                -647.783 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression      1.847          2          0.924    41.023    0.0000 
Residual       15.736        699          0.023                     
Total          17.583        701                                    
--------------------------------------------------------------------

                                 Parameter Estimates                                  
-------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
-------------------------------------------------------------------------------------
(Intercept)    0.553         0.036                 15.199    0.000    0.482    0.625 
     ageyrs    0.005         0.001        0.268     7.207    0.000    0.003    0.006 
        sbp    0.001         0.000        0.122     3.289    0.001    0.000    0.001 
-------------------------------------------------------------------------------------
Code
plot(fwd.fit.aic)

25.5.3 Backward regression using the p.values for the selection of the best model

We change next to perform the backward stepwise removal using the p-value. The table below gives a summary of the selection criteria

Code
bwd.fit.pval <- 
    olsrr::ols_step_backward_p(lm1, prem=0.1)
bwd.fit.pval

                               Stepwise Summary                                
-----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------
 0      Full Model    -660.788    -624.357    -2652.837    0.10860    0.10090 
 1      ldl           -662.761    -630.883    -2654.831    0.10856    0.10216 
 2      dbp           -664.658    -637.334    -2656.749    0.10843    0.10331 
-----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.329       RMSE                  0.149 
R-Squared               0.108       MSE                   0.022 
Adj. R-Squared          0.103       Coef. Var            17.374 
Pred R-Squared          0.095       AIC                -664.658 
MAE                     0.117       SBC                -637.334 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression      1.907          4          0.477    21.192    0.0000 
Residual       15.676        697          0.022                     
Total          17.583        701                                    
--------------------------------------------------------------------

                                 Parameter Estimates                                  
-------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta      t       Sig      lower    upper 
-------------------------------------------------------------------------------------
(Intercept)    0.473         0.073                 6.461    0.000     0.330    0.617 
        sbp    0.001         0.000        0.106    2.742    0.006     0.000    0.001 
     ageyrs    0.004         0.001        0.251    6.461    0.000     0.003    0.006 
    totchol    0.005         0.004        0.043    1.132    0.258    -0.004    0.014 
    whratio    0.091         0.084        0.041    1.085    0.278    -0.073    0.255 
-------------------------------------------------------------------------------------

25.5.4 Backward regression using the aic for the selection of the best model

Code
bwd.fit.aic <- 
    olsrr::ols_step_backward_aic(lm1)
bwd.fit.aic

                               Stepwise Summary                                
-----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------
 0      Full Model    -660.788    -624.357    -2652.837    0.10860    0.10090 
 1      ldl           -662.761    -630.883    -2654.847    0.10856    0.10216 
 2      dbp           -664.658    -637.334    -2656.776    0.10843    0.10331 
 3      whratio       -665.472    -642.703    -2657.616    0.10692    0.10308 
 4      totchol       -665.999    -647.783    -2658.163    0.10505    0.10249 
-----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.324       RMSE                  0.150 
R-Squared               0.105       MSE                   0.022 
Adj. R-Squared          0.102       Coef. Var            17.382 
Pred R-Squared          0.097       AIC                -665.999 
MAE                     0.117       SBC                -647.783 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression      1.847          2          0.924    41.023    0.0000 
Residual       15.736        699          0.023                     
Total          17.583        701                                    
--------------------------------------------------------------------

                                 Parameter Estimates                                  
-------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
-------------------------------------------------------------------------------------
(Intercept)    0.553         0.036                 15.199    0.000    0.482    0.625 
        sbp    0.001         0.000        0.122     3.289    0.001    0.000    0.001 
     ageyrs    0.005         0.001        0.268     7.207    0.000    0.003    0.006 
-------------------------------------------------------------------------------------
Code
plot(bwd.fit.aic)

25.5.5 Both direction regression using the p.values for the selection of the best model

We change next to perform the backward stepwise removal using the p-value. The table below gives a summary of the selection criteria

Code
both.fit.pval <- 
    olsrr::ols_step_both_p(
        lm1, 
        prem = 0.1, 
        penter = 0.1,  
        progress = TRUE)
Stepwise Selection Method 
-------------------------

Candidate Terms: 

1. sbp 
2. dbp 
3. ldl 
4. ageyrs 
5. totchol 
6. whratio 


Variables Added/Removed: 

=> ageyrs added 
=> sbp added 

No more variables to be added or removed.
Code
both.fit.pval

                               Stepwise Summary                                
-----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------
 0      Base Model    -592.088    -582.980    -2584.497    0.00000    0.00000 
 1      ageyrs (+)    -657.220    -643.558    -2649.447    0.09120    0.08990 
 2      sbp (+)       -665.999    -647.783    -2658.153    0.10505    0.10249 
-----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.324       RMSE                  0.150 
R-Squared               0.105       MSE                   0.022 
Adj. R-Squared          0.102       Coef. Var            17.382 
Pred R-Squared          0.097       AIC                -665.999 
MAE                     0.117       SBC                -647.783 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression      1.847          2          0.924    41.023    0.0000 
Residual       15.736        699          0.023                     
Total          17.583        701                                    
--------------------------------------------------------------------

                                 Parameter Estimates                                  
-------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
-------------------------------------------------------------------------------------
(Intercept)    0.553         0.036                 15.199    0.000    0.482    0.625 
     ageyrs    0.005         0.001        0.268     7.207    0.000    0.003    0.006 
        sbp    0.001         0.000        0.122     3.289    0.001    0.000    0.001 
-------------------------------------------------------------------------------------

25.5.6 Both direction regression using the aic for the selection of the best model

Code
both.fit.aic <- 
    olsrr::ols_step_both_aic(lm1, progress = TRUE)
Stepwise Selection Method 
-------------------------

Candidate Terms: 

1. sbp 
2. dbp 
3. ldl 
4. ageyrs 
5. totchol 
6. whratio 


Variables Added/Removed: 

=> ageyrs added 
=> sbp added 

No more variables to be added or removed.
Code
both.fit.aic

                               Stepwise Summary                                
-----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------
 0      Base Model    -592.088    -582.980    -2584.497    0.00000    0.00000 
 1      ageyrs (+)    -657.220    -643.558    -2649.447    0.09120    0.08990 
 2      sbp (+)       -665.999    -647.783    -2658.153    0.10505    0.10249 
-----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.324       RMSE                  0.150 
R-Squared               0.105       MSE                   0.022 
Adj. R-Squared          0.102       Coef. Var            17.382 
Pred R-Squared          0.097       AIC                -665.999 
MAE                     0.117       SBC                -647.783 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression      1.847          2          0.924    41.023    0.0000 
Residual       15.736        699          0.023                     
Total          17.583        701                                    
--------------------------------------------------------------------

                                 Parameter Estimates                                  
-------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
-------------------------------------------------------------------------------------
(Intercept)    0.553         0.036                 15.199    0.000    0.482    0.625 
     ageyrs    0.005         0.001        0.268     7.207    0.000    0.003    0.006 
        sbp    0.001         0.000        0.122     3.289    0.001    0.000    0.001 
-------------------------------------------------------------------------------------
Code
plot(both.fit.aic)

25.5.7 Pick the best predicting model.

The output below is divided into two tables. The first picks out the best 1, 2, 3, etc predictir model. Hence the best 3 predictor model for our linear regression is sbp ageyrs totchol. However, to pick the best model out of the lot, in case 6, we look at the second table. The best will have a high R-Square Adj.R-Square and Pred R-Square. Also, the best model will give the lower C(p), AIC, SBIC, SBC, MSEP, FPE, HSP, and APC. Looking at our output it appears the best will be the 2 predictor mode of lm(cca_0 ~ ageyrs + sbp).

Code
modcompare <- 
    olsrr::ols_step_best_subset(lm1)

modcompare
             Best Subsets Regression             
-------------------------------------------------
Model Index    Predictors
-------------------------------------------------
     1         ageyrs                             
     2         sbp ageyrs                         
     3         sbp ageyrs totchol                 
     4         sbp ageyrs totchol whratio         
     5         sbp dbp ageyrs totchol whratio     
     6         sbp dbp ldl ageyrs totchol whratio 
-------------------------------------------------

                                                     Subsets Regression Summary                                                      
-------------------------------------------------------------------------------------------------------------------------------------
                       Adj.        Pred                                                                                               
Model    R-Square    R-Square    R-Square     C(p)         AIC          SBIC          SBC        MSEP       FPE       HSP       APC  
-------------------------------------------------------------------------------------------------------------------------------------
  1        0.0912      0.0899      0.0858    10.5636    -657.2199    -2649.4469    -643.5581    16.0250    0.0229    0.0000    0.9140 
  2        0.1050      0.1025      0.0968     1.7667    -665.9991    -2658.1525    -647.7834    15.8035    0.0226    0.0000    0.9026 
  3        0.1069      0.1031      0.0961     2.3041    -665.4722    -2657.5966    -642.7025    15.7930    0.0226    0.0000    0.9033 
  4        0.1084      0.1033      0.0952     3.1292    -664.6577    -2656.7487    -637.3341    15.7890    0.0227    0.0000    0.9044 
  5        0.1086      0.1022      0.0929     5.0269    -662.7610    -2654.8305    -630.8835    15.8093    0.0227    0.0000    0.9068 
  6        0.1086      0.1009      0.0912     7.0000    -660.7882    -2652.8371    -624.3567    15.8315    0.0228    0.0000    0.9094 
-------------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria 
 SBIC: Sawa's Bayesian Information Criteria 
 SBC: Schwarz Bayesian Criteria 
 MSEP: Estimated error of prediction, assuming multivariate normality 
 FPE: Final Prediction Error 
 HSP: Hocking's Sp 
 APC: Amemiya Prediction Criteria