Acquiring the data
We begin by reading in the data. Here we use the Carotid Intima dataset.
Code
# Read the stored R object, keep the analysis variables, drop records with
# missing cholesterol values, and rescale triglyceride values above 40,
# which appear to have been recorded a factor of 10 too high
dat <-
  dget("C:/Dataset/cint_data_clean") %>%
  select(cca_0, sex, ageyrs, resid, hba1c, tobacco, alcohol, bmi,
         whratio, totchol, ldl, hdl, trig, sbp, dbp) %>%
  filter(!is.na(totchol), !is.na(ldl)) %>%
  mutate(trig = if_else(trig > 40, trig / 10, trig, missing = NULL))
Next, we summarize the data
Code
dat %>%
  summarytools::dfSummary(graph.col = FALSE)
Data Frame Summary
dat
Dimensions: 702 x 15
Duplicates: 4
-------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Valid Missing
---- ----------- ------------------------- --------------------- ---------- ---------
1 cca_0 Mean (sd) : 0.9 (0.2) 40 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
0.5 < 0.8 < 1.6
IQR (CV) : 0.2 (0.2)
2 sex 1. Female 545 (77.6%) 702 0
[factor] 2. Male 157 (22.4%) (100.0%) (0.0%)
3 ageyrs Mean (sd) : 44.3 (9) 46 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
26 < 43 < 75
IQR (CV) : 12 (0.2)
4 resid 1. Rural 24 ( 3.4%) 702 0
[factor] 2. Periurban 174 (24.8%) (100.0%) (0.0%)
3. Urban 504 (71.8%)
5 hba1c Mean (sd) : 5.4 (1) 56 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
3.1 < 5.3 < 15.5
IQR (CV) : 0.8 (0.2)
6 tobacco 1. No 657 (93.6%) 702 0
[factor] 2. Yes 45 ( 6.4%) (100.0%) (0.0%)
7 alcohol 1. No 458 (65.2%) 702 0
[factor] 2. Yes 244 (34.8%) (100.0%) (0.0%)
8 bmi Mean (sd) : 26.6 (5.6) 582 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
13.3 < 26.1 < 45.3
IQR (CV) : 7.7 (0.2)
9 whratio Mean (sd) : 0.9 (0.1) 456 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
0.6 < 0.9 < 1.2
IQR (CV) : 0.1 (0.1)
10 totchol Mean (sd) : 5.1 (1.3) 73 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
1.1 < 5 < 11.1
IQR (CV) : 1.7 (0.3)
11 ldl Mean (sd) : 3.2 (1.1) 59 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
0 < 3.1 < 8.1
IQR (CV) : 1.4 (0.3)
12 hdl Mean (sd) : 1.4 (0.5) 31 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
0.3 < 1.3 < 4.4
IQR (CV) : 0.6 (0.4)
13 trig Mean (sd) : 1.4 (0.9) 227 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
0.3 < 1.2 < 10.7
IQR (CV) : 0.8 (0.7)
14 sbp Mean (sd) : 125 (23.9) 118 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
66 < 121 < 231
IQR (CV) : 28 (0.2)
15 dbp Mean (sd) : 79.5 (14.1) 71 distinct values 702 0
[numeric] min < med < max: (100.0%) (0.0%)
43 < 79 < 135
IQR (CV) : 19 (0.2)
-------------------------------------------------------------------------------------
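The summary flags four duplicate rows. Before modelling, it may be worth inspecting them; a minimal sketch (an addition, not part of the original analysis):
Code
# Hedged sketch: show every row that occurs more than once in the data,
# by scanning for duplicates from both ends of the data frame.
dat %>%
  filter(duplicated(.) | duplicated(., fromLast = TRUE))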
Linear regression with a single continuous variable
We begin by looking at the relationship between the dependent and independent variables
Code
dat %>%
  ggplot(aes(x = ageyrs, y = cca_0)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    x = "Age in years",
    y = "Common Carotid Intima thickness",
    title = "Relationship between CCA and Age of patient") +
  theme_classic()
Since the relationship between the two looks linear, we go on to fit the model.
Code
lm.1 <-
  lm(cca_0 ~ ageyrs, data = dat)
Next, we summarise the model and extract the coefficients and confidence intervals with the help of the flextable package.
Code
lm.1 %>% flextable::as_flextable()
              Estimate   Standard Error   t value   Pr(>|t|)
(Intercept)   0.628      0.029            21.988    0.0000     ***
ageyrs        0.005      0.001            8.381     0.0000     ***
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Residual standard error: 0.1511 on 700 degrees of freedom
Multiple R-squared: 0.0912, Adjusted R-squared: 0.0899
F-statistic: 70.25 on 1 and 700 DF, p-value: 0.0000
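The confidence intervals mentioned above can also be pulled straight from the fitted model with base R; a minimal sketch (an addition, not part of the original output):
Code
# Hedged sketch: point estimates and 95% confidence intervals for the
# intercept and the age slope.
coef(lm.1)
confint(lm.1, level = 0.95)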
Next, we extract the statistics required for the regression diagnostics.
Code
tibble(
  resid    = residuals(lm.1),          # raw residuals
  fits     = fitted(lm.1),             # fitted values
  st.resid = rstandard(lm.1),          # standardized residuals
  cookd    = cooks.distance(lm.1),     # Cook's distance
  covr     = covratio(lm.1),           # covariance ratios
  hatv     = hatvalues(lm.1),          # leverage values
  dfit     = dffits(lm.1),             # DFFITS
  dfbeta1  = dfbeta(lm.1)[, 2]) %>%    # DFBETA for the age slope
  arrange(desc(cookd)) %>%
  round(5) %>%
  slice(1:10) %>%
  kableExtra::kable()
 resid      fits      st.resid   cookd     covr      hatv      dfit       dfbeta1
 0.40242   0.97258    2.67540   0.03210   0.99127   0.00889    0.25449    0.00015
 0.52507   0.92493    3.48188   0.02314   0.97212   0.00380    0.21685    0.00011
-0.25435   1.00435   -1.69521   0.02018   1.00862   0.01385   -0.20119   -0.00012
 0.71213   0.88787    4.71760   0.02012   0.94181   0.00180    0.20372    0.00006
 0.53625   0.81375    3.55451   0.01868   0.96985   0.00295    0.19491   -0.00009
 0.34330   0.95670    2.28006   0.01800   0.99487   0.00688    0.19033    0.00011
 0.19948   1.02552    1.33222   0.01614   1.01593   0.01786    0.17975    0.00011
-0.24376   0.99376   -1.62316   0.01608   1.00748   0.01206   -0.17953   -0.00011
 0.35389   0.94611    2.34901   0.01585   0.99279   0.00571    0.17864    0.00010
-0.25287   0.97787   -1.68180   0.01375   1.00445   0.00963   -0.16604   -0.00010
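The table is sorted by Cook's distance. As a hedged aside (a rule of thumb, not part of the original analysis), observations whose Cook's distance exceeds 4/n are often flagged as potentially influential:
Code
# Hedged sketch: flag observations with Cook's distance above the common
# 4/n cut-off; returns the row indices of the flagged points.
n <- nobs(lm.1)
which(cooks.distance(lm.1) > 4 / n)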
And then plot the regression diagnostic graphs
Code
opar <- par(mfrow = c(2, 2))
plot(lm.1, pch = 18, cex = .5)
Code
par(opar)
Below we formally check the model assumptions with the performance package. First, the normality of the residuals.
Code
performance::check_predictions(lm.1)
Warning: Maximum value of original data is not included in the
replicated data.
Model may not capture the variation of the data.
Code
performance::check_normality(lm.1)
Warning: Non-normality of residuals detected (p < .001).
Code
performance::check_normality(lm.1) %>% plot()
For confidence bands, please install `qqplotr`.
Next, we check for heteroscedasticity.
Code
performance::check_heteroscedasticity(lm.1)
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.018).
Code
performance::check_heteroscedasticity(lm.1) %>%
  plot()
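As a cross-check (an addition, assuming the lmtest package is installed), the classic Breusch-Pagan test gives a formal test of constant error variance:
Code
# Hedged sketch: Breusch-Pagan test; the null hypothesis is
# homoscedastic (constant-variance) errors.
lmtest::bptest(lm.1)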
Code
performance::check_distribution(lm.1)
# Distribution of Model Family
Predicted Distribution of Residuals
Distribution Probability
normal 34%
beta 19%
cauchy 16%
Predicted Distribution of Response
Distribution Probability
inverse-gamma 28%
gamma 22%
beta 16%
Code
performance::check_distribution(lm.1) %>%
  plot()
Linear regression with one continuous and one categorical predictor
Code
lm.2 <-
  lm(cca_0 ~ ageyrs + sex, data = dat)

lm.2 %>%
  summary()
Call:
lm(formula = cca_0 ~ ageyrs + sex, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.35690 -0.11033 -0.01512 0.08711 0.71810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6301958 0.0285622 22.064 < 2e-16 ***
ageyrs 0.0051368 0.0006377 8.056 3.42e-15 ***
sexMale 0.0234025 0.0138148 1.694 0.0907 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1509 on 699 degrees of freedom
Multiple R-squared: 0.09491, Adjusted R-squared: 0.09233
F-statistic: 36.65 on 2 and 699 DF, p-value: 7.294e-16
And then plot the diagnostics
Code
opar <- par(mfrow = c(2, 2))
lm.2 %>%
  plot(pch = 18, cex = .5)
Code
par(opar)
Further, we plot the fitted regression line for each sex separately.
Code
dat %>%
  ggplot(aes(x = ageyrs, y = cca_0, col = sex)) +
  geom_smooth(formula = y ~ x, method = "lm") +
  geom_point() +
  labs(
    x = "Age in years",
    y = "CCA",
    title = "CCA vs Age for each sex") +
  theme_bw()
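Note that geom_smooth() fits a separate regression within each sex, so the two lines may have different slopes, whereas lm.2 constrains the slopes to be equal. Model-based predictions from lm.2 can be drawn with ggeffects; a minimal sketch (an addition, mirroring the lm.3 plot later in the text):
Code
# Hedged sketch: parallel-slope predictions by sex from the additive model.
ggeffects::ggpredict(lm.2, terms = c("ageyrs", "sex")) %>%
  plot()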
Linear regression with one continuous and one categorical predictor with interaction
Code
lm.3 <-
  lm(cca_0 ~ ageyrs * sex, data = dat)

lm.3 %>%
  summary()
Call:
lm(formula = cca_0 ~ ageyrs * sex, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.35711 -0.10997 -0.01482 0.08664 0.71789
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6284770 0.0330494 19.016 < 2e-16 ***
ageyrs 0.0051762 0.0007429 6.968 7.45e-12 ***
sexMale 0.0303114 0.0681101 0.445 0.656
ageyrs:sexMale -0.0001503 0.0014510 -0.104 0.918
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.151 on 698 degrees of freedom
Multiple R-squared: 0.09493, Adjusted R-squared: 0.09104
F-statistic: 24.4 on 3 and 698 DF, p-value: 5.025e-15
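Since the interaction term is far from significant (p = 0.918), we can also compare the models with and without it directly via a nested-model F test; a minimal sketch (an addition, assuming lm.2 from the previous section is still in the workspace):
Code
# Hedged sketch: F test comparing the additive model to the interaction
# model; a large p-value favours the simpler (additive) model.
anova(lm.2, lm.3)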
And then plot the diagnostics
Code
opar <- par(mfrow = c(2, 2))
lm.3 %>%
  plot(pch = 18, cex = .5)
Code
par(opar)
Next, we plot the marginal effects with the ggeffects package.
Code
pr.3 <-
  ggeffects::ggpredict(lm.3, terms = c("ageyrs", "sex"))

pr.3 %>%
  plot() +
  labs(
    x = "Age in years",
    y = "CCA thickness",
    title = "Marginal relationship")
Linear regression with one continuous and two categorical predictors with interaction
Code
lm.4 <-
  lm(cca_0 ~ ageyrs * sex * resid, data = dat)

lm.4 %>%
  summary()
Call:
lm(formula = cca_0 ~ ageyrs * sex * resid, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.36621 -0.10171 -0.01644 0.08202 0.70879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.434224 0.155773 2.788 0.00546 **
ageyrs 0.010017 0.003277 3.057 0.00232 **
sexMale -1.476681 1.361101 -1.085 0.27834
residPeriurban 0.171485 0.167778 1.022 0.30710
residUrban 0.213439 0.160840 1.327 0.18494
ageyrs:sexMale 0.034164 0.031289 1.092 0.27527
ageyrs:residPeriurban -0.005066 0.003557 -1.424 0.15476
ageyrs:residUrban -0.005046 0.003400 -1.484 0.13821
sexMale:residPeriurban 1.661961 1.367568 1.215 0.22468
sexMale:residUrban 1.452145 1.363439 1.065 0.28722
ageyrs:sexMale:residPeriurban -0.035862 0.031413 -1.142 0.25401
ageyrs:sexMale:residUrban -0.033653 0.031336 -1.074 0.28322
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1499 on 690 degrees of freedom
Multiple R-squared: 0.1184, Adjusted R-squared: 0.1044
F-statistic: 8.427 on 11 and 690 DF, p-value: 5.047e-14
And then plot the diagnostics
Code
opar <- par(mfrow = c(2, 2))
lm.4 %>%
  plot(pch = 18, cex = .5)
Code
par(opar)
Next, we plot the marginal effects with the ggeffects package.
Code
pr.4 <- ggeffects::ggpredict(lm.4, terms = c("ageyrs", "sex", "resid"))

pr.4 %>%
  plot() +
  labs(x = "Age in years", y = "CCA thickness", title = "Marginal relationship")
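Before moving on, the four models fitted so far can be lined up on their information criteria; a minimal sketch (an addition, assuming lm.1 through lm.4 are all in the workspace):
Code
# Hedged sketch: AIC and BIC for the four candidate models;
# smaller values indicate a better penalized fit.
AIC(lm.1, lm.2, lm.3, lm.4)
BIC(lm.1, lm.2, lm.3, lm.4)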
There is a set of criteria for evaluating linear regression equations. These are:
- There must be a linear relationship between the predictor and dependent variables
- The residuals must be independent (see the Durbin-Watson sketch after this list)
- Homoscedasticity: the variance of the residuals must be uniform at every level of x
- The residuals must be normally distributed
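As a hedged aside (not part of the original analysis), the independence assumption can be tested with a Durbin-Watson test from the car package:
Code
# Hedged sketch: Durbin-Watson test for autocorrelated residuals;
# the null hypothesis is that the errors are independent.
car::durbinWatsonTest(lm.1)

Before fitting a multiple regression model, we first examine the correlations among the candidate predictor variables.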
Code
corr.mat <-
  dat %>%
  select(cca_0, ageyrs, whratio, totchol, ldl, sbp, dbp) %>%
  cor() %>%
  round(4)

corr.mat %>%
  kableExtra::kbl(align = c(rep("l", 7)), caption = "Correlation matrix") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "bordered",
                                                  "condensed", "hover", "responsive"),
                            full_width = FALSE, html_font = "sans-serif",
                            font_size = 16)
Correlation matrix
          cca_0    ageyrs   whratio  totchol  ldl      sbp      dbp
cca_0     1.0000   0.3020   0.1398   0.1387   0.1131   0.1963   0.1538
ageyrs    0.3020   1.0000   0.2962   0.2196   0.1563   0.2755   0.1782
whratio   0.1398   0.2962   1.0000   0.1521   0.1247   0.1681   0.1509
totchol   0.1387   0.2196   0.1521   1.0000   0.9086   0.3198   0.2647
ldl       0.1131   0.1563   0.1247   0.9086   1.0000   0.2553   0.2069
sbp       0.1963   0.2755   0.1681   0.3198   0.2553   1.0000   0.7936
dbp       0.1538   0.1782   0.1509   0.2647   0.2069   0.7936   1.0000
And then graphically present the correlation matrix
Code
dat %>%
  select(cca_0, ageyrs, whratio, totchol, ldl, sbp, dbp) %>%
  pairs(pch = ".")
Code
corrplot::corrplot(corr.mat)
Both the correlation matrix and the plot indicate a high correlation between sbp and dbp, and between totchol and ldl. These could potentially be indicative of multicollinearity.
Fitting the model
We then fit the linear model and summarize it
Code
lm1 <- lm(cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio, data = dat)
summary(lm1)
Call:
lm(formula = cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.33971 -0.10555 -0.01751 0.08800 0.68657
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4697581 0.0744135 6.313 4.88e-10 ***
sbp 0.0006056 0.0004044 1.498 0.135
dbp 0.0002161 0.0006659 0.325 0.746
ldl 0.0021276 0.0129757 0.164 0.870
ageyrs 0.0044292 0.0006872 6.446 2.15e-10 ***
totchol 0.0035070 0.0104519 0.336 0.737
whratio 0.0895525 0.0839122 1.067 0.286
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1502 on 695 degrees of freedom
Multiple R-squared: 0.1086, Adjusted R-squared: 0.1009
F-statistic: 14.11 on 6 and 695 DF, p-value: 3.37e-15
The broom package lets us extract the coefficients, residuals, and overall model statistics into data frames using tidy(), augment(), and glance(). We do this below.
Code
lm1 %>%
  broom::tidy()
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.470 0.0744 6.31 4.88e-10
2 sbp 0.000606 0.000404 1.50 1.35e- 1
3 dbp 0.000216 0.000666 0.325 7.46e- 1
4 ldl 0.00213 0.0130 0.164 8.70e- 1
5 ageyrs 0.00443 0.000687 6.45 2.15e-10
6 totchol 0.00351 0.0105 0.336 7.37e- 1
7 whratio 0.0896 0.0839 1.07 2.86e- 1
Code
lm1 %>%
  broom::augment()
# A tibble: 702 × 13
cca_0 sbp dbp ldl ageyrs totchol whratio .fitted .resid .hat .sigma
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.925 130 80 4.5 58 6.7 0.9 0.936 -0.0113 0.00677 0.150
2 0.975 120 80 2.3 48 4.4 0.924 0.875 0.0996 0.00359 0.150
3 0.85 110 80 1.3 40 2.7 0.877 0.822 0.0284 0.00730 0.150
4 0.875 120 80 3.5 38 5.1 0.869 0.831 0.0438 0.00301 0.150
5 1.6 150 100 3.1 49 4.9 1.01 0.913 0.687 0.00863 0.148
6 1.02 130 70 4.2 60 6.3 0.970 0.947 0.0777 0.00986 0.150
7 1 106 71 2.1 42 3.8 0.949 0.838 0.162 0.00503 0.150
8 1.32 130 100 3.2 55 4.9 0.843 0.913 0.412 0.0130 0.149
9 1.1 110 80 3.3 36 5.1 0.814 0.811 0.289 0.00503 0.150
10 1 110 70 0.7 55 3.3 0.912 0.890 0.110 0.0172 0.150
# ℹ 692 more rows
# ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
Code
lm1 %>%
  broom::glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.109 0.101 0.150 14.1 3.37e-15 6 338. -661. -624.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Code
lm1 %>%
  broom::augment() %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(fill = "blue", col = "black", alpha = .4) +
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Code
lm1 %>%
  broom::augment() %>%
  ggpubr::ggqqplot(x = ".resid", col = "red",
                   title = "QQ plot of the residuals of the model")
Code
lm1 %>%
  plot(1, pch = 16, col = "skyblue")   # residuals vs fitted
Code
lm1 %>%
  plot(2, pch = 16, col = "skyblue")   # normal Q-Q plot
Code
lm1 %>%
  plot(3, pch = 16, col = "skyblue")   # scale-location
Code
lm1 %>%
  plot(4, pch = 16, col = "skyblue")   # Cook's distance
Checking multicollinearity
Code
car::vif(lm1) %>% round(3)
sbp dbp ldl ageyrs totchol whratio
2.914 2.725 5.829 1.198 6.155 1.114
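The VIFs for ldl and totchol exceed the common cut-off of 5 (a rule of thumb), consistent with their 0.91 correlation. As a hedged illustration (an addition, not the author's analysis), refitting without totchol should bring the remaining VIFs down:
Code
# Hedged sketch: drop the highest-VIF predictor (totchol) and recheck;
# VIFs near 1 suggest little remaining collinearity.
lm1b <- lm(cca_0 ~ sbp + dbp + ldl + ageyrs + whratio, data = dat)
car::vif(lm1b) %>% round(3)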
Variable selection in multiple linear regression
Here we go through the drill of selecting the best model from a set of predictor variables. We do this with the help of the olsrr package.
Forward regression using the p-values for the selection of the best model
We begin by running the model as before
Code
lm1 <- lm(cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio, data = dat)
summary(lm1)
Call:
lm(formula = cca_0 ~ sbp + dbp + ldl + ageyrs + totchol + whratio,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.33971 -0.10555 -0.01751 0.08800 0.68657
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4697581 0.0744135 6.313 4.88e-10 ***
sbp 0.0006056 0.0004044 1.498 0.135
dbp 0.0002161 0.0006659 0.325 0.746
ldl 0.0021276 0.0129757 0.164 0.870
ageyrs 0.0044292 0.0006872 6.446 2.15e-10 ***
totchol 0.0035070 0.0104519 0.336 0.737
whratio 0.0895525 0.0839122 1.067 0.286
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1502 on 695 degrees of freedom
Multiple R-squared: 0.1086, Adjusted R-squared: 0.1009
F-statistic: 14.11 on 6 and 695 DF, p-value: 3.37e-15
Next, we perform the forward stepwise selection using a cut-off p-value of 0.1. The table below gives a summary of the selection criteria.
Code
fwd.fit.pval <- olsrr::ols_step_forward_p(lm1, penter = 0.1)
fwd.fit.pval
Stepwise Summary
-----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-----------------------------------------------------------------------------
0 Base Model -592.088 -582.980 -2584.497 0.00000 0.00000
1 ageyrs -657.220 -643.558 -2649.447 0.09120 0.08990
2 sbp -665.999 -647.783 -2658.153 0.10505 0.10249
3 totchol -665.472 -642.703 -2657.597 0.10692 0.10308
4 whratio -664.658 -637.334 -2656.749 0.10843 0.10331
-----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
----------------------------------------------------------------
R 0.329 RMSE 0.149
R-Squared 0.108 MSE 0.022
Adj. R-Squared 0.103 Coef. Var 17.374
Pred R-Squared 0.095 AIC -664.658
MAE 0.117 SBC -637.334
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------
Regression 1.907 4 0.477 21.192 0.0000
Residual 15.676 697 0.022
Total 17.583 701
--------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------
(Intercept) 0.473 0.073 6.461 0.000 0.330 0.617
ageyrs 0.004 0.001 0.251 6.461 0.000 0.003 0.006
sbp 0.001 0.000 0.106 2.742 0.006 0.000 0.001
totchol 0.005 0.004 0.043 1.132 0.258 -0.004 0.014
whratio 0.091 0.084 0.041 1.085 0.278 -0.073 0.255
-------------------------------------------------------------------------------------
The final selected model is shown below. However, I will refit it using standardized versions of the predictors, since the raw coefficients are otherwise very small.
Code
dat2 <-
  dat %>%
  mutate_at(c("ageyrs", "sbp"), ~ (scale(.) %>% as.vector))

lm.pval <- lm(cca_0 ~ sbp + ageyrs, data = dat2)
coefficients(lm.pval)
(Intercept) sbp ageyrs
0.86317664 0.01938752 0.04248649
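As a quick sanity check (an addition), the scaled predictors in dat2 should now have mean approximately 0 and standard deviation 1:
Code
# Hedged sketch: verify the standardization of ageyrs and sbp.
dat2 %>%
  summarise(across(c(ageyrs, sbp), list(mean = mean, sd = sd)))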
Code
lm.pval %>% flextable::as_flextable()
              Estimate   Standard Error   t value   Pr(>|t|)
(Intercept)   0.863      0.006            152.427   0.0000     ***
sbp           0.019      0.006            3.289     0.0011     **
ageyrs        0.042      0.006            7.207     0.0000     ***
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Residual standard error: 0.15 on 699 degrees of freedom
Multiple R-squared: 0.105, Adjusted R-squared: 0.1025
F-statistic: 41.02 on 2 and 699 DF, p-value: 0.0000
Code
lm.pval %>%
  broom::tidy(conf.int = TRUE) %>%
  mutate(across(estimate:conf.high, ~ round(.x, 3))) %>%
  flextable::flextable() %>%
  flextable::font(i = 1:3, fontname = "Times New Roman") %>%
  flextable::bold(i = 1, j = 1:7, part = "header") %>%
  flextable::theme_zebra()
term          estimate   std.error   statistic   p.value   conf.low   conf.high
(Intercept)   0.863      0.006       152.427     0.000     0.852      0.874
sbp           0.019      0.006       3.289       0.001     0.008      0.031
ageyrs        0.042      0.006       7.207       0.000     0.031      0.054
Code
lm.pval %>% performance::check_collinearity()
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
sbp 1.08 [1.03, 1.23] 1.04 0.92 [0.81, 0.97]
ageyrs 1.08 [1.03, 1.23] 1.04 0.92 [0.81, 0.97]
Code
lm.pval %>% performance::check_collinearity() %>% plot()
Code
lm.pval %>% performance::check_heteroscedasticity()
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.011).
Code
lm.pval %>% performance::check_heteroscedasticity() %>% plot()
Code
lm.pval %>% performance::check_normality()
Warning: Non-normality of residuals detected (p < .001).
Code
lm.pval %>% performance::check_normality() %>% plot()
For confidence bands, please install `qqplotr`.
Code
lm.pval %>% performance::check_predictions()
Warning: Maximum value of original data is not included in the
replicated data.
Model may not capture the variation of the data.
Code
lm.pval %>% performance::check_predictions() %>% plot()
Code
lm.pval %>% performance::check_outliers()
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Code
lm.pval %>% performance::check_outliers() %>% plot()
Forward regression using the AIC for the selection of the best model
Next, we perform the forward stepwise selection using the AIC. The table below gives a summary of the selection criteria.
Code
fwd.fit.aic <-
  olsrr::ols_step_forward_aic(lm1)

fwd.fit.aic
Stepwise Summary
-----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-----------------------------------------------------------------------------
0 Base Model -592.088 -582.980 -2584.497 0.00000 0.00000
1 ageyrs -657.220 -643.558 -2649.447 0.09120 0.08990
2 sbp -665.999 -647.783 -2658.153 0.10505 0.10249
-----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
----------------------------------------------------------------
R 0.324 RMSE 0.150
R-Squared 0.105 MSE 0.022
Adj. R-Squared 0.102 Coef. Var 17.382
Pred R-Squared 0.097 AIC -665.999
MAE 0.117 SBC -647.783
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------
Regression 1.847 2 0.924 41.023 0.0000
Residual 15.736 699 0.023
Total 17.583 701
--------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------
(Intercept) 0.553 0.036 15.199 0.000 0.482 0.625
ageyrs 0.005 0.001 0.268 7.207 0.000 0.003 0.006
sbp 0.001 0.000 0.122 3.289 0.001 0.000 0.001
-------------------------------------------------------------------------------------
Backward regression using the p-values for the selection of the best model
Next, we perform the backward stepwise elimination using a cut-off p-value of 0.1. The table below gives a summary of the selection criteria.
Code
bwd.fit.pval <-
  olsrr::ols_step_backward_p(lm1, prem = 0.1)

bwd.fit.pval
Stepwise Summary
-----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-----------------------------------------------------------------------------
0 Full Model -660.788 -624.357 -2652.837 0.10860 0.10090
1 ldl -662.761 -630.883 -2654.831 0.10856 0.10216
2 dbp -664.658 -637.334 -2656.749 0.10843 0.10331
-----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
----------------------------------------------------------------
R 0.329 RMSE 0.149
R-Squared 0.108 MSE 0.022
Adj. R-Squared 0.103 Coef. Var 17.374
Pred R-Squared 0.095 AIC -664.658
MAE 0.117 SBC -637.334
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------
Regression 1.907 4 0.477 21.192 0.0000
Residual 15.676 697 0.022
Total 17.583 701
--------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------
(Intercept) 0.473 0.073 6.461 0.000 0.330 0.617
sbp 0.001 0.000 0.106 2.742 0.006 0.000 0.001
ageyrs 0.004 0.001 0.251 6.461 0.000 0.003 0.006
totchol 0.005 0.004 0.043 1.132 0.258 -0.004 0.014
whratio 0.091 0.084 0.041 1.085 0.278 -0.073 0.255
-------------------------------------------------------------------------------------
Backward regression using the AIC for the selection of the best model
Next, we perform the backward stepwise elimination using the AIC.
Code
bwd.fit.aic <-
  olsrr::ols_step_backward_aic(lm1)

bwd.fit.aic
Stepwise Summary
-----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-----------------------------------------------------------------------------
0 Full Model -660.788 -624.357 -2652.837 0.10860 0.10090
1 ldl -662.761 -630.883 -2654.847 0.10856 0.10216
2 dbp -664.658 -637.334 -2656.776 0.10843 0.10331
3 whratio -665.472 -642.703 -2657.616 0.10692 0.10308
4 totchol -665.999 -647.783 -2658.163 0.10505 0.10249
-----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
----------------------------------------------------------------
R 0.324 RMSE 0.150
R-Squared 0.105 MSE 0.022
Adj. R-Squared 0.102 Coef. Var 17.382
Pred R-Squared 0.097 AIC -665.999
MAE 0.117 SBC -647.783
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------
Regression 1.847 2 0.924 41.023 0.0000
Residual 15.736 699 0.023
Total 17.583 701
--------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------
(Intercept) 0.553 0.036 15.199 0.000 0.482 0.625
sbp 0.001 0.000 0.122 3.289 0.001 0.000 0.001
ageyrs 0.005 0.001 0.268 7.207 0.000 0.003 0.006
-------------------------------------------------------------------------------------
Both-direction regression using the p-values for the selection of the best model
Next, we perform stepwise selection in both directions (adding and removing variables) using the p-value. The table below gives a summary of the selection criteria.
Code
both.fit.pval <-
  olsrr::ols_step_both_p(
    lm1,
    prem = 0.1,
    penter = 0.1,
    progress = TRUE)
Stepwise Selection Method
-------------------------
Candidate Terms:
1. sbp
2. dbp
3. ldl
4. ageyrs
5. totchol
6. whratio
Variables Added/Removed:
=> ageyrs added
=> sbp added
No more variables to be added or removed.
Code
both.fit.pval
Stepwise Summary
-----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-----------------------------------------------------------------------------
0 Base Model -592.088 -582.980 -2584.497 0.00000 0.00000
1 ageyrs (+) -657.220 -643.558 -2649.447 0.09120 0.08990
2 sbp (+) -665.999 -647.783 -2658.153 0.10505 0.10249
-----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
----------------------------------------------------------------
R 0.324 RMSE 0.150
R-Squared 0.105 MSE 0.022
Adj. R-Squared 0.102 Coef. Var 17.382
Pred R-Squared 0.097 AIC -665.999
MAE 0.117 SBC -647.783
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------
Regression 1.847 2 0.924 41.023 0.0000
Residual 15.736 699 0.023
Total 17.583 701
--------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------
(Intercept) 0.553 0.036 15.199 0.000 0.482 0.625
ageyrs 0.005 0.001 0.268 7.207 0.000 0.003 0.006
sbp 0.001 0.000 0.122 3.289 0.001 0.000 0.001
-------------------------------------------------------------------------------------
Both-direction regression using the AIC for the selection of the best model
Finally, we perform stepwise selection in both directions using the AIC.
Code
both.fit.aic <-
  olsrr::ols_step_both_aic(lm1, progress = TRUE)
Stepwise Selection Method
-------------------------
Candidate Terms:
1. sbp
2. dbp
3. ldl
4. ageyrs
5. totchol
6. whratio
Variables Added/Removed:
=> ageyrs added
=> sbp added
No more variables to be added or removed.
Code
both.fit.aic
Stepwise Summary
-----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-----------------------------------------------------------------------------
0 Base Model -592.088 -582.980 -2584.497 0.00000 0.00000
1 ageyrs (+) -657.220 -643.558 -2649.447 0.09120 0.08990
2 sbp (+) -665.999 -647.783 -2658.153 0.10505 0.10249
-----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
----------------------------------------------------------------
R 0.324 RMSE 0.150
R-Squared 0.105 MSE 0.022
Adj. R-Squared 0.102 Coef. Var 17.382
Pred R-Squared 0.097 AIC -665.999
MAE 0.117 SBC -647.783
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------
Regression 1.847 2 0.924 41.023 0.0000
Residual 15.736 699 0.023
Total 17.583 701
--------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------
(Intercept) 0.553 0.036 15.199 0.000 0.482 0.625
ageyrs 0.005 0.001 0.268 7.207 0.000 0.003 0.006
sbp 0.001 0.000 0.122 3.289 0.001 0.000 0.001
-------------------------------------------------------------------------------------
Pick the best predicting model
The output below is divided into two tables. The first picks out the best 1-, 2-, 3-, etc. predictor models. Hence the best 3-predictor model for our linear regression uses sbp, ageyrs, and totchol. However, to pick the best model out of all six, we look at the second table. The best model will have high R-Square, Adj. R-Square, and Pred R-Square values, and will give the lowest C(p), AIC, SBIC, SBC, MSEP, FPE, HSP, and APC. Looking at our output, it appears the best is the 2-predictor model lm(cca_0 ~ ageyrs + sbp).
Code
modcompare <-
  olsrr::ols_step_best_subset(lm1)

modcompare
Best Subsets Regression
-------------------------------------------------
Model Index Predictors
-------------------------------------------------
1 ageyrs
2 sbp ageyrs
3 sbp ageyrs totchol
4 sbp ageyrs totchol whratio
5 sbp dbp ageyrs totchol whratio
6 sbp dbp ldl ageyrs totchol whratio
-------------------------------------------------
Subsets Regression Summary
-------------------------------------------------------------------------------------------------------------------------------------
Adj. Pred
Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
-------------------------------------------------------------------------------------------------------------------------------------
1 0.0912 0.0899 0.0858 10.5636 -657.2199 -2649.4469 -643.5581 16.0250 0.0229 0.0000 0.9140
2 0.1050 0.1025 0.0968 1.7667 -665.9991 -2658.1525 -647.7834 15.8035 0.0226 0.0000 0.9026
3 0.1069 0.1031 0.0961 2.3041 -665.4722 -2657.5966 -642.7025 15.7930 0.0226 0.0000 0.9033
4 0.1084 0.1033 0.0952 3.1292 -664.6577 -2656.7487 -637.3341 15.7890 0.0227 0.0000 0.9044
5 0.1086 0.1022 0.0929 5.0269 -662.7610 -2654.8305 -630.8835 15.8093 0.0227 0.0000 0.9068
6 0.1086 0.1009 0.0912 7.0000 -660.7882 -2652.8371 -624.3567 15.8315 0.0228 0.0000 0.9094
-------------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria
SBIC: Sawa's Bayesian Information Criteria
SBC: Schwarz Bayesian Criteria
MSEP: Estimated error of prediction, assuming multivariate normality
FPE: Final Prediction Error
HSP: Hocking's Sp
APC: Amemiya Prediction Criteria
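The criteria in the second table can also be inspected graphically; olsrr provides a plot method for the best-subset object (a minimal sketch, assuming the method is available in the installed olsrr version):
Code
# Hedged sketch: panel plots of R-Square, C(p), AIC and related criteria
# across the candidate subsets.
plot(modcompare)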