21  Single Continuous Variable

This section will demonstrate the use of simple linear regression to describe the relationship between variables. We begin by importing the blood data and summarizing it as below:

Code
df_blood <- 
    read_csv("C:/Dataset/blood.csv") %>% 
    select(hb, hct)

df_blood %>% 
    summarytools::dfSummary(graph.col = F, labels.col = F)
Data Frame Summary  
df_blood  
Dimensions: 50 x 2  
Duplicates: 0  

-----------------------------------------------------------------------------------
No   Variable    Stats / Values           Freqs (% of Valid)   Valid      Missing  
---- ----------- ------------------------ -------------------- ---------- ---------
1    hb          Mean (sd) : 8.2 (1.8)    36 distinct values   50         0        
     [numeric]   min < med < max:                              (100.0%)   (0.0%)   
                 5.3 < 7.7 < 12                                                    
                 IQR (CV) : 2.6 (0.2)                                              

2    hct         Mean (sd) : 24.4 (5.1)   45 distinct values   50         0        
     [numeric]   min < med < max:                              (100.0%)   (0.0%)   
                 15.7 < 23 < 35                                                    
                 IQR (CV) : 7.4 (0.2)                                              
-----------------------------------------------------------------------------------

21.1 Plotting

We begin by plotting the distribution of the variables involved

Code
df_blood %>% 
    pivot_longer(cols = c(hb, hct)) %>% 
    ggplot(aes( x = value)) +
    geom_histogram(bins = 8, fill = "gold", color = "black")+
    facet_wrap(facets = "name", scales = "free") +
    theme_bw()
Figure 21.1: Relationship between hemoglobin and hematocrit

Next, we plot the relationship between the hctand hb variables and note the linear relationship.

Code
df_blood %>% 
    ggplot(aes(x = hct, y = hb)) +
    geom_point()+
    geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
    theme_bw()
Figure 21.2: Relationship between hemoglobin and hematocrit

21.2 Assumptions

  1. Linearity: The relationship between the independent variable (X) and the dependent variable (Y) is linear.
  2. Independence: The observations are independent of each other.
  3. Homoscedasticity: The residuals (errors) have constant variance at every level of X.
  4. Normality: The residuals of the model are normally distributed.
  5. No multicollinearity: This is usually more relevant for multiple regression, but it means the independent variables aren’t too highly correlated with each other.

21.3 Model fitting

Code
model <- 
    df_blood %>% 
    lm(hb ~ hct, data = .)

21.4 Visualising model

21.4.1 R base summary

Code
model %>% summary()

Call:
lm(formula = hb ~ hct, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77666 -0.17021  0.02036  0.16771  0.63128 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.314403   0.222547  -1.413    0.164    
hct          0.347682   0.008925  38.957   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3184 on 48 degrees of freedom
Multiple R-squared:  0.9693,    Adjusted R-squared:  0.9687 
F-statistic:  1518 on 1 and 48 DF,  p-value: < 2.2e-16

21.4.2 tab_model

Code
model %>% sjPlot::tab_model() 
  hb
Predictors Estimates CI p
(Intercept) -0.31 -0.76 – 0.13 0.164
hct 0.35 0.33 – 0.37 <0.001
Observations 50
R2 / R2 adjusted 0.969 / 0.969

21.4.3 tidy

Code
model %>% broom::tidy() %>% kableExtra::kable()
term estimate std.error statistic p.value
(Intercept) -0.3144029 0.2225472 -1.412747 0.1641817
hct 0.3476823 0.0089248 38.956816 0.0000000

21.4.4 tbl_uvregression

Code
df_blood %>% 
    gtsummary::tbl_uvregression(
        y = hb,
        method = "lm"
    )
Characteristic N Beta 95% CI1 p-value
hct 50 0.35 0.33, 0.37 <0.001
1 CI = Confidence Interval

21.5 Checking Assumptions

We see no significant violation of the model assumptions

Code
performance::check_model(model)
Figure 21.3: Model Assumptions of simple linear regression

21.6 Prediction interval

Code
model %>% 
    predict(interval = "predict") %>% 
    as_tibble() %>% 
    bind_cols(df_blood) %>% 
    ggplot(aes(x = hct, y = hb)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y~x, se=T)+
    geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") +
    geom_line(aes(y = upr), col = "coral2", linetype = "dashed") +
    labs(
        x = "HCT (%)", 
        y = "HB (mg/dl)", 
        caption = "Nurse Data 2015")+
    theme_bw()
Figure 21.4: Relationship between HB4 and HCT4 with fillted line, prediction and se intervals”
Code
model %>% 
    emmeans::emmeans(~hct, at = list(hct = c(20, 25, 30, 35)))
 hct emmean     SE df lower.CL upper.CL
  20   6.64 0.0599 48     6.52     6.76
  25   8.38 0.0453 48     8.29     8.47
  30  10.12 0.0671 48     9.98    10.25
  35  11.85 0.1050 48    11.64    12.06

Confidence level used: 0.95 

21.7 Report

Code
report::report(model)
We fitted a linear model (estimated using OLS) to predict hb with hct (formula:
hb ~ hct). The model explains a statistically significant and substantial
proportion of variance (R2 = 0.97, F(1, 48) = 1517.63, p < .001, adj. R2 =
0.97). The model's intercept, corresponding to hct = 0, is at -0.31 (95% CI
[-0.76, 0.13], t(48) = -1.41, p = 0.164). Within this model:

  - The effect of hct is statistically significant and positive (beta = 0.35, 95%
CI [0.33, 0.37], t(48) = 38.96, p < .001; Std. beta = 0.98, 95% CI [0.93,
1.04])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.