21 Single Continuous Variable

This section demonstrates the use of simple linear regression to describe the relationship between two continuous variables. We begin by importing the blood data and summarizing it:

Code

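library(tidyverse)  # provides read_csv(), select(), %>% and ggplot()

# Import the blood data, keeping hemoglobin (hb) and hematocrit (hct)
df_blood <- 
    read_csv("C:/Dataset/blood.csv") %>% 
    select(hb, hct)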

df_blood %>% 
    summarytools::dfSummary(graph.col = FALSE, labels.col = FALSE)

Data Frame Summary  
df_blood  
Dimensions: 50 x 2  
Duplicates: 0  

-----------------------------------------------------------------------------------
No   Variable    Stats / Values           Freqs (% of Valid)   Valid      Missing  
---- ----------- ------------------------ -------------------- ---------- ---------
1    hb          Mean (sd) : 8.2 (1.8)    36 distinct values   50         0        
     [numeric]   min < med < max:                              (100.0%)   (0.0%)   
                 5.3 < 7.7 < 12                                                    
                 IQR (CV) : 2.6 (0.2)                                              

2    hct         Mean (sd) : 24.4 (5.1)   45 distinct values   50         0        
     [numeric]   min < med < max:                              (100.0%)   (0.0%)   
                 15.7 < 23 < 35                                                    
                 IQR (CV) : 7.4 (0.2)                                              
-----------------------------------------------------------------------------------

21.1 Plotting

We begin by plotting the distribution of each variable.

Code

df_blood %>% 
    pivot_longer(cols = c(hb, hct)) %>% 
    ggplot(aes( x = value)) +
    geom_histogram(bins = 8, fill = "gold", color = "black")+
    facet_wrap(facets = "name", scales = "free") +
    theme_bw()

Figure 21.1: Distribution of hemoglobin and hematocrit

Next, we plot hb against hct and note that the relationship is linear.

Code

df_blood %>% 
    ggplot(aes(x = hct, y = hb)) +
    geom_point()+
    geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
    theme_bw()

Figure 21.2: Relationship between hemoglobin and hematocrit

21.2 Assumptions

Linearity: The relationship between the independent variable (X) and the dependent variable (Y) is linear.
Independence: The observations are independent of each other.
Homoscedasticity: The residuals (errors) have constant variance at every level of X.
Normality: The residuals of the model are normally distributed.
No multicollinearity: The independent variables are not too highly correlated with each other. This matters only in multiple regression; with a single predictor, as here, it is not a concern.
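
The assumptions above can be stated compactly through the model equation. The notation below ($\beta_0$, $\beta_1$, $\varepsilon_i$, $\sigma^2$) is introduced here for illustration and does not come from the original text:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad i = 1, \dots, n
```

In this chapter $y$ is hb and $x$ is hct. Linearity is the straight-line mean $\beta_0 + \beta_1 x_i$, while independence, homoscedasticity and normality are all statements about the error term $\varepsilon_i$.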

21.3 Model fitting

Code

model <- 
    df_blood %>% 
    lm(hb ~ hct, data = .)
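
With a single predictor, the least-squares estimates returned by `lm()` have a simple closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the point of means. The sketch below checks this on simulated data. The data frame `sim` and its values are made up for illustration and are not the blood data.

```r
# Simulated stand-in data (hypothetical; NOT the blood dataset)
set.seed(1)
sim <- data.frame(hct = runif(50, min = 15, max = 35))
sim$hb <- -0.3 + 0.35 * sim$hct + rnorm(50, sd = 0.3)

# Closed-form least-squares estimates for one predictor
slope     <- cov(sim$hct, sim$hb) / var(sim$hct)
intercept <- mean(sim$hb) - slope * mean(sim$hct)

# The same estimates from lm()
fit <- lm(hb ~ hct, data = sim)

# Both rows should agree up to floating-point error
rbind(
    closed_form = c(intercept = intercept, slope = slope),
    lm          = unname(coef(fit))
)
```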

21.4 Visualising model

21.4.1 R base `summary`

Code

model %>% summary()


Call:
lm(formula = hb ~ hct, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77666 -0.17021  0.02036  0.16771  0.63128 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.314403   0.222547  -1.413    0.164    
hct          0.347682   0.008925  38.957   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3184 on 48 degrees of freedom
Multiple R-squared:  0.9693,    Adjusted R-squared:  0.9687 
F-statistic:  1518 on 1 and 48 DF,  p-value: < 2.2e-16
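
The quantities in this output are linked by simple arithmetic, which is a useful sanity check when reading it. The sketch below re-derives the t value, the F-statistic and the p-value for `hct` from the printed estimate and standard error. The numbers are copied by hand from the output above, so the results agree only up to rounding.

```r
# Values copied from the summary() output above
estimate  <- 0.347682   # slope for hct
std_error <- 0.008925   # its standard error
df_resid  <- 48         # residual degrees of freedom

# t value = estimate / standard error
t_value <- estimate / std_error

# With a single predictor, the overall F-statistic equals the squared t value
f_value <- t_value^2

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

c(t = t_value, F = f_value, p = p_value)
```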

21.4.2 `tab_model`

Code

model %>% sjPlot::tab_model()

	hb
Predictors	Estimates	CI	p
(Intercept)	-0.31	-0.76 – 0.13	0.164
hct	0.35	0.33 – 0.37	<0.001
Observations	50
R² / R² adjusted	0.969 / 0.969
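
The confidence intervals in this table are the usual t-based intervals: estimate ± critical value × standard error. A minimal sketch reproducing them, with estimates and standard errors copied by hand from the `summary()` output (so agreement is up to rounding):

```r
# Estimates and standard errors copied from the summary() output
est <- c(intercept = -0.314403, hct = 0.347682)
se  <- c(intercept =  0.222547, hct = 0.008925)

# Critical value of the t distribution for a 95% interval, 48 residual df
t_crit <- qt(0.975, df = 48)

# Lower and upper limits, one row per coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 2)
```

With the fitted model in memory, `confint(model)` should give the same limits without the hand-copying.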

21.4.3 `tidy`

Code

model %>% broom::tidy() %>% kableExtra::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	-0.3144029	0.2225472	-1.412747	0.1641817
hct	0.3476823	0.0089248	38.956816	0.0000000

21.4.4 `tbl_uvregression`

Code

df_blood %>% 
    gtsummary::tbl_uvregression(
        y = hb,
        method = "lm"
    )

Characteristic	N	Beta	95% CI¹	p-value
hct	50	0.35	0.33, 0.37	<0.001
¹ CI = Confidence Interval

21.5 Checking Assumptions

We see no significant violation of the model assumptions.

Code

performance::check_model(model)

Figure 21.3: Model Assumptions of simple linear regression

21.6 Prediction interval

Code

model %>% 
    predict(interval = "prediction") %>% 
    as_tibble() %>% 
    bind_cols(df_blood) %>% 
    ggplot(aes(x = hct, y = hb)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
    geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") +
    geom_line(aes(y = upr), col = "coral2", linetype = "dashed") +
    labs(
        x = "HCT (%)", 
        y = "HB (g/dl)", 
        caption = "Nurse Data 2015")+
    theme_bw()

Figure 21.4: Relationship between HB and HCT with fitted line, prediction and SE intervals
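
The figure shows two bands of different width. Assuming the standard simple-regression formulas (stated here for reference; they do not appear in the original text), the 95% confidence interval for the mean response at $x_0$ and the 95% prediction interval for a new observation at $x_0$ are, respectively:

```latex
\hat{y}_0 \pm t_{n-2,\,0.975}\; s
\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
\qquad\text{and}\qquad
\hat{y}_0 \pm t_{n-2,\,0.975}\; s
\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
```

Here $s$ is the residual standard error (0.3184 for this model) and $n = 50$. The extra 1 under the square root accounts for the scatter of individual observations around the line, which is why the dashed prediction band is much wider than the shaded band drawn by `geom_smooth()`.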

Code

model %>% 
    emmeans::emmeans(~hct, at = list(hct = c(20, 25, 30, 35)))

 hct emmean     SE df lower.CL upper.CL
  20   6.64 0.0599 48     6.52     6.76
  25   8.38 0.0453 48     8.29     8.47
  30  10.12 0.0671 48     9.98    10.25
  35  11.85 0.1050 48    11.64    12.06

Confidence level used: 0.95
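
Each estimated marginal mean in this table is the fitted line evaluated at the chosen `hct` value. A short sketch reproducing the `emmean` column by hand, using coefficients copied from the model output (so agreement is up to rounding):

```r
# Coefficients copied from the fitted model output
b0 <- -0.3144029   # intercept
b1 <-  0.3476823   # slope for hct

# hct values used in the emmeans() call
hct_new <- c(20, 25, 30, 35)

# Fitted value: hb = b0 + b1 * hct
hb_hat <- b0 + b1 * hct_new
round(hb_hat, 2)
```

With the fitted model in memory, `predict(model, newdata = data.frame(hct = hct_new))` should return the same values.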

21.7 Report

Code

report::report(model)

We fitted a linear model (estimated using OLS) to predict hb with hct (formula:
hb ~ hct). The model explains a statistically significant and substantial
proportion of variance (R2 = 0.97, F(1, 48) = 1517.63, p < .001, adj. R2 =
0.97). The model's intercept, corresponding to hct = 0, is at -0.31 (95% CI
[-0.76, 0.13], t(48) = -1.41, p = 0.164). Within this model:

  - The effect of hct is statistically significant and positive (beta = 0.35, 95%
CI [0.33, 0.37], t(48) = 38.96, p < .001; Std. beta = 0.98, 95% CI [0.93,
1.04])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.
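
In a regression with one predictor, the standardized slope equals the Pearson correlation between the two variables, which is the square root of R² carrying the sign of the slope. A quick check of the reported Std. beta, using values copied from the model output:

```r
# Values copied from the model output
r_squared <- 0.9693      # Multiple R-squared
slope     <- 0.3476823   # unstandardized slope for hct

# Standardized slope = correlation = sign(slope) * sqrt(R^2)
std_beta <- sign(slope) * sqrt(r_squared)
round(std_beta, 2)
```

This shortcut holds only for a single predictor. With several predictors the standardized coefficients no longer equal simple correlations.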
