StandardizedPredictors
This package provides convenient and modular functionality for standardizing regression predictors. Standardizing predictors can increase numerical stability of some estimation procedures when the predictors are on very different scales or when they are non-orthogonal. It can also produce more interpretable regression models in the presence of interaction terms.
The examples below demonstrate the use of StandardizedPredictors.jl with GLM.jl, but they will work with any modeling package that is based on the StatsModels.jl formula.
Centering
Let's consider a (slightly) synthetic dataset of weights for adolescents of different ages, with predictors age
(continuous, from 13 to 20) and sex
, and weight
in pounds. The weights are based loosely on the medians from the CDC growth charts, which show that the median male and female both start off around 100 pounds at age 13, but by age 20 the median male weighs around 155 pounds while the median female weighs around 125 pounds.
julia> using StandardizedPredictors, DataFrames, StatsModels, GLM, StableRNGs
julia> rng = StableRNG(1);
julia> data = DataFrame(age=[13:20; 13:20],
sex=repeat(["male", "female"], inner=8),
weight=[range(100, 155; length=8); range(100, 125; length=8)] .+ randn(rng, 16))
16×3 DataFrame
Row │ age sex weight
│ Int64 String Float64
─────┼─────────────────────────
1 │ 13 male 99.4675
2 │ 14 male 107.956
3 │ 15 male 116.467
4 │ 16 male 122.728
5 │ 17 male 129.415
6 │ 18 male 139.016
7 │ 19 male 148.175
8 │ 20 male 155.676
9 │ 13 female 100.082
10 │ 14 female 103.818
11 │ 15 female 105.642
12 │ 16 female 111.043
13 │ 17 female 112.433
14 │ 18 female 117.52
15 │ 19 female 121.464
16 │ 20 female 125.232
In this dataset, there's obviously a main effect of sex: males are heavier than females for every age except 13 years. But if we run a basic linear regression, we see something rather different:
julia> lm(@formula(weight ~ 1 + sex * age), data)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
weight ~ 1 + sex + age + sex & age
Coefficients:
──────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept) 52.9701 2.5343 20.90 <1e-10 47.4483 58.4918
sex: male -56.9962 3.58404 -15.90 <1e-08 -64.8052 -49.1873
age 3.58693 0.152134 23.58 <1e-10 3.25545 3.9184
sex: male & age 4.37602 0.21515 20.34 <1e-09 3.90725 4.84479
──────────────────────────────────────────────────────────────────────────────
There is a main effect of sex but it goes in the exact opposite direction of what we know to be true, and says that males are 55 pounds lighter. The reason is that because there's an interaction between sex and age in this model, the main effect of sex the (extrapolated) difference in weight between sexes when age is 0.
That's a non-sensical value, since it's far outside of our range of ages. When we Center
age, we get something more meaningful:
julia> lm(@formula(weight ~ 1 + sex * age), data; contrasts=Dict(:age => Center()))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
weight ~ 1 + sex + age(centered: 16.5) + sex & age(centered: 16.5)
Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 112.154 0.348583 321.74 <1e-24 111.395 112.914
sex: male 15.2081 0.492971 30.85 <1e-12 14.134 16.2822
age(centered: 16.5) 3.58693 0.152134 23.58 <1e-10 3.25545 3.9184
sex: male & age(centered: 16.5) 4.37602 0.21515 20.34 <1e-09 3.90725 4.84479
──────────────────────────────────────────────────────────────────────────────────────────────
We can also center age at a different value, like the start of our range where the difference is essentially zero:
julia> lm(@formula(weight ~ 1 + sex * age), data; contrasts=Dict(:age => Center(13)))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
weight ~ 1 + sex + age(centered: 13) + sex & age(centered: 13)
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 99.6001 0.636422 156.50 <1e-20 98.2134 100.987
sex: male -0.107954 0.900037 -0.12 0.9065 -2.06897 1.85306
age(centered: 13) 3.58693 0.152134 23.58 <1e-10 3.25545 3.9184
sex: male & age(centered: 13) 4.37602 0.21515 20.34 <1e-09 3.90725 4.84479
────────────────────────────────────────────────────────────────────────────────────────────