GA-CCRi Analytical Development Services

Incanter and the GLM

I read somewhere that the Generalized Linear Model is the "workhorse of statistics" though I cannot seem to find the reference anymore.  The workhorse of statistics is so called because it unifies regression for the exponential family of probability distributions which includes Gaussian, Binomial, and Poisson distributions.  Instead of modeling the mean of the response variable, GLM models a continuous, differentiable transformation of the mean as a linear model of the predictor variables.  This transformation is called the link function and is unique for each distribution in the exponential family.  Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation.  In particular, iteratively reweighted least squares of the likelihood function has been shown to converge on the MLE.

To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm.  If we assume that we know the link function (and its inverse, derivative, and the weight function), then IRLS is implemented as follows:

(defn irls [y X B invlink dlink weight eps]
(let [
_irls (fn [Bnext]
(let [
eta (mmult X Bnext)
mu (invlink eta)
z (plus eta (mult (minus y mu) (dlink mu)))
W (diag (weight mu))]
(mmult
(solve (mmult (trans X) W X))
(trans X) W z)))
]
(last
(last
(take-while
(fn [x] (> (euclidean-distance
(first x)
(last x)) eps))
(partition 2 1 (iterate _irls B)))))))

In the above code, we define the update step as an internal function of the updated coefficients variable.  Then, we iterate over an infinite sequence of updates until the condition that the euclidean distance between successive iterations is less than epsilon.

Next, we need to define the link functions and other associated functions of each member of the exponential family of distributions.  I have shown Gaussian and Binomial distributions below:

(defstruct family :link :invlink :dlink :weight)
(def families
{
:gaussian (struct-map family
:link (fn [x] x)
:invlink (fn [x] x)
:dlink (fn [x] 1)
:weight (fn [mu] (repeat (length mu) 1)))
:binomial (struct-map family
:link (fn [x] (log (div x (minus 1 x))))
:invlink (fn [x] (div (exp x) (plus 1 (exp x))))
:dlink (fn [x] [x] (div 1 (mult x (minus 1 x))))
:weight (fn [mu] (to-vect (mult mu (minus 1 mu)))))
})

I have used the struct-map technique from Clojure which gives me a sort of family type.  Additional families would be specified here.  Now, similar to R, we can pass the family type to a general GLM function and have one estimation technique (the IRLS defined above) for all families.  The GLM function is shown:

(defn glm
([y X & opts]
(let [opts (when opts (apply assoc {} opts))
family (or (families (:family opts))
(:gaussian families))
intercept (or (:intercept opts) true)
eps (or (:eps opts) 0.01)
bstart (:bstart opts)]
(irls y
X
bstart (:invlink family)
(:dlink family)
(:weight family)
eps))))

The GLM function simply delegates to the IRLS function with the distribution specific link, inverse link, etc functions.

To test the GLM, I used the example from the Incanter linear-model documentation:

user=> (use '(incanter core stats datasets charts))
nil
user=> (def iris
(to-matrix (get-dataset :iris) :dummies true))
#'user/iris
user=> (def y (sel iris :cols 0))
#'user/y
user=> (def x (sel iris :cols (range 1 6)))
#'user/x
user=> (def iris-lm (linear-model y x))
#'user/iris-lm
user=> (:coefs iris-lm)
(2.171266292153149 0.4958889383890437 0.8292439122349187 -0.31515517332664444 -1.0234978144907245 -0.7235619577805039)

Now, does the GLM with the Gaussian family give the same coefficients?  First, we add an intercept column.

user=> (def x (bind-columns (repeat 150 1) x))
#'user/x
user=> (glm y x
:bstart (matrix [1 1 1 1 1 1])
:family :gaussian)
[ 2.1713
0.4959
0.8292
-0.3152
-1.0235
-0.7236]

Finally, to test the binomial family, I used the "infert" dataset from R:

user=> (def sp (matrix [2 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 2 1 1 2 2 2 2 0 1 0 0 2 0 2 1 2 0 1 2 0 0 1 0 0 2 0 0 2 2 2 1 1 2 2 0 2 1 2 2 1 1 2 0 1 1 2 2 0 0 1 1 2 2 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 2 0 1 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 0 0 1 1 0 0 0 2 0 2 0 1 0 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 1 0 1 1 1 0 0 1 1]))
#'user/sp
user=> (def in (matrix [1 1 2 2 1 2 0 0 0 0 1 2 1 2 1 2 2 0 2 0 0 2 0 0 1 0 0 0 1 2 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 2 2 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 2 0 0 2 0 2 0 2 1 0 2 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 1 0 0 0 0 0 2 1 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 2 0 0 0 0 0 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 1 2 1 1 2 2 2 0 1 0 2 1 0 1 1 1 0 1 0 1 0 2 0 1 0 1 0 0 1 1 0 0 0 0 2 0 0]))
#'user/in
user=> (def case (matrix [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]))
#'user/case
user=> (def X (bind-columns (repeat (length sp) 1) sp in))
#'user/X
user=> (glm case X
:bstart (matrix [0 1 1])
:family :binomial
:eps 0.001)
[-1.7078
1.1972
0.4182]

 

Go Back