The main goal of the package
bigstep
is to allow you to select a regression model using
the stepwise procedure when data is very big, potentially larger than
available RAM in your computer. What is more, the package gives you a
lot of control over how this procedure should look like. At this moment,
you can use one of these functions: stepwise()
,
forward()
, backward()
,
fast_forward()
, multi_backward()
and
combinations of them. They can be treated as blocks from which the whole
procedure of finding the best model is built.
# generate data
set.seed(1)
n <- 200
p <- 20
X <- matrix(rnorm(n * p), ncol = p)
colnames(X) <- paste0("X", 1:p)
y <- 1 + 0.4 * rowSums(X[, c(5, 10, 15, 20)]) + rnorm(n)
First, you have to convert your data to a proper format, an object of
class big
. It can be done using the function
prepare_data()
. In most cases you will only need to specify
a vector of responses, y
, and a matrix of predictors,
X
.
Then, you can use the stepwise procedure with, for example, Akaike
Information Criterion and summary()
to get more information
about the final model. The function get_model()
returns the
object of class lm
.
You can use only one forward step (for example if you want to choose the best predictor).
What is important, results are in the same format as input data
(class big
), so you can use forward()
again or
in combination with other functions (with different criteria if you
like). The pipe (%>%
) operator will be helpful. For
every step the actual number of variables in a model, mean squared error
(MSE) or accuracy (ACC; if you use the logistic
regression) and a value of the chosen criterion (crit) are
given.
It may seem unnecessary or even unjustified for small data, but can be useful if you have a lot of predictors (see the next paragraph).
# generate data
set.seed(1)
n <- 1e3
p <- 1e4
X <- matrix(rnorm(p * n), ncol = p)
colnames(X) <- paste0("X", 1:p)
Xadd <- matrix(rnorm(5 * n), n, 5) # additional variables
colnames(Xadd) <- paste0("Xadd", 1:5)
y <- 0.2 * rowSums(X[, 1000 * (1:10)]) + Xadd[, 1] - 0.1 * Xadd[, 3] + rnorm(n)
If you have a lot o predictors, it can be a good idea to remove those
that are not related with y
. You can do that using
reduce_matrix()
. This function calculates p-values for the
Pearson correlation test and every variable from X
(separately). Variables with p-values larger than minpv
will not be considered in the next steps (formally, they are removed
from candidates
, one of components of class
big
). Thanks to that, the whole stepwise procedure will be
much quicker. What is more, reduce_matrix()
changes the
order of predictors in such a way that at the beginning there will be
variables with the smallest p-values. It is important if you want to use
fast_forward()
.
Another problem is choosing an appropriate criterion to such data. Classical ones like AIC or BIC are bad choice because they will almost certainly select a model with too many variables1. You can use modifications of them like mBIC2, mBIC23, mAIC or mAIC2. In brief, these criteria have much heavier penalty for the number of parameters, so they prefer smaller models than their classic versions.
Additionally, in the example below we add variables from other matrix
to a model (Xadd
). It can be a good idea if there are
predictors which are important for us and want them to remain at every
stage of building the model.
data <- prepare_data(y, X, Xadd = Xadd)
data %>%
reduce_matrix(minpv = 0.15) %>%
stepwise(mbic) ->
results
summary(results)
data %>%
reduce_matrix(0.15) %>%
stepwise(bic) # bad idea...
Sometimes it will be reasonable to start with a model with some good
predictors and then use the stepwise procedure. It can be achieved if we
use fast_forward()
which adds every variable that reduces a
criterion (not necessarily the best one). It is important for that
function to search for variables in a reasonable order (first, the most
correlated with y
), so you should use
fast_forward()
after reduce_matrix()
(you can
set minpv = 1
if you do not want to remove any predictor,
just change the order). It is good idea to run
fast_forward()
with a criterion which does not have a heavy
penalty for the size of a model, so for example BIC is better than mBIC.
After adding a lot of variables, most of them will be useless, so it can
be a good idea to perform the backward elimination—as long as there are
variables reducing a criterion. Run multi_bacward()
to do
that.
If you are lucky, you do not have to run stepwise()
after fast_forward()
and multi_backward()
because you will already have the best model. It is important because
stepwise()
consists of potentially many backward and
forward steps and forward takes most time in whole procedure of building
a model (we have to check every predictor to find the best one). So the
fewer such steps, the faster you will get the model. It can be crucial
if you have big data.
Now, let consider data which is larger than RAM you have in your
computer. It is impossible to read it in a normal way, but in a process
of building regression model it is not necessary to have access to all
predictors at the same time. Instead, you can read only a part of the
matrix X
, check all variables from that part and then read
another one. To do that, you only need to read the matrix X
using read.big.matrix()
from bigmemory
package. The prepare_data()
has a parameter
maxp
which represents the maximum size (that is the number
of elements) of one part. If X
is bigger, it will be split.
It will be done even if your matrix is big but you have enough RAM to
read it in a normal way. It may seem unnecessary, but it is worth to do
because R is not very efficient in dealing with big matrices. Remember
that maxp
cannot be smaller than the number of observations
(rows in X
), by default it is 1e6.
In the code below we assume that you have a big matrix in a file
X.txt. Reading such matrix can be slow, but if you set
backingfile
and descriptorfile
you have to do
that once and next time you can use attach.big.matrix()
which is much faster.
Xbig <- read.big.matrix("X.txt", sep = " ", header = TRUE,
backingfile = "X.bin", descriptorfile = "X.desc")
# Xbig <- attach.big.matrix("X.desc") # much faster
y <- read.table("y.txt")
# data <- prepare_data(y, Xbig) # slow because of checking NA
data <- prepare_data(y, Xbig, na = FALSE) # set if you know that you do not have NA
data %>%
reduce_matrix(minpv = 0.001) %>%
fast_forward(crit = bic, maxf = 50) %>%
multi_backward(crit = mbic) %>%
stepwise(crit = mbic) -> m
summary(m)
We set minpv
to a low value to speed up the whole
procedure, but be careful because you can lose some important
predictors.
You can easily define your own criterion. It can depend on
log-likelihood (loglik
), the number of observations
(n
), the number of all variables (p
), the
number of variables currently in a model (k
), the matrix
with variables which are currently in a model (Xm
) and
constants. These parameters will be sent to your criterion, exactly with
such names (but you do not have to use all).
my_crit <- function(loglik, k, n, c1 = 0.5, c2 = 8) {
-c1*loglik + 10*sqrt(k*c2)
}
m <- reduce_matrix(data, minpv = 0.15) # data from the paragraph "Bigger data"
stepwise(m, crit = my_crit)
stepwise(m, crit = function(loglik, k, n) -0.4*loglik + 10*sqrt(k*8))
Because your criterion has access to a model matrix
(Xm
), you can use untypical ones which depend, for example,
on the average correlation between variables which are in a model.
The package allows you to fit logistic and Poisson models. All you
need to do is setting the parameter type
when you prepare
data. Take note that the reduce_matrix()
always calculates
the Pearson correlation, even if you use general linear models. What’s
more, it is a known fact that even if one is interested in a logistic
regression model, the stepwise regression can be performed based on
linear regression — very often the same variables will be selected (and
the whole procedure is faster).
# Poisson model
set.seed(1)
n <- 50
p <- 1000
X <- matrix(runif(n * p), ncol = p)
colnames(X) <- paste0("X", 1:p)
mu <- rowSums(X[, 100 * (1:5)])
y <- rpois(n, exp(mu))
data1 <- prepare_data(y, X, type = "linear")
data2 <- prepare_data(y, X, type = "poisson")
data1 %>%
reduce_matrix() %>%
stepwise() # did not see any variables
data2 %>%
reduce_matrix() %>%
stepwise()
# logistic model
set.seed(2)
n <- 100
X <- matrix(runif(n * p, -5, 5), ncol = p)
colnames(X) <- paste0("X", 1:p)
mu <- 0.8 * rowSums(X[, 100 * (1:5)])
prob <- 1 /( 1 + exp(-mu))
y <- rbinom(n, 1, prob)
data1 <- prepare_data(y, X, type = "linear")
data2 <- prepare_data(y, X, type = "logistic")
data1 %>%
reduce_matrix() %>%
stepwise()
data2 %>%
reduce_matrix() %>%
stepwise()
M. Bogdan, J.K. Ghosh, M. Zak-Szatkowska. Selecting explanatory variables with the modified version of Bayesian Information Criterion. Quality and Reliability Engineering International, 24:989–999, 2008.↩︎
M. Bogdan, J.K. Ghosh, R.W. Doerge. Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting quantitative trait loci. Genetics, 167:989–999, 2004.↩︎
F. Frommlet, A. Chakrabarti, M. Murawska, M. Bogdan. Asymptotic Bayes optimality under sparsity for general distributions under the alternative, Technical report, arXiv:1005.4753v2, 2011.↩︎