In this paper, we propose a second-order optimization method to learn models
where both the dimensionality of the parameter space and the number of training
samples are high. In our method, we construct on each iteration a Krylov
subspace formed by the gradient and an approximation to the Hessian matrix, and
then use a subset of the training data samples to optimize over this subspace.
As with the Hessian-Free (HF) method of [7], the Hessian matrix is never
explicitly constructed; only its products with vectors are computed, using a
subset of the data.
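
To make the iteration described above concrete, the following is a minimal sketch under stated assumptions and not the implementation used in the paper. It assumes a least-squares loss, so that Hessian-vector products have a simple closed form, and it takes the gradient over the full data set. The function names (`krylov_basis`, `subspace_descent`, `ksd_iteration`), the batch sizes, and the inner solver (steepest descent with exact line search, chosen only for brevity) are all hypothetical choices made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, u, v):
    """Return alpha * u + v for plain Python lists."""
    return [alpha * a + b for a, b in zip(u, v)]

def loss(w, data):
    """Mean of 0.5 * (x . w - y)^2 over (x, y) pairs (assumed loss)."""
    return 0.5 * sum((dot(x, w) - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    g = [0.0] * len(w)
    for x, y in data:
        g = axpy(dot(x, w) - y, x, g)
    return [gi / len(data) for gi in g]

def hess_vec(v, data):
    """Hessian-vector product (1/n) * sum x (x . v); H is never formed."""
    hv = [0.0] * len(v)
    for x, _ in data:
        hv = axpy(dot(x, v), x, hv)
    return [h / len(data) for h in hv]

def krylov_basis(g, hess_batch, k):
    """Orthonormal basis of span{g, Hg, ..., H^(k-1) g} (Gram-Schmidt)."""
    basis, v = [], g
    for _ in range(k):
        for b in basis:
            v = axpy(-dot(v, b), b, v)
        norm = dot(v, v) ** 0.5
        if norm < 1e-10:  # subspace stopped growing
            break
        v = [vi / norm for vi in v]
        basis.append(v)
        v = hess_vec(v, hess_batch)  # next Krylov direction
    return basis

def subspace_descent(w, basis, opt_batch, steps=20):
    """Minimise the batch loss over w + span(basis).

    Assumption: steepest descent with exact line search, which is only
    valid because the illustrative loss is quadratic.
    """
    for _ in range(steps):
        g = grad(w, opt_batch)
        d = [0.0] * len(w)
        for b in basis:  # negative gradient projected onto the subspace
            d = axpy(-dot(g, b), b, d)
        curv = dot(d, hess_vec(d, opt_batch))
        if curv <= 1e-12:
            break
        w = axpy(-dot(g, d) / curv, d, w)
    return w

def ksd_iteration(w, data, k, rng):
    """One outer iteration; batch fractions are arbitrary assumptions."""
    hess_batch = rng.sample(data, len(data) // 4)
    opt_batch = rng.sample(data, len(data) // 2)
    basis = krylov_basis(grad(w, data), hess_batch, k)
    return subspace_descent(w, basis, opt_batch)

# Small synthetic demo: noise-free linear targets.
rng = random.Random(0)
dim, n = 4, 400
w_true = [rng.gauss(0, 1) for _ in range(dim)]
data = []
for _ in range(n):
    x = [rng.gauss(0, 1) for _ in range(dim)]
    data.append((x, dot(x, w_true)))
w = [0.0] * dim
print("initial loss:", loss(w, data))
for _ in range(5):
    w = ksd_iteration(w, data, k=2, rng=rng)
print("final loss:", loss(w, data))
```

In this sketch the subspace dimension `k` is kept much smaller than the parameter dimension would be in practice, so the inner problem stays cheap, and the Hessian enters only through `hess_vec` evaluated on a subset of the samples.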