My Project
Documentation: https://frabazz.github.io/lbfgs-FFNN/
This project implements advanced Quasi-Newton optimization methods, specifically L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) and its stochastic variant S-LBFGS, designed for large-scale minimization problems.
L-BFGS is a Quasi-Newton method that stores a limited history of $m$ updates to approximate the inverse Hessian.
A standard stochastic gradient descent implementation is provided as a baseline for comparison.
The S-LBFGS implementation follows the algorithm proposed by Moritz et al. (2016). It effectively integrates curvature information into stochastic optimization using a stable Hessian update and variance reduction.
$$ v_t = \nabla f_{S_t}(w_t) - \nabla f_{S_t}(\tilde{w}) + \mu $$
Here $\mu = \nabla F(\tilde{w})$ is the full gradient evaluated at the snapshot point $\tilde{w}$, so $v_t$ is an unbiased estimator of $\nabla F(w_t)$ whose variance shrinks as $w_t \to \tilde{w}$.
$$ \nabla^2 F(w) \cdot s \approx \frac{\nabla F(w + \epsilon s) - \nabla F(w - \epsilon s)}{2\epsilon} $$
This avoids forming the Hessian matrix while capturing the curvature in the direction $s$.
The codebase is split into two concrete backends that share the same high-level flow (define a network, compute loss/gradients, and run an optimizer), but use different data structures and kernels:
The CPU (Eigen) backend:

- src/minimizer/ hosts shared minimizer utilities and interfaces (full-batch and stochastic base classes, ring buffer); the concrete optimizers are src/minimizer/lbfgs.hpp, src/minimizer/bfgs.hpp, src/minimizer/gd.hpp, src/minimizer/s_gd.hpp, src/minimizer/s_lbfgs.hpp, and src/minimizer/newton.hpp.
- src/network.hpp and src/layer.hpp implement a dense MLP with flat parameter storage and Eigen-based forward/backward passes.
- src/unified_optimization.hpp, src/unified_launcher.hpp, and src/network_wrapper.hpp provide backend configuration, dataset setup, and optimizer strategies.

The CUDA backend:

- src/cuda/device_buffer.cuh wraps device memory, src/cuda/cublas_handle.cuh manages cuBLAS, and src/cuda/kernels.cuh hosts custom kernels (activation, loss, etc.).
- src/cuda/network.cuh and src/cuda/layer.cuh implement a GPU MLP: parameters and gradients live in contiguous device buffers, per-layer activations/deltas are allocated per batch, and GEMMs use cuBLAS.
- src/cuda/minimizer_base.cuh defines a CUDA optimizer interface; src/cuda/gd.cuh, src/cuda/sgd.cuh, and src/cuda/lbfgs.cuh implement the training steps, with optional history tracking via src/iteration_recorder.hpp.
- src/unified_optimization.hpp, src/unified_launcher.hpp, and src/network_wrapper.hpp provide the CUDA specializations, compiled under __CUDACC__.
We use CMake for build configuration.
If you prefer a Docker setup, we provide a Docker image that installs all dependencies and lets you build and run inside the container; see enviroment/README.md.
Note: if CUDA is enabled but no CUDA compiler is found, the CUDA targets are skipped.
A standalone CMake project is available under tests/burgers so that the main project build stays unchanged. This target requires Clang and the Enzyme plugin.
Before building the Burgers tests, make sure Enzyme is built, either by following the official guide (https://enzyme.mit.edu/Installation/) or by building the copy bundled under lib/Enzyme/enzyme.
We tested with clang++-19, but any Clang >= 14 should work.