
Optimizing Training Loops for Market Forecasting Models

Training a model on financial time series is not the same as training on ImageNet. The data is noisy, non-stationary, and the signal-to-noise ratio is brutally low. Standard deep learning recipes can silently produce models that memorize regimes instead of learning transferable dynamics.

Learning Rate Schedules That Respect Regime Boundaries

Cosine annealing with warm restarts is our default. The restart boundaries matter: we align them with known regime transitions in the training data (volatility regime changes, major structural breaks) rather than using fixed epoch counts. This prevents the optimizer from settling into a local minimum shaped by one regime.

We also use a lower peak learning rate than typical NLP fine-tuning. Financial time series have weaker gradients on average because the useful signal is small relative to noise. A learning rate that works for language model fine-tuning will often blow past the shallow loss landscape features that matter for market forecasting.
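
As a concrete sketch in PyTorch: the restart boundaries and the 3e-5 peak rate below are illustrative values, not a fixed recipe, and in practice the boundaries come from a volatility-regime or structural-break detector on the training data.

    import math
    import torch

    # Cosine annealing whose restarts land on regime boundaries rather than
    # fixed epoch counts. Boundary steps here are hypothetical.
    regime_boundaries = [0, 12_000, 31_000, 48_000, 70_000]  # optimizer steps

    def regime_cosine(step: int) -> float:
        """LR multiplier in [min_frac, 1] that restarts at each boundary."""
        min_frac = 0.05
        for start, end in zip(regime_boundaries, regime_boundaries[1:]):
            if start <= step < end:
                progress = (step - start) / (end - start)
                return min_frac + 0.5 * (1 - min_frac) * (1 + math.cos(math.pi * progress))
        return min_frac  # past the last boundary: hold at the floor

    model = torch.nn.Linear(64, 1)  # stand-in model
    opt = torch.optim.AdamW(model.parameters(), lr=3e-5)  # low peak LR
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=regime_cosine)
    # Call sched.step() once per optimizer step so `step` counts updates, not epochs.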

Gradient Accumulation Over Longer Windows

Market patterns often span hundreds or thousands of bars. A single batch of 32 sequences might not contain enough diversity to produce a meaningful gradient. We use gradient accumulation across 4-8 micro-batches before each optimizer step, effectively training with batch sizes of 128-256 sequences without the memory cost.
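
In PyTorch terms the accumulation loop looks roughly like this; the model, data, and loss below are stand-ins:

    import torch
    from torch import nn

    # Stand-in forecaster and synthetic micro-batches of 32 sequences each.
    model = nn.GRU(input_size=8, hidden_size=32, batch_first=True)
    head = nn.Linear(32, 1)
    opt = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=3e-5)
    loss_fn = nn.L1Loss()
    accum_steps = 4  # 4 micro-batches of 32 -> effective batch of 128

    micro_batches = [(torch.randn(32, 128, 8), torch.randn(32, 1)) for _ in range(8)]

    opt.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(micro_batches):
        out, _ = model(x)
        pred = head(out[:, -1])                # forecast from the last timestep
        loss = loss_fn(pred, y) / accum_steps  # scale so the sum matches one large batch
        loss.backward()                        # gradients accumulate in .grad buffers
        if (i + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad(set_to_none=True)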

The key insight: accumulated gradients should span different market conditions within the same update step. We construct batches so that each accumulation window includes sequences from different volatility regimes and different assets. This produces gradient updates that generalize rather than overfit to the current regime.
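
One way to get that mixing, sketched with hypothetical record fields ("asset", "regime") rather than our actual data schema:

    import random
    from collections import defaultdict

    # Regime-stratified batch construction. Each record is assumed to be a
    # dict carrying "asset" and "regime" keys alongside the sequence itself.
    def stratified_micro_batches(records, accum_steps=4, micro_batch=32):
        buckets = defaultdict(list)
        for rec in records:
            buckets[(rec["asset"], rec["regime"])].append(rec)
        keys = list(buckets.keys())
        for _ in range(accum_steps):
            # Draw each sequence from a random (asset, regime) bucket so every
            # accumulation window spans different market conditions.
            yield [random.choice(buckets[random.choice(keys)])
                   for _ in range(micro_batch)]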

Early Stopping on the Right Metric

Validation loss is not the right early stopping metric for trading models. A model can achieve low MAE by predicting "no change" with high confidence — which is statistically correct most of the time but useless for trading.

We stop training based on a composite metric:

  • Quantile calibration: do the 10th/50th/90th percentile forecasts achieve their nominal coverage on held-out windows?
  • Directional accuracy on significant moves: when the model predicts a move above a threshold, how often is it right about direction?
  • Return-weighted MAE: errors on large-move bars are penalized more heavily because those are the bars that drive PnL.

This composite metric catches the failure mode where validation loss improves but forecast quality for trading degrades.
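
In NumPy, a sketch of such a composite might look like the following; the weights and the move threshold are judgment calls we tune per strategy, not fixed constants:

    import numpy as np

    def composite_score(q10, q50, q90, actual, move_threshold=0.002):
        """Higher is better; all inputs are held-out forecast/outcome arrays."""
        # Quantile calibration: the fraction of outcomes below each quantile
        # forecast should match its nominal level (0.10 / 0.50 / 0.90).
        cal_err = (abs(np.mean(actual < q10) - 0.10)
                   + abs(np.mean(actual < q50) - 0.50)
                   + abs(np.mean(actual < q90) - 0.90))

        # Directional accuracy, counted only on significant predicted moves.
        big = np.abs(q50) > move_threshold
        dir_acc = np.mean(np.sign(q50[big]) == np.sign(actual[big])) if big.any() else 0.0

        # Return-weighted MAE: large-move bars dominate the penalty.
        w = np.abs(actual)
        rw_mae = np.sum(w * np.abs(q50 - actual)) / max(np.sum(w), 1e-12)

        return dir_acc - 5.0 * cal_err - 10.0 * rw_mae  # illustrative weights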

Mixed Precision With Caution

We train in bfloat16 mixed precision for speed, but we keep the loss computation and certain attention layers in full precision. Financial time series can have very small differences between values (e.g., consecutive close prices differ by 0.01%). Half-precision arithmetic can round these away entirely, destroying the signal the model needs to learn.

The practical rule: any layer that computes differences, ratios, or percentage changes between input values should stay in float32.
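
A sketch of how that rule plays out with torch.autocast; ReturnsLayer is a stand-in for any op that differences or ratios raw prices:

    import torch
    from torch import nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    class ReturnsLayer(nn.Module):
        """Percent changes from raw prices, forced to full precision."""
        def forward(self, prices: torch.Tensor) -> torch.Tensor:
            # Consecutive prices can differ by ~0.01%; bf16 rounds such small
            # differences away, so compute the ratio with autocast disabled.
            with torch.autocast(device_type=device, enabled=False):
                p = prices.float()
                return p[:, 1:] / p[:, :-1] - 1.0

    model = nn.Linear(127, 1).to(device)  # stand-in forecaster over 127 returns
    prices = (1 + 1e-4 * torch.randn(32, 128)).cumprod(dim=1).to(device)
    target = torch.randn(32, 1, device=device)

    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        pred = model(ReturnsLayer()(prices))  # bulk compute stays in bf16
    loss = nn.functional.l1_loss(pred.float(), target.float())  # loss in fp32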

Checkpoint Selection Beyond Best Validation

We save checkpoints every N steps and evaluate each on a separate test window that the training loop never sees. The final deployed model is not necessarily the one with the lowest validation loss — it is the one with the best risk-adjusted simulated PnL on the test window.

We use a rolling window scheme: the test window advances forward in time as we retrain, so we are always evaluating on genuinely unseen data.
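
Schematically, checkpoint selection looks like this; the loader and backtest callables are placeholders for our own tooling, not a public API:

    from pathlib import Path
    from typing import Callable

    def select_checkpoint(ckpt_dir: str,
                          load_model: Callable,
                          simulate_pnl: Callable,
                          sharpe: Callable,
                          test_window) -> Path:
        """Pick the checkpoint with the best risk-adjusted simulated PnL."""
        best_path, best_score = None, float("-inf")
        for path in sorted(Path(ckpt_dir).glob("step_*.pt")):
            model = load_model(path)                # deserialize the checkpoint
            pnl = simulate_pnl(model, test_window)  # trade on never-seen data
            score = sharpe(pnl)                     # risk-adjusted, not val loss
            if score > best_score:
                best_path, best_score = path, score
        return best_path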

Data Ordering and Curriculum

We do not shuffle sequences randomly across the full training set. Instead we use a mild curriculum: the model sees recent data more frequently than old data, with exponential decay weighting. This biases the model toward current market dynamics without completely forgetting historical patterns.

The decay rate is a hyperparameter we tune per asset class. Crypto markets change faster than equity markets, so the decay is steeper for crypto training runs.
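
A sketch of that sampling scheme, with half-life as the tunable decay knob; the half-life values below are illustrative, not our production settings:

    import numpy as np

    def sample_indices(ages_days: np.ndarray, n: int, half_life_days: float) -> np.ndarray:
        """Sample sequence indices with recency weights that halve per half-life."""
        w = np.power(0.5, ages_days / half_life_days)
        return np.random.choice(len(ages_days), size=n, replace=True, p=w / w.sum())

    ages = np.arange(0, 1500, dtype=float)  # 0 = most recent, ~4 years of daily bars
    crypto_idx = sample_indices(ages, n=256, half_life_days=180)  # steeper decay
    equity_idx = sample_indices(ages, n=256, half_life_days=720)  # gentler decay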

Summary

The training loop for market models needs to respect the unique properties of financial data: low signal-to-noise, regime non-stationarity, and the fact that the downstream objective (PnL) is only loosely correlated with standard regression losses. Every component — learning rate schedule, batch construction, early stopping, precision, checkpoint selection, and data ordering — needs to be tuned with these properties in mind.

BitBank applies these training techniques to produce live forecasts. Check the BitBank dashboard for current predictions and browse the tech blog for more implementation notes.