RL vs HuggingFace Trainer vs Custom Qwen Fine-Tuning: What Actually Works for Trading
2026-03-19
We ran all three approaches through the same market simulator on the same crypto data (BTC, ETH, SOL). The results were not what most ML-for-trading blog posts would lead you to expect.
The Three Approaches
1. Reinforcement Learning (PPO via Stable-Baselines3 / GymRL) — A PPO agent learns portfolio allocations directly from a gymnasium environment. The reward function incorporates log returns, turnover penalties, drawdown penalties, and trading costs.
2. HuggingFace-style Transformer Trainer — A custom TransformerTradingModel trained with our HF-style trainer loop. The model takes 60 bars of OHLCV data, produces action logits (buy/hold/sell), price predictions, and allocation signals.
3. Custom Qwen Fine-Tuning (LoRA) — Fine-tunes Qwen3.5-0.8B with LoRA adapters on chat-format trading plan data. The model receives market context and generates structured trading plans with action, confidence, entry, stop, and target.
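To make the RL setup concrete, here is a minimal pure-Python sketch of a gym-style trading environment: observations are a trailing price window, actions are portfolio weights, and the reward is a shaped log return. This is an illustration, not the repo's actual GymRL environment; the class name, window size, and cost model are assumptions.

```python
import numpy as np

class ToyTradingEnv:
    """Hypothetical gym-style trading env sketch (not the repo's GymRL env).
    Observations: trailing price window (the real env builds a richer
    feature cube). Actions: long-only portfolio weights. Reward: portfolio
    log return minus a turnover cost."""

    def __init__(self, prices, window=60, cost_bps=10.0):
        self.prices = np.asarray(prices, dtype=float)  # shape (T, n_assets)
        self.window = window
        self.cost = cost_bps / 10_000.0
        self.t = window
        self.prev_w = np.zeros(self.prices.shape[1])

    def reset(self):
        self.t = self.window
        self.prev_w = np.zeros(self.prices.shape[1])
        return self.prices[self.t - self.window:self.t]

    def step(self, weights):
        w = np.clip(weights, 0.0, 1.0)
        w = w / max(w.sum(), 1.0)          # leftover mass is implicit cash
        r = np.log(self.prices[self.t] / self.prices[self.t - 1])
        turnover = np.abs(w - self.prev_w).sum()
        reward = float(w @ r) - self.cost * turnover
        self.prev_w = w
        self.t += 1
        done = self.t >= len(self.prices)
        obs = None if done else self.prices[self.t - self.window:self.t]
        return obs, reward, done, {}
```

Because the weights need not sum to one, any unallocated mass behaves like the cash position mentioned below.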
The Results
| Approach | Return | Sortino | Max DD | Goodness | Train Time |
|---|---|---|---|---|---|
| RL (PPO/GymRL) | +35.8% | 2.11 | 27.6% | 32.5 | 122s |
| HF Transformer | 0.0% | 0.00 | 0.0% | -0.4 | 105s |
| Qwen Fine-Tune (LoRA) | 0.0% | 0.00 | 0.0% | -0.4 | 998s |
RL wins by a wide margin. The other two approaches produced zero trading activity — the models defaulted to "hold" on every bar.
Why RL Dominates
The PPO agent trained in about 2 minutes and generated a 35.8% return with a 2.11 Sortino ratio. The critical structural advantage: the agent directly optimizes the objective we care about.
Every gradient update pushes the policy toward higher risk-adjusted returns. There is no translation layer between what the model learns and what we measure. The gym environment encapsulates the market simulator, so the agent literally practices trading during training.
Key design choices:
- Continuous allocation weights let the agent learn nuanced position sizing
- Multi-signal feature cube with returns, volatility, and rolling stats per asset
- Reward shaping with turnover and drawdown penalties to prevent churning
- Cash position so the agent can sit out when uncertain
Why the Transformer Held Everything
The HF transformer trained successfully (loss: 2.3 → 1.3) but every inference produced "hold." This is the classic class imbalance problem: in most bars, the optimal label is "hold" because price moves are small. The model learned the base rate perfectly and ignored the rare buy/sell signals.
Possible fixes: a class-weighted loss, using the continuous allocation head instead of discrete actions, longer training with a differentiable profit loss, or threshold-based signal extraction from the price predictions.
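The class-weighted loss fix is straightforward: up-weight the rare buy/sell classes so collapsing to "hold" stops being cheap. A minimal numpy sketch (the weights shown are illustrative, e.g. inverse class frequency):

```python
import numpy as np

def weighted_ce(logits, labels, class_weights):
    """Class-weighted cross-entropy sketch: rare classes (buy/sell) get
    larger weights, so predicting 'hold' everywhere is penalized."""
    z = logits - logits.max(axis=1, keepdims=True)       # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]     # per-example NLL
    w = class_weights[labels]                            # per-example weight
    return float((w * nll).sum() / w.sum())
```

With weights of roughly 1/frequency, a model that always predicts the majority class sees its loss dominated by the examples it gets wrong.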
Why Qwen Generated Only "Hold"
The Qwen model trained on 500 examples (loss: 2.5 → 1.2) and learned the output format. But inference produced mostly "hold" due to the same training-data imbalance, and each prediction took ~5 seconds through the 0.8B-parameter model — impractical for hourly trading.
The Qwen approach may work better for daily trading with higher-conviction signals, where decisions are fewer and larger.
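For reference, a training example in the chat format described above might look like the following. The exact schema, field names, and market-context string are hypothetical illustrations, not the repo's spec; only the plan fields (action, confidence, entry, stop, target) come from the post.

```python
import json

# Hypothetical chat-format training example; schema is an assumption.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a trading assistant. Return a JSON trading plan."},
        {"role": "user",
         "content": "BTCUSD 1h: close 64210, RSI 41, ATR 820, trend flat."},
        {"role": "assistant",
         "content": json.dumps({
             "action": "buy",
             "confidence": 0.62,
             "entry": 64100,
             "stop": 63200,
             "target": 66000,
         })},
    ]
}

# The assistant turn parses back into a structured plan.
plan = json.loads(example["messages"][-1]["content"])
```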
The Winning Combination
- Use RL (PPO) for the allocation layer — let the agent learn position sizing and timing from the reward signal
- Use transformer forecasts as features for the RL agent — Chronos2/HF models produce quantile forecasts that enrich the feature cube
- Use Qwen/LLM for regime analysis and filtering — LLMs process qualitative context (news, macro) as a filter on top of RL allocations
This layered approach captures the strengths of each method while avoiding the weakness of relying on any single one.
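The wiring of the layered stack can be sketched in a few lines. All three callables here are stand-ins (toy lambdas), not the actual Chronos2 forecaster, PPO policy, or LLM filter:

```python
import numpy as np

def layered_allocation(features, forecaster, rl_policy, regime_filter):
    """Sketch of the layered stack: forecaster output is appended to the
    feature cube, the RL policy sizes positions, and a regime filter scales
    exposure. All three callables are hypothetical stand-ins."""
    forecast = forecaster(features)                       # e.g. quantile preds
    enriched = np.concatenate([features, forecast], axis=-1)
    weights = rl_policy(enriched)                         # RL allocation layer
    scale = regime_filter(features)                       # 0..1 exposure gate
    return weights * scale

# Toy stand-ins just to show the data flow
alloc = layered_allocation(
    np.ones((3, 4)),
    forecaster=lambda f: f.mean(axis=-1, keepdims=True),
    rl_policy=lambda f: np.full(f.shape[0], 1 / f.shape[0]),
    regime_filter=lambda f: 0.5,
)
```

The key design choice is that the LLM never sizes positions directly; it only gates the exposure that the RL layer has already decided on.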
Code
The comparison experiment is fully reproducible:
```bash
source .venv313/bin/activate
python experiments/compare_rl_hf_qwen.py --symbols BTCUSD,ETHUSD,SOLUSD --quick
```
Full runs use 500k PPO timesteps, 10k HF steps, and 3 Qwen epochs. See the stock-prediction repo for the complete code.
Plug: BitBank uses this layered RL + forecaster approach in production. Check the dashboard for live predictions and the tech blog for more details.