Conjecture 6.6: Optimal Forgetting
Statement
There exists an optimal exponential forgetting rate \( \lambda^* \in (0, 1) \) that minimises prediction error, and this rate can be recovered by meta-gradient descent on the outer loss. The optimal rate depends on whether the environment is stationary or changing.
Status: Validated
Meta-gradient descent reliably recovers near-optimal \( \lambda^* \) values that match theoretical predictions:
- Stationary environment: \( \lambda^* \approx 0.99 \) — remember almost everything, forget slowly
- Changing environment: \( \lambda^* \approx 0.93 \) — forget faster to track changes
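The two regimes can be illustrated with a toy exponentially weighted predictor. This is a sketch, not the experiment itself: the environments, jump sizes, and noise levels below are invented for illustration, so the recovered optima will not match 0.99 and 0.93 exactly; only their ordering (stationary optimum above changing-environment optimum) should carry over.

```python
import numpy as np

def prediction_error(x, lam):
    """Mean squared one-step error of the exponentially forgetting mean
    m_t = lam * m_{t-1} + (1 - lam) * x_t."""
    m, total = x[0], 0.0
    for t in range(1, len(x)):
        total += (m - x[t]) ** 2          # predict x_t from the running mean
        m = lam * m + (1.0 - lam) * x[t]  # exponential forgetting update
    return total / (len(x) - 1)

rng = np.random.default_rng(1)
T = 20000
# stationary environment: fixed mean plus noise
stationary = 1.0 + rng.normal(0.0, 1.0, size=T)
# changing environment: the mean jumps to a new level every 100 steps
levels = np.repeat(rng.normal(0.0, 2.0, size=T // 100), 100)
shifting = levels + rng.normal(0.0, 1.0, size=T)

grid = [0.5, 0.7, 0.9, 0.93, 0.99]
best_stat = min(grid, key=lambda l: prediction_error(stationary, l))
best_shift = min(grid, key=lambda l: prediction_error(shifting, l))
print("stationary optimum:", best_stat)   # high lambda: forget slowly
print("changing optimum:  ", best_shift)  # lower lambda: track the shifts
```

The grid search stands in for the analytic optima: in the stationary case the error is monotone decreasing in \( \lambda \), while under shifts the bias from slow tracking pushes the optimum down.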
Evidence Summary
The experiment exp_memory_dynamics.sx initialises the forgetting rate at \( \lambda = 0.5 \) and uses meta-gradient descent to optimise it:
- In stationary environments, the meta-gradient drives \( \lambda \) upward toward 0.99 within ~50 meta-steps
- In environments with distribution shift every 100 steps, \( \lambda \) stabilises near 0.93
- The recovered values closely match the analytically derived optima for exponential smoothing
- The meta-gradient signal is clean and consistent: no spurious local minima were observed across runs
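The optimisation described above can be sketched by differentiating through the unrolled smoothing recursion. This is a minimal illustration, assuming a scalar exponentially weighted mean as the inner learner; the helper name `meta_gradient_lambda` and all data parameters are invented here, and the actual setup in exp_memory_dynamics.sx is not shown in this document, so the recovered values will differ from the reported 0.99 and 0.93.

```python
import numpy as np

def meta_gradient_lambda(x, lam=0.5, lr=0.02, meta_steps=100):
    """Tune the forgetting rate by gradient descent on prediction error.

    Inner loop: m_t = lam * m_{t-1} + (1 - lam) * x_t.
    The gradient of the squared one-step error is propagated through the
    unrolled recursion via dm_t/dlam = m_{t-1} + lam * dm_{t-1} - x_t.
    """
    for _ in range(meta_steps):
        m, dm = x[0], 0.0                 # state and its derivative w.r.t. lam
        grad = 0.0
        for t in range(1, len(x)):
            err = m - x[t]                # one-step prediction error
            grad += 2.0 * err * dm        # chain rule through m_{t-1}
            dm = m + lam * dm - x[t]      # update d m_t / d lam
            m = lam * m + (1.0 - lam) * x[t]
        lam -= lr * grad / len(x)         # meta-gradient step on the outer loss
        lam = min(max(lam, 0.01), 0.995)  # keep lam inside (0, 1)
    return lam

rng = np.random.default_rng(0)
stationary = 1.0 + rng.normal(0.0, 1.0, size=2000)
levels = np.repeat(rng.normal(0.0, 2.0, size=20), 100)  # shift every 100 steps
shifting = levels + rng.normal(0.0, 1.0, size=2000)

lam_stat = meta_gradient_lambda(stationary)   # climbs from 0.5 toward 1
lam_shift = meta_gradient_lambda(shifting)    # settles at a lower value
print(f"stationary: {lam_stat:.3f}, changing: {lam_shift:.3f}")
```

Starting from \( \lambda = 0.5 \), the meta-gradient is negative throughout the stationary run (longer memory always helps), so \( \lambda \) drifts upward; under shifts the tracking-bias term turns the gradient positive at high \( \lambda \), pinning the estimate lower.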
This demonstrates that the adaptation framework can learn its own hyperparameters — the system tunes its memory depth based on the environment it encounters.
Relevant Experiments
- exp_memory_dynamics.sx — meta-gradient learning of the forgetting rate
- exp_sensitivity.sx — sensitivity of convergence to hyperparameter choices
What This Means
Optimal forgetting is a key capability for autonomous systems: the ability to automatically calibrate memory depth to the environment eliminates a critical hyperparameter. In stationary environments, the system learns to be an elephant (long memory); in volatile environments, it learns to be adaptive (short memory). This self-tuning property is a practical consequence of the theorem's contraction guarantees — the meta-gradient is well-behaved because the inner loop converges.