← All Conjectures

Conjecture 6.6: Optimal Forgetting

VALIDATED: A meta-gradient can learn the optimal forgetting rate, adapting it to the environment's stationarity.

Statement

There exists an optimal exponential forgetting rate \( \lambda^* \in (0, 1) \) that minimises prediction error, and this rate can be recovered by meta-gradient descent on the outer loss. The optimal rate depends on whether the environment is stationary or changing.
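For concreteness, the statement can be written down for a scalar exponential smoother. The notation below is an assumption for illustration; the source does not fix the inner model:

```latex
% Inner model: exponentially weighted running estimate with forgetting rate \lambda
\hat{y}_t = \lambda\, \hat{y}_{t-1} + (1 - \lambda)\, y_t
% Outer loss: expected one-step-ahead prediction error, minimised over \lambda
J(\lambda) = \mathbb{E}\!\left[ \left( y_{t+1} - \hat{y}_t \right)^2 \right],
\qquad
\lambda^* = \arg\min_{\lambda \in (0,1)} J(\lambda)
```

Large \( \lambda \) averages over a long history (low variance, slow tracking); small \( \lambda \) tracks recent data (fast tracking, high variance), which is why \( \lambda^* \) depends on stationarity.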

Status: Validated

Meta-gradient descent reliably recovers near-optimal \( \lambda^* \) values that match theoretical predictions:

  • Stationary environment: \( \lambda^* \approx 0.99 \) — remember almost everything, forget slowly
  • Changing environment: \( \lambda^* \approx 0.93 \) — forget faster to track changes
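The two regimes can be reproduced with a brute-force sweep over \( \lambda \) on synthetic streams. This is a minimal sketch: the streams, noise levels, shift period, and grid are illustrative assumptions, not the experiment's actual data.

```python
import numpy as np

def exp_smooth_error(lam, y):
    """Mean one-step-ahead squared prediction error of an exponential smoother."""
    yhat, err = y[0], 0.0
    for t in range(1, len(y)):
        err += (y[t] - yhat) ** 2            # predict y[t] before observing it
        yhat = lam * yhat + (1 - lam) * y[t]  # then update the running estimate
    return err / (len(y) - 1)

rng = np.random.default_rng(0)
T = 4000
# Stationary stream: fixed mean plus noise
stationary = 1.0 + 0.5 * rng.standard_normal(T)
# Changing stream: the mean jumps every 100 steps, mirroring the shift setting
means = np.repeat(rng.uniform(-2.0, 2.0, T // 100), 100)
changing = means + 0.5 * rng.standard_normal(T)

grid = np.linspace(0.5, 0.999, 100)
lam_stat = grid[np.argmin([exp_smooth_error(l, stationary) for l in grid])]
lam_chg = grid[np.argmin([exp_smooth_error(l, changing) for l in grid])]
```

On streams like these, the empirical minimiser for the stationary case sits near the top of the grid, while the shifting case prefers a visibly smaller \( \lambda \), matching the ordering above (the exact values depend on noise level and shift size).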

Evidence Summary

The experiment exp_memory_dynamics.sx initialises the forgetting rate at \( \lambda = 0.5 \) and uses meta-gradient descent to optimise it:

  • In stationary environments, the meta-gradient drives \( \lambda \) upward toward 0.99 within ~50 meta-steps
  • In environments with distribution shift every 100 steps, \( \lambda \) stabilises near 0.93
  • The recovered values closely match the analytically derived optima for exponential smoothing
  • The meta-gradient signal is clean and consistent: no local minima were observed

This demonstrates that the adaptation framework can learn its own hyperparameters — the system tunes its memory depth based on the environment it encounters.
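That self-tuning loop can be sketched for a scalar exponential smoother with a one-step-ahead outer loss. This is an illustrative assumption, not the experiment's actual inner model: the derivative of the estimate with respect to \( \lambda \) is carried by a forward recursion (treating past \( \lambda \) values as constants, the standard online meta-gradient approximation), and \( \lambda \) descends the outer loss online from the same 0.5 initialisation the experiment uses.

```python
import numpy as np

def meta_learn_lambda(y, lam=0.5, meta_lr=1e-3):
    """Online meta-gradient descent on the forgetting rate of an
    exponential smoother  yhat_t = lam * yhat_{t-1} + (1 - lam) * y_t."""
    yhat = y[0]    # running estimate
    dyhat = 0.0    # d(yhat)/d(lam), maintained by a forward recursion
    for t in range(1, len(y)):
        # Outer loss at step t: (y[t] - yhat_{t-1})^2; its lambda-gradient
        grad = -2.0 * (y[t] - yhat) * dyhat
        lam = float(np.clip(lam - meta_lr * grad, 0.01, 0.999))
        # Derivative recursion: d(yhat_t)/d(lam) = yhat_{t-1} + lam * d(yhat_{t-1})/d(lam) - y_t
        dyhat = yhat + lam * dyhat - y[t]
        yhat = lam * yhat + (1 - lam) * y[t]
    return lam

rng = np.random.default_rng(1)
stationary = 1.0 + 0.5 * rng.standard_normal(4000)
lam_final = meta_learn_lambda(stationary)  # drifts upward from 0.5 on stationary data
```

On a stationary stream the expected meta-gradient is negative for every \( \lambda \in (0,1) \), so the update drives \( \lambda \) upward, consistent with the clean, minimum-free signal reported above.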

Relevant Experiments

  • exp_memory_dynamics.sx — meta-gradient learning of forgetting rate
  • exp_sensitivity.sx — sensitivity of convergence to hyperparameter choices

What This Means

Optimal forgetting is a key capability for autonomous systems: the ability to automatically calibrate memory depth to the environment eliminates a critical hyperparameter. In stationary environments, the system learns to be an elephant (long memory); in volatile environments, it learns to be adaptive (short memory). This self-tuning property is a practical consequence of the theorem's contraction guarantees — the meta-gradient is well-behaved because the inner loop converges.