We propose a runtime shielding framework that adapts online to hidden parameters while offering provable probabilistic safety. The core components are:
- Online Hidden-Parameter Adaptation: We leverage function encoders to efficiently infer hidden parameters from recent observations, enabling both the policy and the shield to adapt without retraining.
- Safety-Regularized RL Objective (SRO): A novel objective function that balances reward maximization with safety by integrating a cost-sensitive value estimate, encouraging low-violation behavior during training. See Appendix G for an analysis of why this design choice is important.
- Adaptive Shield: A runtime shield that filters potentially unsafe actions proposed by the policy. It uses the inferred dynamics and conformal prediction to quantify uncertainty in future state forecasts, permitting only those actions whose forecasts respect the resulting safety margins.
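One way to make the safety-regularized objective concrete is a Lagrangian-style trade-off between reward and discounted cost; this is a sketch under assumed notation (weight $\lambda$, per-step cost $c_t$), not the exact formulation used in the paper:

```latex
J_{\text{SRO}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right] \;-\; \lambda\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c_t\right]
```

Here the second expectation plays the role of the cost-sensitive value estimate, and larger $\lambda$ trades reward for fewer violations during training.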
This combination maintains both safety and task performance even when the robot's dynamics change unexpectedly.
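To illustrate how conformal prediction can back a runtime shield, the following is a minimal sketch: a split-conformal margin is calibrated from forecast residuals of an (assumed) inferred dynamics model, and candidate actions are kept only if the margin-inflated forecast stays in the safe set. All names here (`dynamics`, `is_safe`, the toy system) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def conformal_margin(pred_states, true_states, alpha=0.1):
    """Split conformal prediction: the finite-sample-corrected
    (1 - alpha) quantile of forecast errors on held-out calibration
    data covers future errors with probability >= 1 - alpha."""
    residuals = np.linalg.norm(pred_states - true_states, axis=1)
    n = len(residuals)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(residuals, q)

def shield(candidate_actions, dynamics, state, margin, is_safe):
    """Keep only actions whose one-step forecast, inflated by the
    conformal margin, remains inside the safe set."""
    return [a for a in candidate_actions
            if is_safe(dynamics(state, a), margin)]

# Toy 2D system (hypothetical): inferred model vs. noisy true step.
rng = np.random.default_rng(0)
dynamics = lambda s, a: s + 0.1 * a              # assumed inferred model
true_step = lambda s, a: dynamics(s, a) + 0.05 * rng.normal(size=2)

# Calibration data for the conformal margin.
states = rng.uniform(-1, 1, size=(200, 2))
actions = rng.uniform(-1, 1, size=(200, 2))
preds = np.array([dynamics(s, a) for s, a in zip(states, actions)])
trues = np.array([true_step(s, a) for s, a in zip(states, actions)])
margin = conformal_margin(preds, trues, alpha=0.1)

# Safe set: the box |x|_inf <= 1, shrunk by the margin.
is_safe = lambda s_next, m: np.all(np.abs(s_next) + m <= 1.0)
state = np.array([0.85, 0.0])
candidates = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
safe_actions = shield(candidates, dynamics, state, margin, is_safe)
# The action pushing toward the boundary is filtered out;
# the action moving away from it is kept.
```

Because the margin is recomputed from recent observations, the shield's conservatism adapts whenever the inferred dynamics change, which is the mechanism the adaptive shield relies on.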