We propose a runtime shielding framework that adapts online to hidden parameters while offering provable probabilistic safety. The core components are:
- Online Hidden-Parameter Adaptation: We leverage function encoders to efficiently infer hidden parameters from recent observations, enabling both the policy and the shield to adapt without retraining.
- Safety-Regularized RL Objective (SRO): A novel objective function that balances reward maximization with safety by integrating a cost-sensitive value estimate, encouraging low-violation behavior during training. See Appendix G for an analysis of why this design choice is important.
- Adaptive Shield: A runtime shield that filters potentially unsafe actions proposed by the policy. It uses the inferred dynamics and conformal prediction to quantify uncertainty in future state forecasts, permitting only those actions whose forecasts respect the resulting safety margins.
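One way to make the safety-regularized objective concrete is a Lagrangian-style trade-off between reward and discounted cost; this is a sketch under assumed notation (weight $\lambda$, per-step cost $c_t$), not the exact formulation used in the paper:

```latex
J_{\text{SRO}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right] \;-\; \lambda\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c_t\right]
```

Here the second expectation plays the role of the cost-sensitive value estimate, and larger $\lambda$ trades reward for fewer violations during training.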
This combination maintains both safety and task performance even when the robot's dynamics change unexpectedly.
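To illustrate how conformal prediction can back a runtime shield, the following is a minimal sketch: a split-conformal margin is calibrated from forecast residuals of an (assumed) inferred dynamics model, and candidate actions are kept only if the margin-inflated forecast stays in the safe set. All names here (`dynamics`, `is_safe`, the toy system) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def conformal_margin(pred_states, true_states, alpha=0.1):
    """Split conformal prediction: the finite-sample-corrected
    (1 - alpha) quantile of forecast errors on held-out calibration
    data covers future errors with probability >= 1 - alpha."""
    residuals = np.linalg.norm(pred_states - true_states, axis=1)
    n = len(residuals)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(residuals, q)

def shield(candidate_actions, dynamics, state, margin, is_safe):
    """Keep only actions whose one-step forecast, inflated by the
    conformal margin, remains inside the safe set."""
    return [a for a in candidate_actions
            if is_safe(dynamics(state, a), margin)]

# Toy 2D system (hypothetical): inferred model vs. noisy true step.
rng = np.random.default_rng(0)
dynamics = lambda s, a: s + 0.1 * a              # assumed inferred model
true_step = lambda s, a: dynamics(s, a) + 0.05 * rng.normal(size=2)

# Calibration data for the conformal margin.
states = rng.uniform(-1, 1, size=(200, 2))
actions = rng.uniform(-1, 1, size=(200, 2))
preds = np.array([dynamics(s, a) for s, a in zip(states, actions)])
trues = np.array([true_step(s, a) for s, a in zip(states, actions)])
margin = conformal_margin(preds, trues, alpha=0.1)

# Safe set: the box |x|_inf <= 1, shrunk by the margin.
is_safe = lambda s_next, m: np.all(np.abs(s_next) + m <= 1.0)
state = np.array([0.85, 0.0])
candidates = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
safe_actions = shield(candidates, dynamics, state, margin, is_safe)
# The action pushing toward the boundary is filtered out;
# the action moving away from it is kept.
```

Because the margin is recomputed from recent observations, the shield's conservatism adapts whenever the inferred dynamics change, which is the mechanism the adaptive shield relies on.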