Offline RL Shielding

Safety in offline reinforcement learning via shields

Background: In the offline reinforcement learning (RL) problem, we aim to train/optimise an agent to perform a task from a fixed dataset of experiences. Because no additional data can be collected, the outcome of actions that were never tried in the dataset remains unclear. As a consequence, effectively evaluating the agent, so-called off-policy evaluation, remains an open problem, and it is hard to guarantee that the agent will perform well and, equally importantly, will only take safe actions. In previous work, the safety of RL agents has been improved through so-called shields, which only allow the agent to take safe actions.
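
To make the idea of a shield concrete, below is a minimal, hypothetical Python sketch of shielded action selection: the agent may only choose among the actions the shield labels as safe. The predicate `is_safe(state, action)`, the Q-values, and the fallback behaviour are illustrative assumptions, not part of any specific shielding method.

```python
import numpy as np

def shielded_action(state, q_values, is_safe, n_actions):
    """Pick the highest-value action among those the shield allows."""
    safe_actions = [a for a in range(n_actions) if is_safe(state, a)]
    if not safe_actions:
        # If the shield allows nothing, fall back to the unshielded greedy
        # action; a properly constructed shield avoids this case.
        return int(np.argmax(q_values))
    # Among the safe actions, pick the one with the highest estimated value.
    return max(safe_actions, key=lambda a: q_values[a])
```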

Project: You will investigate the use of shields in the offline RL problem. In particular, we are interested in whether optimising the agent under a shield can provide an alternative to off-policy evaluation, and whether we can improve the safe behaviour of the agent while it learns from the dataset, such that the agent can be deployed at any time during learning. Your tasks may include analysing the theoretical use of shielding in offline RL and/or conducting experiments to compare the performance of the approaches in practice.
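
As one illustration of how a shield could be combined with learning from a fixed dataset, the sketch below applies a shield inside a tabular Q-learning update: the bootstrap maximisation only ranges over actions the shield deems safe in the next state. The dataset format, the `is_safe` predicate, and the hyperparameters are assumptions made for the example, not a prescribed method.

```python
import numpy as np

def shielded_offline_q_update(Q, dataset, is_safe, alpha=0.1, gamma=0.99):
    """One sweep of tabular Q-learning over a fixed dataset of transitions
    (s, a, r, s_next, done), where the bootstrap maximisation only ranges
    over actions the shield deems safe in the next state."""
    n_actions = Q.shape[1]
    for (s, a, r, s_next, done) in dataset:
        safe_next = [b for b in range(n_actions) if is_safe(s_next, b)]
        target = r
        if not done and safe_next:
            target += gamma * max(Q[s_next, b] for b in safe_next)
        # Standard temporal-difference update towards the shielded target.
        Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example usage (hypothetical toy data):
# Q = np.zeros((5, 2))
# data = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, True)]
# Q = shielded_offline_q_update(Q, data, is_safe=lambda s, a: True)
```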

Affinity with, or the motivation to learn about, Markov decision processes and reinforcement learning is a big plus. This project can easily lead to a bachelor's/master's thesis and possibly a scientific publication. Please feel free to reach out to maris.galesloot@ru.nl with any questions.

Supervisors: Maris Galesloot, MSc, and Prof. Dr. Nils Jansen