Reinforcement learning
- Safe reinforcement learning
- Offline reinforcement learning
Improved SPI
Improved bounds for safe policy improvement
- Offline reinforcement learning
- Sample efficiency
- Reliability
- Safety
In offline reinforcement learning, the available data is often limited. Current methods reliably compute new policies that outperform the data-collection policy, but they may require prohibitive amounts of data. Since we cannot collect more data, it is essential to make the most of the data we have. We therefore develop a transformation of the underlying MDP with smaller bounds on the minimum amount of data required for improvement (Wienhöft et al., 2023). This allows offline RL algorithms to return better policies without sacrificing reliability.
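To give a flavor of how data requirements enter safe policy improvement, the sketch below shows a common count-based view: state-action pairs observed fewer times than a threshold are treated as uncertain, and the improved policy is typically kept close to the behavior policy on them. The dataset format, the threshold `n_min`, and the function names are illustrative assumptions, not the specific transformation of Wienhöft et al. (2023).

```python
from collections import Counter

def count_state_actions(dataset):
    """Count how often each (state, action) pair appears in an offline dataset.

    `dataset` is assumed to be a list of (state, action, reward, next_state)
    transitions; this format is an illustrative assumption.
    """
    counts = Counter()
    for state, action, _reward, _next_state in dataset:
        counts[(state, action)] += 1
    return counts

def uncertain_pairs(counts, n_min):
    """Return the (state, action) pairs seen fewer than `n_min` times.

    Count-based SPI methods constrain the new policy to follow the behavior
    policy on such pairs; a smaller admissible `n_min` means less data is
    needed before improvement is allowed anywhere in the MDP.
    """
    return {sa for sa, n in counts.items() if n < n_min}

# Toy usage with a hypothetical dataset of integer states and actions.
data = [(0, 1, 0.5, 1), (0, 1, 0.4, 1), (1, 0, 1.0, 0)]
counts = count_state_actions(data)
print(uncertain_pairs(counts, n_min=2))  # {(1, 0)}
```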
References
- Wienhöft, P., Suilen, M., Simão, T. D., Dubslaff, C., Baier, C., & Jansen, N. (2023). More for Less: Safe Policy Improvement with Stronger Performance Guarantees. IJCAI, 4406–4415.
Act-Then-Measure
Reinforcement learning for partially observable environments with active measuring
- Learning for planning and scheduling
- Partially observable and unobservable domains
- Uncertainty and stochasticity in planning and scheduling
Ever wondered when you should inspect the engine of your car? Or how often an electricity provider should check their cables to minimize outages and maintenance costs? Or how often a drone should use its battery-draining GPS to keep an accurate estimate of its position? What connects these problems is one core question: Is the extra information from a measurement worth its cost?
In our recent work, we tackle such problems efficiently by distinguishing between control actions (which affect the environment) and measuring actions (which give us information). For the former, we take uncertainty about the current situation into account but ignore it when predicting the future, which makes our method faster. For the latter, we describe a novel method to determine when we can rely on our predictions and when we should measure to eliminate uncertainty instead.
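As a rough illustration of the measure-or-not decision, the sketch below compares the value of acting under the current belief with the value obtainable if the state were known, and measures when the expected gap exceeds the measurement cost. It is a simplified, hypothetical decision rule, not the exact algorithm from the ICAPS paper; the names `belief`, `q_values`, and `measure_cost` are assumptions.

```python
import numpy as np

def should_measure(belief, q_values, measure_cost):
    """Decide whether paying for a measurement is worth it.

    belief:       array of shape (n_states,), probabilities over states.
    q_values:     array of shape (n_states, n_actions), state-action values.
    measure_cost: scalar cost of taking a measuring action.

    We measure when the expected regret of acting under the belief, compared
    to acting with full knowledge of the state, exceeds the measurement cost.
    This is an illustrative rule, not the paper's method.
    """
    # Value of the single best action given our current uncertainty.
    value_under_belief = np.max(belief @ q_values)
    # Expected value if the state were revealed before acting.
    value_if_known = belief @ np.max(q_values, axis=1)
    expected_regret = value_if_known - value_under_belief
    return expected_regret > measure_cost

# Toy example: two states, two actions, a mildly uncertain belief.
belief = np.array([0.6, 0.4])
q_values = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
print(should_measure(belief, q_values, measure_cost=0.1))  # True
```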
Interested in how it performs? Have a look at our ICAPS paper (Krale et al., 2023) to find out!
References
- Krale, M., Simão, T. D., & Jansen, N. (2023). Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring. ICAPS, 212–220.
SPI-POMDPs
Reliable offline reinforcement learning (RL) with partial observability
- Offline reinforcement learning
- Partial observability
- Reliability
- Safety
Limited memory is sufficient for reliable offline reinforcement learning (RL) with partial observability.
Safe policy improvement (SPI) aims to reliably improve an agent’s performance in an environment where only historical data is available. Typically, SPI algorithms assume that historical data comes from a fully observable environment. In many real-world applications, however, the environment is only partially observable. Therefore, we investigate how to use SPI algorithms in those settings and show that when the agent has enough memory to infer the environment’s dynamics, it can significantly improve its performance (Simão et al., 2023).
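To illustrate what "enough memory" can look like in practice, the sketch below maps a stream of observations to finite-memory states given by a sliding window, on which a standard SPI algorithm for fully observable models could then be run. The window length `k` and this encoding are illustrative assumptions; they stand in for, but are not, the finite-state controllers used in the paper.

```python
from collections import deque

def to_memory_states(observations, k):
    """Map a stream of observations to finite-memory states.

    Each state is the tuple of the `k` most recent observations, padded with
    None at the start. An SPI algorithm can then treat these tuples as states
    of a fully observable model. The window length `k` and this simple
    sliding-window encoding are illustrative choices.
    """
    window = deque(maxlen=k)
    states = []
    for obs in observations:
        window.append(obs)
        padded = (None,) * (k - len(window)) + tuple(window)
        states.append(padded)
    return states

# Toy observation sequence with a memory of the last two observations.
print(to_memory_states([0, 1, 1, 0], k=2))
# [(None, 0), (0, 1), (1, 1), (1, 0)]
```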
References
- Simão, T. D., Suilen, M., & Jansen, N. (2023). Safe Policy Improvement for POMDPs via Finite-State Controllers. AAAI, 15109–15117.