emdp.examples package
Submodules
emdp.examples.action_gap module
- emdp.examples.action_gap.build_cake_world_mdp(epsilon, discount, cake_reward=1.0)
Cake world MDP from Action Gap Paper (Fig 1 of Bellemare et al. 2016).
Increasing the Action Gap: New Operators for Reinforcement Learning. https://arxiv.org/pdf/1512.04860.pdf
The action gap is modulated by epsilon: the Q-values of the two actions available in state x1 differ by epsilon.
- Parameters
epsilon – Float epsilon for the action gap.
discount – Float discount factor.
cake_reward – Float reward for eating cake.
- Returns
An emdp.common.MDP object.
emdp.examples.counter module
emdp.examples.off_policy module
- emdp.examples.off_policy.build_two_circle_MDP(discount=0.6, good_reward=10.0, distractor_reward=5.0)
MDP counterexample given in Fig. 1a of Zhang et al.
See “Generalized Off-Policy Actor-Critic” https://arxiv.org/pdf/1903.11329.pdf
- Parameters
discount – The discount factor.
good_reward – The good reward that the agent must find.
distractor_reward – The distraction reward.
- Returns
An emdp.common.MDP object.
emdp.examples.simple module
- emdp.examples.simple.build_SB_example35()
Example 3.5 from (Sutton and Barto, 2018) pg 60 (March 2018 version). A rectangular Gridworld representation of size 5 x 5.
Quotation from book: At each state, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A’. From state B, all actions yield a reward of +5 and take the agent to B’.
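The dynamics quoted above can be sketched as a single step function. This is a stand-alone illustration, not the emdp implementation: the names `SIZE`, `MOVES`, `SPECIAL` and `step` are hypothetical, and the (row, column) coordinates of A, A’, B and B’ are taken from the figure in the book rather than from this page, so emdp may index the cells differently.

```python
# Hypothetical stand-alone sketch of the Example 3.5 dynamics; not emdp code.
SIZE = 5

# (row, col) offsets; row 0 is the top (north) edge of the grid.
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

# Special states as drawn in the book's figure: state -> (destination, reward).
SPECIAL = {
    (0, 1): ((4, 1), 10.0),  # A -> A'
    (0, 3): ((2, 3), 5.0),   # B -> B'
}


def step(state, action):
    """Return (next_state, reward) for one deterministic move."""
    if state in SPECIAL:
        # Every action taken in A or B teleports the agent and pays out.
        return SPECIAL[state]
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    if not (0 <= row < SIZE and 0 <= col < SIZE):
        # Moving off the grid leaves the location unchanged and costs 1.
        return state, -1.0
    return (row, col), 0.0
```

For instance, `step((0, 0), "north")` leaves the agent in the corner with a reward of -1, while any action from `(0, 1)` moves it to `(4, 1)` with a reward of +10.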
- emdp.examples.simple.build_SB_example41()
Example 4.1 from (Sutton and Barto, 2018) pg (Jan 2018 version).
- emdp.examples.simple.build_twostate_MDP()
A two-state MDP with transition probabilities:
P(s_0 | s_0, a_0) = 0.5
P(s_1 | s_0, a_0) = 0.5
P(s_0 | s_0, a_1) = 0
P(s_1 | s_0, a_1) = 1
P(s_0 | s_1, a_2) = 0
P(s_1 | s_1, a_2) = 1
Rewards: r(s_0, a_0) = 5, r(s_0, a_1) = 10, r(s_1, a_2) = -1.
Discount factor: 0.95.
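Because this MDP is fully specified by the probabilities, rewards and discount factor listed above, its optimal values can be checked by hand or with a few lines of value iteration. The sketch below is a stand-alone illustration that does not use the emdp API; the names `MODEL`, `q_value` and `value_iteration` are hypothetical, and it assumes that a_0 and a_1 are available only in s_0 and a_2 only in s_1.

```python
# Stand-alone sketch; does not use the emdp API.
GAMMA = 0.95

# MODEL[state][action] = (reward, {next_state: probability}).
# Assumption: a_0 and a_1 are available only in s_0, a_2 only in s_1.
MODEL = {
    "s_0": {
        "a_0": (5.0, {"s_0": 0.5, "s_1": 0.5}),
        "a_1": (10.0, {"s_1": 1.0}),
    },
    "s_1": {
        "a_2": (-1.0, {"s_1": 1.0}),
    },
}


def q_value(values, state, action):
    """One-step lookahead: r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')."""
    reward, transitions = MODEL[state][action]
    return reward + GAMMA * sum(p * values[nxt] for nxt, p in transitions.items())


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""
    values = {state: 0.0 for state in MODEL}
    while True:
        new_values = {
            state: max(q_value(values, state, action) for action in MODEL[state])
            for state in MODEL
        }
        if max(abs(new_values[s] - values[s]) for s in MODEL) < tol:
            return new_values
        values = new_values


values = value_iteration()
# Greedy action in each state with respect to the converged values.
greedy = {s: max(MODEL[s], key=lambda a: q_value(values, s, a)) for s in MODEL}
```

Under these assumptions the fixed point is V(s_1) = -1 / (1 - 0.95) = -20 and V(s_0) = -4.5 / 0.525 ≈ -8.57, with a_0 the greedy action in s_0 (taking a_1 instead would give 10 + 0.95 · (-20) = -9).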
emdp.examples.tricky_gridworlds module
Environments with Tricky Rewards
Two kinds of worlds are available:
- Symmetric Grid World
(0, size-1): +best_reward
(size-1, 0): epsilon * best_reward
- Multi-minima Grid World
(2,0): best_reward
(1,1): best_reward*2/3
(0,2): best_reward*1/3
- emdp.examples.tricky_gridworlds.make_four_minima_env(epsilon, best_reward=5, size=10, p_success=1, gamma=0.99, seed=2017)
Makes a gridworld where there are four rewards:
{(0, size/2): (2 - epsilon) * best_reward,
(size/2, 0): epsilon * best_reward,
(size/2, size-1): epsilon * best_reward,
(size-1, size/2): best_reward}
The agent starts in the middle of the grid at (size/2, size/2). Note that size must be odd.
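- Parameters
epsilon – the proportion of best_reward used to scale the rewards (see the layout above)
best_reward – the base reward that the four reward values are scaled from
size – size of the grid world
p_success – the probability an action is successful
gamma – the discount factor for the MDP
seed – the seed for the MDP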
- Returns
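The four-reward layout of make_four_minima_env can be written out as a small dictionary builder. This is a minimal sketch, not the emdp implementation: `four_minima_reward_spec` is a hypothetical helper, and it assumes that size/2 in the description means integer (floor) division, which is what makes the odd-size requirement necessary.

```python
def four_minima_reward_spec(epsilon, best_reward=5, size=11):
    """Hypothetical helper: build the {(x, y): reward} layout for four rewards."""
    if size % 2 == 0:
        raise ValueError("size must be odd so that the grid has a middle cell")
    mid = size // 2  # assumption: size/2 in the description means floor division
    return {
        (0, mid): (2 - epsilon) * best_reward,
        (mid, 0): epsilon * best_reward,
        (mid, size - 1): epsilon * best_reward,
        (size - 1, mid): best_reward,
    }


# Example on a 5x5 grid: the agent would start in the middle cell.
spec = four_minima_reward_spec(epsilon=0.5, best_reward=5, size=5)
start = (5 // 2, 5 // 2)
```

With a small epsilon, the reward at (0, size/2) approaches twice best_reward while the two side rewards shrink towards zero.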
- emdp.examples.tricky_gridworlds.make_multi_minima_reward_env(best_reward=5, size=10, p_success=1, gamma=0.99, shuffle_rewards=False, seed=2017)
Multiple minima grid world where there are rewards on the diagonal with increasing value. For example, in a 3x3 grid world we have:
(2,0): best_reward
(1,1): best_reward*2/3
(0,2): best_reward*1/3
If shuffle_rewards is True, the rewards are shuffled along the diagonal.
- Parameters
best_reward – the largest reward placed on the diagonal
size – size of the grid world
p_success – the probability an action is successful
gamma – the discount factor for the MDP
shuffle_rewards – shuffle the rewards on the diagonal.
seed – the seed for the MDP
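The 3x3 example for make_multi_minima_reward_env suggests a general rule for any grid size. The sketch below is one plausible reading, not the emdp implementation: `multi_minima_reward_spec` is a hypothetical helper, the formula best_reward * (size - i) / size is an assumption extrapolated from the 3x3 case, and the seeded shuffle uses Python's standard `random` module, which may differ from whatever generator emdp uses.

```python
import random


def multi_minima_reward_spec(best_reward=5, size=10, shuffle_rewards=False, seed=2017):
    """Hypothetical helper: rewards on the anti-diagonal, scaled as in the 3x3 case."""
    # Assumption: cell (size-1-i, i) receives best_reward * (size - i) / size.
    cells = [(size - 1 - i, i) for i in range(size)]
    rewards = [best_reward * (size - i) / size for i in range(size)]
    if shuffle_rewards:
        # Seeded shuffle so that the jumbled layout is reproducible.
        random.Random(seed).shuffle(rewards)
    return dict(zip(cells, rewards))


# Reproduces the 3x3 layout from the description when best_reward=3.
example = multi_minima_reward_spec(best_reward=3, size=3)
```

Shuffling changes only which diagonal cell holds which reward; the set of cells and the set of reward values stay the same.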
- emdp.examples.tricky_gridworlds.make_symmetric_epsilon_reward_env(epsilon, best_reward=5, size=10, p_success=1, gamma=0.99, seed=2017)
Symmetric Grid World where the rewards are at [(x,y): reward]:
(0, size-1): +best_reward
(size-1, 0): epsilon * best_reward
- Parameters
epsilon – the proportion of the true reward to place at (size-1, 0)
best_reward – the true reward, placed at (0, size-1)
size – size of the grid world
p_success – the probability an action is successful
gamma – the discount factor for the MDP
seed – the seed for the MDP
- Returns
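For the symmetric world, the reward layout reduces to two opposite corners. As a minimal sketch under the same caveats as elsewhere on this page (`symmetric_reward_spec` is a hypothetical helper, not part of emdp):

```python
def symmetric_reward_spec(epsilon, best_reward=5, size=10):
    """Hypothetical helper: two rewards in opposite corners of the grid."""
    return {
        (0, size - 1): best_reward,            # the true reward
        (size - 1, 0): epsilon * best_reward,  # the epsilon-scaled reward
    }


# Example on a 3x3 grid with the second reward at half the true reward.
corners = symmetric_reward_spec(epsilon=0.5, best_reward=4, size=3)
```

With epsilon close to 1 the two corners become nearly indistinguishable by reward alone, which is presumably what makes the environment "tricky".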