API#

Subpackages#

Submodules#

emdp.actions module#

emdp.analytic module#

Tools to get analytic solutions from MDPs.

We can compute \(v_\pi(s)\) by solving the system of Bellman equations below [Bellman1957]:

\[\begin{split}\begin{align} v_\pi(s) &= \sum_{a} \left[ \pi(a|s) \left( r(s,a) + \gamma \sum_{s'} p(s'|s,a) v_\pi(s') \right) \right] \\ &=\sum_a \pi(a|s)r(s,a) + \gamma \sum_{s'} \left[ \left(\sum_a \pi(a|s)p(s'|s,a)\right) v_\pi(s') \right] \\ &=r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s) v_\pi(s') \end{align}\end{split}\]

These equations can also be written in matrix form with \(\mathbf{v}_\pi, \mathbf{r}_\pi \in \mathbb{R}^{|\mathcal{S}|}\) and \(\mathbf{p}_\pi \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}\):

\[\begin{split}\begin{align} \mathbf{v}_\pi &= \mathbf{r}_\pi + \gamma \mathbf{p}_\pi \mathbf{v}_\pi \\ &= (\mathbf{I} - \gamma \mathbf{p}_\pi)^{-1} \mathbf{r}_\pi \\ &= \Phi \mathbf{r}_\pi \end{align}\end{split}\]
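
The step from the first to the second line of the matrix form can be spelled out as follows. This is a sketch that assumes \(0 \le \gamma < 1\) and a row-stochastic \(\mathbf{p}_\pi\), so that \(\mathbf{I} - \gamma \mathbf{p}_\pi\) is invertible; \(\Phi\) denotes the successor representation \((\mathbf{I} - \gamma \mathbf{p}_\pi)^{-1}\) documented under calculate_successor_representation():

\[\begin{split}\begin{align} \mathbf{v}_\pi - \gamma \mathbf{p}_\pi \mathbf{v}_\pi &= \mathbf{r}_\pi \\ (\mathbf{I} - \gamma \mathbf{p}_\pi) \mathbf{v}_\pi &= \mathbf{r}_\pi \\ \mathbf{v}_\pi &= (\mathbf{I} - \gamma \mathbf{p}_\pi)^{-1} \mathbf{r}_\pi = \Phi \mathbf{r}_\pi \end{align}\end{split}\]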
[Bellman1957] Bellman, Richard. 1957. “A Markovian Decision Process.” Journal of Mathematics and Mechanics: 679–684.

emdp.analytic.calculate_P_pi(P, pi)[source]#

Calculates the transition matrix \(\mathbf{p}_\pi\) under policy \(\pi\), defined as \(p_\pi := \Pr(s'|s, a\sim\pi)\) and represented as a matrix of shape \(|\mathcal{S}|\times|\mathcal{S}|\).

\[p_\pi(s,s') = \sum_a \pi(a|s) p(s'|s, a)\]

where \(s\) and \(s'\) are the states before and after taking action \(a\).

Parameters
  • P (np.ndarray) – transition matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|\)

  • pi (np.ndarray) – matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\) indicating the policy

Returns

a matrix of size \(|\mathcal{S}|\times|\mathcal{S}|\)

Return type

np.ndarray
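
The sum over actions above can be sketched in plain NumPy. This is an illustrative sketch, not necessarily how emdp.analytic implements it; the toy 2-state, 2-action MDP and its numbers are made up for the example.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (values chosen for illustration only).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0 under action 0, action 1
    [[0.5, 0.5], [0.0, 1.0]],  # from state 1 under action 0, action 1
])  # shape (|S|, |A|, |S|)

pi = np.array([
    [0.5, 0.5],  # state 0: uniform over both actions
    [1.0, 0.0],  # state 1: always action 0
])  # shape (|S|, |A|)

# p_pi(s, s') = sum_a pi(a|s) * p(s'|s, a)
P_pi = np.einsum("sa,sat->st", pi, P)

print(P_pi)              # shape (|S|, |S|)
print(P_pi.sum(axis=1))  # each row is a probability distribution
```

Each row of the result is a convex combination of the per-action transition rows, so the rows still sum to one.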

emdp.analytic.calculate_R_pi(R, pi)[source]#

Calculates the expected reward \(r_\pi\) under policy \(\pi\), which is represented as a vector of length \(|\mathcal{S}|\).

\[r_\pi(s) = \sum_a \pi(a|s) r(s,a)\]
Parameters
  • R (np.ndarray) – reward matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\)

  • pi (np.ndarray) – matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\) indicating the policy

Returns

a vector of length \(|\mathcal{S}|\)

Return type

np.ndarray
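
The expected reward formula above amounts to a per-state weighted sum over actions. The following is a minimal NumPy sketch with made-up numbers, not necessarily the library's own implementation.

```python
import numpy as np

# Hypothetical reward table and policy for a 2-state, 2-action MDP.
R = np.array([
    [1.0, 0.0],  # rewards in state 0 for action 0, action 1
    [0.0, 2.0],  # rewards in state 1 for action 0, action 1
])  # shape (|S|, |A|)

pi = np.array([
    [0.5, 0.5],
    [1.0, 0.0],
])  # shape (|S|, |A|)

# r_pi(s) = sum_a pi(a|s) * r(s, a)
R_pi = (pi * R).sum(axis=1)

print(R_pi)  # one expected reward per state
```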

emdp.analytic.calculate_V_pi(P, R, pi, gamma)[source]#

Calculates the state-value vector \(\mathbf{v}_\pi\) in closed form by applying the successor representation to the expected reward:

\[(\mathbf{I} - \gamma \mathbf{p}_\pi)^{-1} \mathbf{r}_\pi\]

where \(p_\pi(s,t) = \sum_a \pi(a|s) p(t|s, a)\) and \(r_\pi(s) = \sum_a \pi(a|s) r(s,a)\).

See also emdp.analytic.calculate_P_pi() and emdp.analytic.calculate_R_pi().

Note

We can compute \(v_\pi(s)\) by solving the system of Bellman equations below [Bellman1957]:

\[\begin{split}\begin{align} v_\pi(s) &= \sum_{a} \left[ \pi(a|s) \left( r(s,a) + \gamma \sum_{s'} p(s'|s,a) v_\pi(s') \right) \right] \\ &=\sum_a \pi(a|s)r(s,a) + \gamma \sum_{s'} \left[ \left(\sum_a \pi(a|s)p(s'|s,a)\right) v_\pi(s') \right] \\ &=r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s) v_\pi(s') \end{align}\end{split}\]

These equations can also be written in matrix form with \(\mathbf{v}_\pi, \mathbf{r}_\pi \in \mathbb{R}^{|\mathcal{S}|}\) and \(\mathbf{p}_\pi \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}\):

\[\begin{split}\begin{align} \mathbf{v}_\pi &= \mathbf{r}_\pi + \gamma \mathbf{p}_\pi \mathbf{v}_\pi \\ &= (\mathbf{I} - \gamma \mathbf{p}_\pi)^{-1} \mathbf{r}_\pi \\ &= \Phi \mathbf{r}_\pi \end{align}\end{split}\]
Parameters
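  • P (np.ndarray) – transition matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|\)

  • R (np.ndarray) – reward matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\)

  • pi (np.ndarray) – policy matrix of size \(|\mathcal{S}|\times|\mathcal{A}|\)

  • gamma (float) – discount factor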

Returns

state-value vector under policy \(\pi\).

Return type

np.ndarray
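
The closed form above can be checked numerically. The sketch below defines a hypothetical helper, `v_pi_sketch` (not part of emdp), that uses `np.linalg.solve` instead of an explicit inverse. That is a substitution made here for numerical stability, not a statement about how the library computes it. The toy MDP is invented for the example.

```python
import numpy as np

def v_pi_sketch(P, R, pi, gamma):
    """Hypothetical helper: solve (I - gamma * P_pi) v = R_pi for v."""
    P_pi = np.einsum("sa,sat->st", pi, P)  # (|S|, |S|)
    R_pi = np.einsum("sa,sa->s", pi, R)    # (|S|,)
    n_states = P_pi.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Made-up 2-state, 2-action MDP.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [1.0, 0.0]])
gamma = 0.9

V = v_pi_sketch(P, R, pi, gamma)

# The solution should satisfy the Bellman equation v = r_pi + gamma * p_pi v.
P_pi = np.einsum("sa,sat->st", pi, P)
R_pi = np.einsum("sa,sa->s", pi, R)
residual = V - (R_pi + gamma * P_pi @ V)
print(V, np.abs(residual).max())
```

A residual near zero confirms that the solved vector is a fixed point of the Bellman equation in the note above.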

emdp.analytic.calculate_V_pi_from_successor_representation(Phi, R_pi)[source]#

Calculates the state-value vector \(\mathbf{v}_\pi\) from the successor representation \(\Phi\) and the expected reward \(\mathbf{r}_\pi\).

see also: emdp.analytic.calculate_V_pi()

Parameters
  • Phi (np.ndarray) – successor representation of size \(|\mathcal{S}|\times|\mathcal{S}|\)

  • R_pi (np.ndarray) – expected reward of size \(|\mathcal{S}|\)

Returns

value function of size \(|\mathcal{S}|\)

Return type

np.ndarray
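
Given \(\Phi\) and \(\mathbf{r}_\pi\), the value vector is the matrix-vector product \(\Phi \mathbf{r}_\pi\) from the matrix form of the Bellman equations. A minimal sketch follows, assuming a made-up 2-state chain in which both states move to state 1 and \(\gamma = 0.5\); the numbers are illustrative only.

```python
import numpy as np

# Successor representation for P_pi = [[0, 1], [0, 1]] and gamma = 0.5,
# i.e. the inverse of [[1, -0.5], [0, 0.5]] (worked out by hand for this toy case).
Phi = np.array([
    [1.0, 1.0],
    [0.0, 2.0],
])
R_pi = np.array([1.0, 2.0])  # hypothetical expected rewards per state

# v_pi = Phi r_pi
V_pi = Phi @ R_pi
print(V_pi)
```

As a sanity check, state 1 satisfies \(v = 2 + 0.5 v\), giving 4, and state 0 then gets \(1 + 0.5 \cdot 4 = 3\).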

emdp.analytic.calculate_successor_representation(P_pi, gamma)[source]#

Calculates the successor representation \(\Phi\):

\[\Phi := (\mathbf{I} - \gamma \mathbf{p}_\pi)^{-1}\]

see also: emdp.analytic.calculate_V_pi()

Parameters
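  • P_pi (np.ndarray) – transition matrix under policy \(\pi\) of size \(|\mathcal{S}|\times|\mathcal{S}|\)

  • gamma (float) – discount factor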

Returns

successor representation

Return type

np.ndarray
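
One way to build intuition for the definition above: for \(\gamma < 1\) the inverse equals the Neumann series \(\sum_{k \ge 0} (\gamma \mathbf{p}_\pi)^k\), so \(\Phi(s, s')\) is the expected discounted number of visits to \(s'\) starting from \(s\). The sketch below compares the two numerically on a made-up chain; it is not taken from the library.

```python
import numpy as np

# Hypothetical policy-induced transition matrix: both states move to state 1.
P_pi = np.array([
    [0.0, 1.0],
    [0.0, 1.0],
])
gamma = 0.5

# Direct definition: Phi = (I - gamma * P_pi)^{-1}
Phi = np.linalg.inv(np.eye(2) - gamma * P_pi)

# Truncated Neumann series: sum_{k=0}^{59} (gamma * P_pi)^k
series = sum(np.linalg.matrix_power(gamma * P_pi, k) for k in range(60))

print(Phi)
print(np.abs(Phi - series).max())  # truncation error, negligible here
```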

emdp.common module#

class emdp.common.Env(seed)[source]#

Bases: object

Abstract Environment wrapper.

Parameters

seed (int) – A seed for the random number generator.

set_seed(seed)[source]#
class emdp.common.MDP(P, R, gamma, p0, terminal_states, seed=1337, skip_check=False)[source]#

Bases: emdp.common.Env

reset()[source]#
set_current_state_to(state)[source]#
step(action)[source]#
Parameters

action – An integer representing the action taken.

Returns

emdp.exceptions module#

exception emdp.exceptions.EpisodeDoneError[source]#

Bases: TimeoutError

An error for when the episode is over.

exception emdp.exceptions.InvalidActionError[source]#

Bases: ValueError

An error for when an invalid action is taken.

emdp.torch_analytic module#

Tools to get analytic solutions from MDPs.

These functions are differentiable as they are written in torch.

emdp.utils module#

emdp.utils.convert_int_rep_to_onehot(state, vector_size: int)[source]#

Converts the integer representation of a state (or states) to a one-hot representation.

Examples

>>> convert_int_rep_to_onehot(1,5)
array([0, 1, 0, 0, 0])
>>> convert_int_rep_to_onehot(np.array([1,2]),5)
array([[0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0]])
Parameters
  • state – int representation of state (or states).

  • vector_size (int) – size of onehot representation.

Returns

onehot representation of state (or states).

Return type

np.ndarray

emdp.utils.convert_onehot_to_int(state: numpy.ndarray)[source]#

Converts the one-hot representation of a state (or states) to an index (or indices).

Examples

>>> convert_onehot_to_int(np.array([0,0,0,1,0]))
3
>>> convert_onehot_to_int(np.array([[0,0,0,1,0],[0,1,0,0,0]]))
array([3, 1])
Parameters

state (np.ndarray) – onehot representation of state (or states).

Returns

index (or indices).
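
The two conversions are inverses of each other. The round trip can be sketched with plain NumPy operations (`np.eye` row indexing and `argmax`); these stand in for the emdp.utils functions and are not necessarily how they are implemented.

```python
import numpy as np

vector_size = 5
states = np.array([1, 2])  # integer state indices

# int -> one-hot: pick rows of the identity matrix.
onehot = np.eye(vector_size, dtype=int)[states]

# one-hot -> int: the position of the single 1 along the last axis.
recovered = onehot.argmax(axis=-1)

print(onehot)
print(recovered)
```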

Module contents#