marlax.agents.QAgent#

class marlax.agents.QAgent(init_position=None, actions=['stay', 'up', 'down', 'left', 'right'])[source]#

Bases: Agent

Q-learning agent that chooses actions based on a Q-table.

position#

Agent’s (x, y) position on the grid.

Type:

tuple

actions#

List of possible actions.

Type:

list

q_table#

Maps each state key to a dict mapping actions to Q-values.

Type:

dict

__init__(init_position=None, actions=['stay', 'up', 'down', 'left', 'right'])[source]#

Initialize an agent with a starting position and possible actions.

Parameters:
  • init_position (tuple, optional) – The (x, y) starting coordinates. Defaults to None.

  • actions (list of str, optional) – Available actions. Defaults to [‘stay’, ‘up’, ‘down’, ‘left’, ‘right’].
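
Example (a minimal construction sketch based only on the signature above; the initial contents of q_table are not specified here, so the last comment is an assumption):

    from marlax.agents import QAgent

    # Create an agent at grid cell (0, 0) with the default action set.
    agent = QAgent(init_position=(0, 0))

    agent.position   # (0, 0)
    agent.actions    # ['stay', 'up', 'down', 'left', 'right']
    agent.q_table    # dict of state_key -> {action: value}; assumed to start
                     # empty, with entries created lazily (see get_max_state)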

Methods

__init__([init_position, actions])

Initialize an agent with a starting position and possible actions.

choose(possible_states[, epsilon, agent_id])

Select an action using an epsilon-greedy policy.

get_max_state(possible_states)

Identify the state with the highest maximum Q-value over its actions.

update(state_key, action, reward, next_state_key)

Update the Q-table entry for a given state and action.

choose(possible_states, epsilon=0.1, agent_id=0)[source]#

Select an action using an epsilon-greedy policy.

Parameters:
  • possible_states (list) – List of global state keys to evaluate.

  • epsilon (float) – Exploration probability. Defaults to 0.1.

  • agent_id (int) – Identifier for this agent among multiple agents. Defaults to 0.

Returns:

The chosen action.

Return type:

str
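
Example (a hedged sketch of one epsilon-greedy selection step; the state keys below are hypothetical placeholders, not values produced by marlax itself):

    from marlax.agents import QAgent

    agent = QAgent(init_position=(0, 0))

    # Hypothetical global state keys supplied by the environment.
    possible_states = ['s_stay', 's_up', 's_down', 's_left', 's_right']

    # With probability epsilon a random action is returned (exploration);
    # otherwise the action associated with the best-valued state is chosen.
    action = agent.choose(possible_states, epsilon=0.1, agent_id=0)
    print(action)   # one of agent.actions, e.g. 'up'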

get_max_state(possible_states)[source]#

Identify the state with the highest maximum Q-value over its actions.

This helper initializes missing table entries to zero.

Parameters:

possible_states (list) – List of global state keys.

Returns:

The state key whose maximum action value is highest.

Return type:

any
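
Example (a sketch with hypothetical state keys; per the note above, unseen keys are added to the Q-table with zero-valued entries instead of raising a KeyError):

    from marlax.agents import QAgent

    agent = QAgent(init_position=(0, 0))

    # Hypothetical state keys the agent has not seen yet.
    candidates = ['s_a', 's_b', 's_c']

    best = agent.get_max_state(candidates)
    # All three keys now have zero-valued entries in agent.q_table,
    # so on an untrained agent the result is simply one of the candidates.
    print(best in candidates)   # True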

update(state_key, action, reward, next_state_key, alpha=0.1, gamma=0.99)[source]#

Update the Q-table entry for a given state and action.

Applies the Q-learning update rule:

\[Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)\]

where:

  • \(s\) is the current state,

  • \(a\) is the action taken,

  • \(r\) is the reward received,

  • \(s'\) is the next state,

  • \(\max_{a'} Q(s',a')\) is the maximum Q-value over actions in the next state.

Parameters:
  • state_key (any) – Current global state key.

  • action (str) – Action taken by the agent.

  • reward (float) – Reward received after action.

  • next_state_key (any) – Next global state key.

  • alpha (float) – Learning rate. Defaults to 0.1.

  • gamma (float) – Discount factor. Defaults to 0.99.
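
Example (a sketch of a single update with the default alpha and gamma; get_max_state is called first so both state keys have zero-valued entries, as documented above):

    from marlax.agents import QAgent

    agent = QAgent(init_position=(0, 0))

    s, s_next = 'state_A', 'state_B'    # hypothetical state keys
    agent.get_max_state([s, s_next])    # ensure zero-initialized entries

    # One Q-learning step. With Q(s, 'up') = 0 and max_a' Q(s_next, a') = 0:
    #   Q(s, 'up') <- 0 + 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
    agent.update(s, 'up', reward=1.0, next_state_key=s_next)

    print(agent.q_table[s]['up'])       # 0.1 under the assumptions above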