marlax.agents.QValueAgent#

class marlax.agents.QValueAgent(init_position=None, actions=['stay', 'up', 'down', 'left', 'right'])[source]#

Bases: Agent

Agent that selects actions based on scalar Q-values per global state.

position#

Agent’s (x, y) position in the grid.

Type:

tuple

actions#

Available actions.

Type:

list of str

q_table#

Maps state_key to a scalar Q-value.

Type:

defaultdict

action_map#

Maps (dx, dy) offsets to action names.

Type:

dict
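
For illustration, such a mapping could look like the sketch below; the exact offsets and the sign convention for up/down are an assumption, not taken from the source.

>>> example_map = {(0, 0): 'stay', (0, -1): 'up', (0, 1): 'down',
...                (-1, 0): 'left', (1, 0): 'right'}  # hypothetical sign convention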

__init__(init_position=None, actions=['stay', 'up', 'down', 'left', 'right'])[source]#

Initialize agent with starting position and possible actions.

Parameters:
  • init_position (tuple, optional) – The (x, y) starting coordinates. Defaults to None.

  • actions (list of str, optional) – Available actions. Defaults to ['stay', 'up', 'down', 'left', 'right'].
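
Example. A minimal construction sketch using the documented defaults:

>>> from marlax.agents import QValueAgent
>>> agent = QValueAgent(init_position=(0, 0))
>>> agent.position
(0, 0)
>>> agent.actions
['stay', 'up', 'down', 'left', 'right']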

Methods

__init__([init_position, actions])

Initialize agent with starting position and possible actions.

choose(possible_states[, epsilon, agent_id])

Select an action with epsilon-greedy exploration.

get_max_state(possible_states)

Identify the state with the highest scalar Q-value.

update(state_key, action, reward, next_state_key)

Update the scalar Q-value for a state using a simple update rule.

choose(possible_states, epsilon=0.1, agent_id=0)[source]#

Select an action with epsilon-greedy exploration.

Parameters:
  • possible_states (list) – List of global state keys to evaluate.

  • epsilon (float) – Exploration probability. Defaults to 0.1.

  • agent_id (int) – Identifier for this agent. Defaults to 0.

Returns:

The action corresponding to the move toward the highest-valued state, or a random exploratory action with probability epsilon.

Return type:

str
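
Example. A usage sketch only; the structure of the global state keys is environment-specific, so the keys below are hypothetical and the calls are not meant to run verbatim:

>>> states = [state_a, state_b]  # doctest: +SKIP
>>> action = agent.choose(states, epsilon=0.1, agent_id=0)  # doctest: +SKIP
>>> action in agent.actions  # doctest: +SKIP
True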

get_max_state(possible_states)[source]#

Identify the state with the highest scalar Q-value.

Parameters:

possible_states (list) – List of global state keys.

Returns:

The state_key with the maximal Q-value.

Return type:

any
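
Example. Since q_table maps state keys to scalar values, the argmax can be exercised by seeding the table directly; string keys are used here purely for illustration:

>>> agent.q_table['s1'] = 0.5
>>> agent.q_table['s2'] = 1.2
>>> agent.get_max_state(['s1', 's2'])
's2'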

update(state_key, action, reward, next_state_key, alpha=0.1, gamma=0.99)[source]#

Update the scalar Q-value for a state using a simple update rule.

\[Q(s) \leftarrow (1 - \alpha)\,Q(s) + \alpha\,\bigl(r + \gamma\,Q(s')\bigr)\]

where:

  • \(Q(s)\) is the current Q-value for the state,

  • \(Q(s')\) is the Q-value for the next state,

  • \(r\) is the reward received after the action,

  • \(\alpha\) is the learning rate,

  • \(\gamma\) is the discount factor.

Parameters:
  • state_key (any) – Current global state key.

  • action (str) – Action taken. Unused, because the Q-value is attached to the state rather than to a state-action pair.

  • reward (float) – Reward received after action.

  • next_state_key (any) – Next global state key.

  • alpha (float) – Learning rate. Defaults to 0.1.

  • gamma (float) – Discount factor. Defaults to 0.99.
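
Example. A worked application of the rule above, assuming update writes the new value back to q_table[state_key]; string keys are used purely for illustration:

>>> agent.q_table['s'] = 1.0
>>> agent.q_table['t'] = 2.0
>>> agent.update('s', 'up', reward=1.0, next_state_key='t', alpha=0.1, gamma=0.99)
>>> # (1 - 0.1) * 1.0 + 0.1 * (1.0 + 0.99 * 2.0) = 0.9 + 0.298 = 1.198
>>> round(agent.q_table['s'], 3)
1.198

Passing a different action would yield the same result, since the value is state-based.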