marlax.envs.GridWorld#
- class marlax.envs.GridWorld(grid, agents, target_rewards, together_reward, travel_reward, wrong_zone_penalty=-500, mismatch_penalty=-250)[source]#
Bases: Environment
Multi-agent grid environment with dynamic reward activation and penalties.
- grid#
Dimensions (width, height) of the grid.
- Type:
tuple[int, int]
- agents#
Agent instances operating in the grid.
- Type:
list[Agent]
- target_rewards#
Per-agent reward for reaching the correct target.
- Type:
list[float]
- together_reward#
Bonus awarded when all agents occupy the same cell.
- Type:
float
- travel_reward#
Penalty (cost) for each move.
- Type:
float
- wrong_zone_penalty#
Penalty for entering a wrong reward zone.
- Type:
float
- mismatch_penalty#
Penalty if agents split between correct zones.
- Type:
float
- possibilities#
Possible reward configurations.
- Type:
list[str]
- center_pos#
Coordinates of grid center.
- Type:
tuple[int,int]
- reward_place_to_coord#
Maps target IDs to coordinates.
- Type:
dict[str, tuple[tuple[int,int], …]]
- moves#
Maps action names to (dx,dy) offsets.
- Type:
dict[str, tuple[int,int]]
- poss_act_combinations#
All joint action combinations, one action per agent.
- Type:
list[tuple[str,…]]
- steps_without_reward#
Counter for steps without reward.
- Type:
int
- no_reward_threshold#
Number of reward-free steps allowed before a forced reset.
- Type:
int
- __init__(grid, agents, target_rewards, together_reward, travel_reward, wrong_zone_penalty=-500, mismatch_penalty=-250)[source]#
Initialize the grid world parameters and agents.
- Parameters:
grid (tuple[int,int]) – Grid dimensions as (width, height).
agents (list[Agent]) – Agent instances present in the environment.
target_rewards (list[float]) – Reward values per agent for correct target.
together_reward (float) – Bonus reward if all agents share a cell.
travel_reward (float) – Cost (negative reward) per move.
wrong_zone_penalty (float, optional) – Penalty for entering a wrong zone. Defaults to -500.
mismatch_penalty (float, optional) – Penalty if agents split between two target zones. Defaults to -250.
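A minimal construction sketch follows. The GridWorld arguments mirror the signature above; the Agent import path and its no-argument constructor are assumptions made for illustration and are not documented on this page.

```python
from marlax.envs import GridWorld
from marlax import Agent  # import path assumed; Agent is referenced but not documented here

# Two agents on a 5x5 grid. Only the GridWorld signature is taken from this page;
# the Agent() constructor arguments are assumed for illustration.
agents = [Agent(), Agent()]
env = GridWorld(
    grid=(5, 5),
    agents=agents,
    target_rewards=[100.0, 100.0],  # per-agent reward for the correct target
    together_reward=10.0,           # bonus when all agents share a cell
    travel_reward=-1.0,             # cost applied on every move
)                                   # wrong_zone_penalty and mismatch_penalty keep their defaults
```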
Methods
__init__(grid, agents, target_rewards, ...) – Initialize the grid world parameters and agents.
check_and_activate_rewards() – Check if any agent is at the center and no reward target is active.
check_mismatch() – Detect if agents split between two correct target zones.
check_wrong_reward_zones() – Check if any agent enters a non-target reward zone.
compute_rewards(rewards) – Compute rewards based on agent positions and active reward target.
get_possible_states() – Compute all possible next global states from current positions.
get_state() – Get the current global state representation.
move_agents(actions) – Update agent positions based on provided actions.
reset() – Randomly reposition agents and clear active rewards.
step(actions) – Execute one time step in the environment.
- check_and_activate_rewards()[source]#
Check if any agent is at the center and no reward target is active. If so, activate the reward target. This method is meant to be overridden by regime-specific environments.
- Returns:
True if activation occurred.
- Return type:
bool
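Since this hook is intended to be overridden, here is a sketch of a regime-specific subclass. The subclass name, the agent.position attribute, and the active_reward_target field are assumptions made for illustration; only center_pos, possibilities, and agents are attributes documented on this page.

```python
import random
from marlax.envs import GridWorld

class CenterGatedGridWorld(GridWorld):  # hypothetical subclass name
    def check_and_activate_rewards(self):
        # Activate a target only when some agent stands on the grid center
        # and no target is active yet. `agent.position` and `self.active_reward_target`
        # are assumed attribute names, not taken from this page.
        if self.active_reward_target is None and any(
            agent.position == self.center_pos for agent in self.agents
        ):
            self.active_reward_target = random.choice(self.possibilities)
            return True
        return False
```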
- check_mismatch()[source]#
Detect if agents split between two correct target zones.
- Returns:
True if agents occupy both target cells.
- Return type:
bool
- check_wrong_reward_zones()[source]#
Check if any agent enters a non-target reward zone.
- Returns:
True if a wrong-zone entry occurred.
- Return type:
bool
- compute_rewards(rewards)[source]#
Compute rewards based on agent positions and active reward target. Modify the rewards list in place.
- Parameters:
rewards (list[float]) – Current rewards (modified in place).
- Returns:
- (collected, rewards_list)
collected (bool): True if any reward was collected.
rewards_list (list[float]): The updated per-agent rewards.
- Return type:
tuple
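A short usage sketch: the list passed in is both mutated and returned, so either handle works. Starting from a zeroed per-agent list is an assumption for illustration.

```python
# compute_rewards updates the list in place and also returns it.
rewards = [0.0 for _ in env.agents]
collected, rewards = env.compute_rewards(rewards)
if collected:
    print("reward collected:", rewards)
```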
- get_possible_states()[source]#
Compute all possible next global states from current positions.
- Returns:
List of (positions, active_reward_target) for each action combo.
- Return type:
list[tuple]
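This is useful for one-step lookahead; each entry pairs the agents' would-be positions with the active reward target, as described above.

```python
# Enumerate every joint-action outcome reachable from the current state.
for positions, active_target in env.get_possible_states():
    print(positions, active_target)
```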
- get_state()[source]#
Get the current global state representation.
- Returns:
((agent_positions), active_reward_target)
- Return type:
tuple
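For reference, the state unpacks as the tuple described above:

```python
# ((agent_positions), active_reward_target)
agent_positions, active_target = env.get_state()
```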
- move_agents(actions)[source]#
Update agent positions based on provided actions.
- Parameters:
actions (list[str]) – Actions for each agent.
- reset()[source]#
Randomly reposition agents and clear active rewards.
Sets each agent to a random cell and chooses a new true_reward_target.
- step(actions)[source]#
Execute one time step in the environment:
1. Move agents according to their actions.
2. Check for reward activation (e.g., an agent reaching the center).
3. Check if agents are at the designated reward positions.
4. Apply travel penalty and together bonus.
5. Reset if no reward is collected for too long.
- Parameters:
actions (list[str]) – Action for each agent.
- Returns:
- (next_state, rewards, info)
next_state (tuple): New global state.
rewards (list[float]): Reward per agent.
info (dict): Diagnostics including activation and termination flags.
- Return type:
tuple
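A minimal rollout sketch using the environment constructed earlier on this page. Concrete action names are not listed here, so actions are sampled from the keys of env.moves.

```python
import random

env.reset()
state = env.get_state()
for t in range(100):
    # One action name per agent, drawn from the documented action-name -> (dx, dy) mapping.
    actions = [random.choice(list(env.moves.keys())) for _ in env.agents]
    state, rewards, info = env.step(actions)
```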