marlax.envs.GridWorld#
- class marlax.envs.GridWorld(grid, agents, target_rewards, together_reward, travel_reward, wrong_zone_penalty=-500, mismatch_penalty=-250)[source]#
Bases: Environment
Multi-agent grid environment with dynamic reward activation and penalties.
- grid#
Dimensions (width, height) of the grid.
- Type:
tuple[int, int]
- agents#
Agent instances operating in the grid.
- Type:
list[Agent]
- target_rewards#
Per-agent reward for reaching the correct target.
- Type:
list[float]
- together_reward#
Bonus awarded when all agents occupy the same cell.
- Type:
float
- travel_reward#
Penalty (cost) for each move.
- Type:
float
- wrong_zone_penalty#
Penalty for entering a wrong reward zone.
- Type:
float
- mismatch_penalty#
Penalty if agents split between correct zones.
- Type:
float
- possibilities#
Possible reward configurations.
- Type:
list[str]
- center_pos#
Coordinates of grid center.
- Type:
tuple[int,int]
- reward_place_to_coord#
Maps target IDs to coordinates.
- Type:
dict[str, tuple[tuple[int,int], …]]
- moves#
Maps action names to (dx,dy) offsets.
- Type:
dict[str, tuple[int,int]]
- poss_act_combinations#
All joint action combinations, one action per agent.
- Type:
list[tuple[str,…]]
- steps_without_reward#
Counter for steps without reward.
- Type:
int
- no_reward_threshold#
Number of reward-free steps allowed before a forced reset.
- Type:
int
- __init__(grid, agents, target_rewards, together_reward, travel_reward, wrong_zone_penalty=-500, mismatch_penalty=-250)[source]#
Initialize the grid world parameters and agents.
- Parameters:
grid (tuple[int,int]) – Grid dimensions as (width, height).
agents (list[Agent]) – Agent instances present in the environment.
target_rewards (list[float]) – Reward values per agent for correct target.
together_reward (float) – Bonus reward if all agents share a cell.
travel_reward (float) – Cost (negative reward) per move.
wrong_zone_penalty (float, optional) – Penalty for entering a wrong zone. Defaults to -500.
mismatch_penalty (float, optional) – Penalty if agents split between two target zones. Defaults to -250.
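A minimal construction sketch follows. The GridWorld arguments mirror the signature above; the Agent import path and its no-argument constructor are assumptions made for illustration and are not documented on this page.

```python
from marlax.envs import GridWorld
from marlax import Agent  # import path assumed; Agent is referenced but not documented here

# Two agents on a 5x5 grid. Only the GridWorld signature is taken from this page;
# the Agent() constructor arguments are assumed for illustration.
agents = [Agent(), Agent()]
env = GridWorld(
    grid=(5, 5),
    agents=agents,
    target_rewards=[100.0, 100.0],  # per-agent reward for the correct target
    together_reward=10.0,           # bonus when all agents share a cell
    travel_reward=-1.0,             # cost applied on every move
)                                   # wrong_zone_penalty and mismatch_penalty keep their defaults
```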
Methods
__init__(grid, agents, target_rewards, ...) – Initialize the grid world parameters and agents.
check_and_activate_rewards() – Check if any agent is at the center and no reward target is active.
check_mismatch() – Detect if agents split between two correct target zones.
check_wrong_reward_zones() – Check if any agent enters a non-target reward zone.
compute_rewards(rewards) – Compute rewards based on agent positions and active reward target.
get_possible_states() – Compute all possible next global states from current positions.
get_state() – Get the current global state representation.
move_agents(actions) – Update agent positions based on provided actions.
reset() – Randomly reposition agents and clear active rewards.
step(actions) – Execute one time step in the environment.
- check_and_activate_rewards()[source]#
Check if any agent is at the center and no reward target is active. If so, activate the reward target. This method is meant to be overridden by regime-specific environments.
- Returns:
True if activation occurred.
- Return type:
bool
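Since this hook is intended to be overridden, here is a sketch of a regime-specific subclass. The subclass name, the agent.position attribute, and the active_reward_target field are assumptions made for illustration; only center_pos, possibilities, and agents are attributes documented on this page.

```python
import random
from marlax.envs import GridWorld

class CenterGatedGridWorld(GridWorld):  # hypothetical subclass name
    def check_and_activate_rewards(self):
        # Activate a target only when some agent stands on the grid center
        # and no target is active yet. `agent.position` and `self.active_reward_target`
        # are assumed attribute names, not taken from this page.
        if self.active_reward_target is None and any(
            agent.position == self.center_pos for agent in self.agents
        ):
            self.active_reward_target = random.choice(self.possibilities)
            return True
        return False
```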
- check_mismatch()[source]#
Detect if agents split between two correct target zones.
- Returns:
True if agents occupy both target cells.
- Return type:
bool
- check_wrong_reward_zones()[source]#
Check if any agent enters a non-target reward zone.
- Returns:
True if a wrong-zone entry occurred.
- Return type:
bool
- compute_rewards(rewards)[source]#
Compute rewards based on agent positions and active reward target. Modify the rewards list in place.
- Parameters:
rewards (list[float]) – Current rewards (modified in place).
- Returns:
- (collected, rewards_list)
collected (bool): True if any reward was collected.
rewards_list (list[float]): The updated per-agent rewards.
- Return type:
tuple
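A short usage sketch: the list passed in is both mutated and returned, so either handle works. Starting from a zeroed per-agent list is an assumption for illustration.

```python
# compute_rewards updates the list in place and also returns it.
rewards = [0.0 for _ in env.agents]
collected, rewards = env.compute_rewards(rewards)
if collected:
    print("reward collected:", rewards)
```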
- get_possible_states()[source]#
Compute all possible next global states from current positions.
- Returns:
List of (positions, active_reward_target) for each action combo.
- Return type:
list[tuple]
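This is useful for one-step lookahead; each entry pairs the agents' would-be positions with the active reward target, as described above.

```python
# Enumerate every joint-action outcome reachable from the current state.
for positions, active_target in env.get_possible_states():
    print(positions, active_target)
```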
- get_state()[source]#
Get the current global state representation.
- Returns:
((agent_positions), active_reward_target)
- Return type:
tuple
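For reference, the state unpacks as the tuple described above:

```python
# ((agent_positions), active_reward_target)
agent_positions, active_target = env.get_state()
```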
- move_agents(actions)[source]#
Update agent positions based on provided actions.
- Parameters:
actions (list[str]) – Actions for each agent.
- reset()[source]#
Randomly reposition agents and clear active rewards.
Sets each agent to a random cell and chooses a new true_reward_target.
- step(actions)[source]#
Execute one time step in the environment:
1. Move agents according to their actions.
2. Check for reward activation (e.g., an agent reaching the center).
3. Check if agents are at the designated reward positions.
4. Apply travel penalty and together bonus.
5. Reset if no reward is collected for too long.
- Parameters:
actions (list[str]) – Action for each agent.
- Returns:
- (next_state, rewards, info)
next_state (tuple): New global state.
rewards (list[float]): Reward per agent.
info (dict): Diagnostics including activation and termination flags.
- Return type:
tuple
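A minimal rollout sketch using the environment constructed earlier on this page. Concrete action names are not listed here, so actions are sampled from the keys of env.moves.

```python
import random

env.reset()
state = env.get_state()
for t in range(100):
    # One action name per agent, drawn from the documented action-name -> (dx, dy) mapping.
    actions = [random.choice(list(env.moves.keys())) for _ in env.agents]
    state, rewards, info = env.step(actions)
```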