# 🏎️ Writing an Efficient RL Environment

When thousands of environments run in parallel on a single GPU, what matters most for throughput is not what the step does, but what it doesn't do. The patterns below keep `env.step()` fully GPU-sync-free: no Python-side `.item()` / `.nonzero()` per step, no implicit host-device transfers, no buffer re-allocation.

## Pre-allocate every buffer

Allocate every tensor your `step` and `reset` will write to *once*, with the final shape and dtype. Use `torch.empty(...)` for buffers that will be overwritten each step (no reason to pay for zeroing) and `torch.zeros(...)` only when the initial value matters (e.g. accumulators).

```python
# good - allocate once
self.obs_buf = torch.empty((num_envs, num_obs), dtype=gs.tc_float, device=gs.device)
self.reset_buf = torch.ones((num_envs,), dtype=gs.tc_bool, device=gs.device)
self.episode_length_buf = torch.empty((num_envs,), dtype=gs.tc_int, device=gs.device)
```

Re-allocating inside `step` (`torch.zeros(...)` per step) drops throughput hard once env count goes up: every allocation goes through the CUDA caching allocator, which can synchronize against pending work whenever it has to request new memory from the driver. Likewise, write into existing storage rather than replacing it:

```python
# bad - allocates a fresh tensor every step
self.commands = torch.where(reset_mask[:, None], new_commands, self.commands)

# good - writes into the existing buffer
torch.where(reset_mask[:, None], new_commands, self.commands, out=self.commands)
```

The `out=` form keeps `self.commands` pointing at the same storage, which matters when something else holds a view of it (a recorder, a logger, an observation builder). Same story for `.copy_(...)` versus `=`, and for `.masked_fill_(...)` versus `torch.where(...)` without `out=`.

## Use boolean masks for `envs_idx`

`(condition).nonzero()[:, 0]` forces a GPU sync - the host needs to know how many indices came out to materialize a 1-D tensor.
**Keep `envs_idx` as a boolean mask** all the way through and feed that mask directly to Genesis APIs and to `torch.where` / `masked_fill_`.

```python
# bad - GPU sync on .nonzero()
reset_idx = self.reset_buf.nonzero()[:, 0]
self.last_actions[reset_idx] = 0.0

# good - boolean mask, no sync
self.last_actions.masked_fill_(self.reset_buf[:, None], 0.0)
```

Genesis solver and entity setters (`set_qpos`, `set_dofs_position`, `set_pos`, ...) accept a boolean mask for `envs_idx`. So does the unified `reset(envs_idx=mask)` entry point.

## Read state through zero-copy accessors

Reading entity state in the hot path is fine - *if* the accessor returns a zero-copy view into Genesis's underlying storage. The reads that support zero-copy on rigid entities, at the time of writing, are:

| Read | Returns |
|---|---|
| `entity.get_pos()` / `entity.get_quat()` | base-link world pose |
| `entity.get_vel()` / `entity.get_ang()` | base-link linear / angular velocity |
| `entity.get_dofs_position()` / `entity.get_dofs_velocity()` | per-DOF position / velocity |
| `entity.get_links_pos()` / `entity.get_links_quat()` / `entity.get_links_vel()` | per-link world poses and velocities |
| `entity.get_contacts()` | active contact set for this entity |

Any other read on the hot path likely allocates a fresh tensor; either lift it out of `step()`, or open an issue if it should be zero-copy.

For sensor outputs specifically, prefer the bulk `scene.read_sensors()` / `entity.read_sensors()` over per-sensor `sensor.read()` calls when you observe many sensors at once (one batched tensor per sensor class instead of N separate calls). The bulk read always allocates fresh storage, but the cost amortizes across every sensor of a class. See {doc}`Sensors <../../sensors/index>` for the bulk-read API.
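To make the read side concrete, here is a minimal sketch of assembling an observation from state reads without allocating a new observation tensor each step. It uses plain CPU tensors as stand-ins for accessor results so it can run without Genesis or a GPU; the shapes, the `dof_vel_scale` value, and the `build_obs` helper are hypothetical and only illustrate the pattern.

```python
import torch

num_envs, num_dofs = 4, 3
dof_vel_scale = 0.05  # hypothetical observation scale

# Stand-ins for accessor results (assumed shapes). In a real env these would
# come from reads such as entity.get_ang() or entity.get_dofs_position().
base_ang_vel = torch.randn(num_envs, 3)
dof_pos = torch.randn(num_envs, num_dofs)
dof_vel = torch.randn(num_envs, num_dofs)

# Allocated once, e.g. in __init__.
obs_buf = torch.empty((num_envs, 3 + 2 * num_dofs))
obs_ptr = obs_buf.data_ptr()


def build_obs():
    # Basic slices are views of obs_buf, so copy_ / mul_ write into the
    # existing storage instead of creating a new observation tensor.
    obs_buf[:, 0:3].copy_(base_ang_vel)
    obs_buf[:, 3:3 + num_dofs].copy_(dof_pos)
    obs_buf[:, 3 + num_dofs:].copy_(dof_vel).mul_(dof_vel_scale)
    return obs_buf


obs = build_obs()
```

Compared with `torch.cat([...], dim=-1)`, which returns a new tensor on every call, the slice writes keep `obs_buf` at the same address, so anything holding a reference to it (a logger, a rollout buffer wrapper) keeps seeing current data.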
## Reset robot state

The combination that resets a batch of envs without any GPU sync uses (a) a boolean mask for `envs_idx`, (b) the zero-copy setters with explicit pre-allocated source tensors, and (c) `skip_forward=True` so forward kinematics is computed once on the next `scene.step()` instead of inside every setter call:

```python
# `mask` is a (num_envs,) bool tensor; `init_qpos` is pre-allocated in __init__
self.robot.set_qpos(self.init_qpos, envs_idx=mask, zero_velocity=True, skip_forward=True)
self.robot.set_dofs_velocity(self.init_dof_vel, envs_idx=mask, skip_forward=True)
```

When resetting *all* environments, pass `envs_idx=None` (or omit it) - the implementation hits a faster "full overwrite" path that skips the per-env mask machinery. The recommended pattern is a single `reset(envs_idx=None | bool_mask)` entry point that branches once:

```python
def reset(self, envs_idx=None):
    self.robot.set_qpos(self.init_qpos, envs_idx=envs_idx, zero_velocity=True, skip_forward=True)
    if envs_idx is None:
        self.last_actions.zero_()
        self.episode_length_buf.zero_()
        self.reset_buf.fill_(True)
    else:
        self.last_actions.masked_fill_(envs_idx[:, None], 0.0)
        self.episode_length_buf.masked_fill_(envs_idx, 0)
        self.reset_buf.masked_fill_(envs_idx, True)
```

For coarse resets that touch every solver state at once, `scene.rigid_solver.set_state(state_idx, state, envs_idx=mask, partial=True)` is the bulk equivalent - `partial=True` is the fast path; `partial=False` resets the whole scene and is significantly slower because of the auxiliary state it has to rebuild.

Numerical blow-up (NaN positions, exploding velocities, constraint solver failure) needs to terminate the episode for *that env only*, without crashing the whole batch.
The rigid solver exposes a per-env errno mask that you fold into the regular termination condition, so divergent envs are reset on the next `reset(self.reset_buf)` call with the same machinery that handles a normal episode end:

```python
# write in place so reset_buf keeps its pre-allocated storage
torch.gt(self.episode_length_buf, self.max_episode_length, out=self.reset_buf)
self.reset_buf |= torch.abs(self.base_euler[:, 1]) > self.cfg["termination_if_pitch_greater_than"]
self.reset_buf |= self.scene.rigid_solver.get_error_envs_mask()
```

## Apply commands

Command application is the other write-side hot path. The zero-copy command writers on a rigid entity are:

| Write | Effect |
|---|---|
| `entity.control_dofs_position(targets)` | PD target positions for the selected DOFs |
| `entity.control_dofs_velocity(targets)` | PD target velocities |
| `entity.control_dofs_force(forces)` | direct generalized forces |
| `entity.set_dofs_stiffness(...)` / `entity.set_dofs_damping(...)` | PD gains |
| `entity.set_dofs_velocity(vel, envs_idx=mask, skip_forward=True)` | direct velocity write |
| `entity.set_qpos(qpos, envs_idx=mask, zero_velocity=..., skip_forward=...)` | direct configuration write |

A few patterns matter when calling them:

- **Match the DOF ordering of your action vector to the entity's internal DOF order**, so you can pass a `slice(start, stop)` rather than an index tensor. Slices are free; index tensors force a gather. The Go2 example precomputes `actions_dof_idx = torch.argsort(self.motors_dof_idx)` so its policy outputs (arranged by joint name) can be permuted into slice-friendly order before the call.
- **Reuse the same target buffer across steps.** Build `target_dof_pos` into a pre-allocated tensor (`torch.empty_like(self.actions)` once at init, `out=` writes thereafter) rather than letting `actions * scale + default` produce a fresh tensor each call.
- **Don't index the action tensor with an index tensor just to skip non-actuated DOFs.** Either include them in the policy output and write a slice, or express `motors_dof_idx` itself as a `slice(...)`.
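
The buffer-reuse and slice-versus-index points above can be sketched with plain tensors. This is an illustrative CPU-only example, not Genesis code: `action_scale`, `default_dof_pos`, the DOF range `2:5`, and the `compute_targets` helper are hypothetical names and values chosen for the demonstration.

```python
import torch

num_envs, num_dofs = 4, 6
action_scale = 0.25  # hypothetical action scale

default_dof_pos = torch.linspace(-0.5, 0.5, num_dofs)  # shape (num_dofs,)
actions = torch.randn(num_envs, num_dofs)

# Allocated once at init; every step writes into this same storage.
target_dof_pos = torch.empty_like(actions)
target_ptr = target_dof_pos.data_ptr()


def compute_targets(actions):
    # Computes default_dof_pos + action_scale * actions and stores the
    # result in the pre-allocated target buffer via out=.
    torch.add(default_dof_pos, actions, alpha=action_scale, out=target_dof_pos)
    return target_dof_pos


targets = compute_targets(actions)

# A slice is a view into the same storage: it starts at element [0, 2].
sliced = targets[:, slice(2, 5)]
slice_is_view = sliced.data_ptr() == targets[0, 2].data_ptr()

# An index tensor gathers into newly allocated storage.
gathered = targets[:, torch.tensor([2, 3, 4])]
gather_is_copy = gathered.data_ptr() != sliced.data_ptr()
```

Both selections hold the same values, but only the slice avoids the extra allocation and gather kernel, which is why arranging actions in the entity's DOF order pays off.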