InterPReT: Interactive Policy Restructuring and Training Enable Effective Imitation Learning from Laypersons

Gavin Zhu, Jean Oh, Reid Simmons — Carnegie Mellon University

Overview figure of InterPReT

InterPReT lets non-technical users teach control policies through a multi-turn loop: provide instructions, provide demonstrations, inspect agent behavior, and refine the next round.
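As a rough illustration of this loop, the sketch below pairs a "restructure" step driven by a language instruction with a "train" step that fits the restructured policy to demonstrations. Everything here is hypothetical: the feature names, the keyword-matching restructurer, and the linear least-squares policy are stand-ins for the paper's actual LLM-based restructuring and training pipeline.

```python
import numpy as np

# Hypothetical feature primitives that a user instruction might activate.
# A state is a tuple (distance_to_center, heading_error, speed).
FEATURES = {
    "distance_to_center": lambda s: s[0],
    "heading_error":      lambda s: s[1],
    "speed":              lambda s: s[2],
}

def restructure(active, instruction):
    """Add any feature mentioned in the instruction to the policy structure."""
    for name in FEATURES:
        if name.replace("_", " ") in instruction and name not in active:
            active.append(name)
    return active

def train(active, states, actions):
    """Fit linear weights over the active features to the demonstrations."""
    X = np.array([[FEATURES[f](s) for f in active] for s in states])
    w, *_ = np.linalg.lstsq(X, np.array(actions), rcond=None)
    return w

def act(active, w, state):
    """Policy output (e.g., a steering command) for one state."""
    x = np.array([FEATURES[f](state) for f in active])
    return float(x @ w)

# One teaching round: instruction, then demonstrations, then retraining.
active = ["distance_to_center"]
states  = [(0.5, 0.1, 1.0), (-0.3, -0.2, 1.2), (0.1, 0.4, 0.8)]
actions = [-0.6, 0.5, -0.5]  # demonstrated steering commands

active = restructure(active, "also consider the heading error")
w = train(active, states, actions)
```

After retraining, the user would inspect `act(active, w, state)` on test runs and decide whether to give another instruction or more demonstrations in the next round.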

Abstract

Imitation learning has shown success in many tasks by learning from expert demonstrations. However, most existing work relies on large-scale demonstrations from technical professionals and close monitoring of the training process. Both are challenging for laypersons who want to teach an agent new skills. To lower the barrier to teaching AI agents, we propose Interactive Policy Restructuring and Training (InterPReT), which takes user instructions to continually update the policy structure and optimizes its parameters to fit user demonstrations. This enables end-users to interactively give instructions and demonstrations, monitor the agent's performance, and review the agent's decision-making strategies. A user study (N=34) on teaching an AI agent to drive in a racing game confirms that, compared to a generic imitation learning baseline, our approach yields more robust policies without impairing system usability when a layperson is responsible for both giving demonstrations and deciding when to stop. This shows that our method is better suited for end-users without much technical background in machine learning to train a dependable policy.

Walkthrough: Multi-turn Teaching Session

Study Setup

We conducted an in-person between-subjects user study (N=34) on a car-racing teaching task. Participants used a gamepad interface to provide demonstrations, and the experimental group additionally provided language instructions for restructuring.

Participants first practiced controlling the car, then repeatedly taught and evaluated policies. They could choose starting conditions, provide multiple demonstrations, retrain, and test iteratively before submitting a final policy.

Interface screenshot from appendix

Appendix interface screenshot from the paper source.

Findings (Hypotheses Matched with Figures)

Additional Analysis

Related Work

[IJCAI 25] Sample-Efficient Behavior Cloning Using General Domain Knowledge

Acknowledgements

We would like to thank Shridhula Srinivasan and Justin Ma for their help in prototyping the user interface, experimenting with prompt engineering, and running some pilot user studies. Additional thanks to Justin for running some final studies.

This research has been partially supported by Microsoft Corporation as part of the Keio-CMU partnership. Feiyu is also supported by the SoftBank Group - Arm PhD Fellowship.

This webpage was created by Copilot with GPT-5.3-Codex.

BibTeX

@inproceedings{10.1145/3757279.3785549,
author = {Zhu, Feiyu Gavin and Oh, Jean and Simmons, Reid},
title = {InterPReT: Interactive Policy Restructuring and Training Enable Effective Imitation Learning from Laypersons},
year = {2026},
isbn = {9798400721281},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3757279.3785549},
doi = {10.1145/3757279.3785549},
booktitle = {Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction},
pages = {864–873},
numpages = {10},
keywords = {Adaptable Policy Structure, Interactive Learning, Learning from Demonstrations},
location = {Edinburgh, Scotland, UK},
series = {HRI '26}
}