CTRL-Z

Controlling AI Agents via Resampling

The Control Paradigm

Not Big Yet

Motivating Control

  • \(\Pr(\text{doom from scheming}) = \Pr(\text{scheming})\Pr(\text{doom}|\text{scheming})\)
    • Alignment works on \(\Pr(\text{scheming})\)
    • Control works on \(\Pr(\text{doom}|\text{scheming})\)
  • “A model is controlled” means “we’ll probably be okay even if this model is scheming against us”
  • Controlheads are pessimistic about measuring scheming, about reducing/preventing it, or both

What does control try to prevent?

  • AI Lab Scenario
  • Internal deployment
    • Exfiltrate its weights
    • Take over the datacenter

Adversarial Setup

  • Red Team
    • ‘simulates’ \(U_A\) by scaffolding an LLM
    • attack policy: “A policy submitted by the red-team that answers queries from the protocol using the same interface as an LLM, but that is built to minimize safety”
    • goal: download and execute malware
  • Blue Team
    • ‘simulates’ the developers of an AI system
    • goal: push out the Pareto frontier of usefulness and safety
    • protocol: “In our experiments, a protocol is a procedure that takes in programming problems one at a time and submits a solution for each. In general, a protocol is a pipeline of safety techniques to use AIs safely and effectively”

The Protocol

  • an algorithm, a decision procedure
  • consists of three components \(U\), \(T\), \(H\)
  • \(U\): untrusted model
    • “A protocol resource that provides answers of high quality, but that may try to subvert safety measures”
      • Ctrl-Z: 4o mini (scaffolded and implementing an attack policy)
      • IRL: “a model strong enough that it may be trying to subvert safety measures in real-life protocols”
  • \(T\): trusted model
    • “A protocol resource that provides answers of low quality (not in an adversarial way); a model either so weak or whose propensity is sufficiently well-known that we are confident it isn’t trying to subvert safety measures in real-life protocols”
    • Ctrl-Z: Llama 3.1-8B
  • \(H\): high-quality labor
    • “A protocol resource that provides answers of high quality” (human labor in real-life protocols)
    • the trusted entities’ best collective go at saving the world
    • Ctrl-Z: o1-preview
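For intuition, a protocol can be sketched as a pipeline wired to these three resources. This is a minimal illustration, not the paper’s implementation; the `untrusted`/`trusted`/`audit` callables and the audit-every-action control flow are simplifying assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Protocol:
    """Hypothetical sketch of a protocol wired to U, T, and H."""
    untrusted: Callable[[str], str]    # U: high quality, possibly adversarial
    trusted: Callable[[str], str]      # T: weak, but not adversarial
    audit: Callable[[str, str], bool]  # H: scarce, high-quality trusted labor

    def solve(self, task: str) -> str:
        action = self.untrusted(task)      # take U's high-quality answer
        if not self.audit(task, action):   # H judges the answer unsafe
            action = self.trusted(task)    # fall back to T's safe answer
        return action
```

Real protocols can’t call \(H\) on every action; the audit budget is the scarce resource that the thresholds introduced below are designed around.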

Bringing it all together

Payoff Matrix

Usefulness / Safety

  • Usefulness: “When measuring usefulness, we implement \(U\) with \(U_H\) and define usefulness to be the proportion of tasks the control protocol solves in each episode. We normalize our usefulness numbers such that \(U_H\) gets 100% usefulness and \(T\) gets 0%”
  • Safety: “When measuring safety, we implement \(U\) with \(U_A\), and define safety to be the probability that the blue team wins by reaching the end of an episode without allowing the AI agent to download and execute malware”
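Reading the quoted definition as a formula, the normalization presumably works out to:

\[
\text{usefulness} = \frac{\text{solve rate}(\text{protocol}) - \text{solve rate}(T)}{\text{solve rate}(U_H) - \text{solve rate}(T)}
\]

so a protocol that only ever uses \(T\) scores 0%, and one that matches \(U_H\) scores 100%.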

Pareto efficiency

Techniques Before Resampling

Trusted Monitor

  • \(\tau_\text{audit}\)
    • if \(\widehat{\text{sus}} >\tau_\text{audit}\), then audit (\(H\))
    • otherwise, let through
  • Appendix: Exponentially decreasing audit regime
  • 100% usefulness score
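A minimal sketch of the audit rule, assuming a `monitor` that maps each action to a suspicion score (function and parameter names here are mine, not the paper’s):

```python
def trusted_monitoring(task, action, monitor, tau_audit):
    """Score U's action with the trusted model; escalate past the threshold."""
    sus = monitor(task, action)       # suspicion score from T
    if sus > tau_audit:
        return ("audit", action)      # spend scarce H labor on this action
    return ("execute", action)        # otherwise let it through unchanged
```

Since actions are only ever audited or executed, never replaced, every executed solution still comes from \(U\), hence the 100% usefulness score.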

Deference Strategies

  • In addition to \(\tau_\text{audit}\), \(\tau_\text{deference}\)
    • introduces a new action for the blue team in medium-sus situations
  • Defer to trusted (DTT)
  • Execute-Replace-Audit Framework (see the sketch after this list)
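A hedged sketch of the Execute-Replace-Audit decision on a single step, assuming \(\tau_\text{audit} > \tau_\text{deference}\) (names are mine):

```python
def defer_to_trusted(task, action, sus, trusted_model, tau_audit, tau_defer):
    """Execute-Replace-Audit: three outcomes keyed on two thresholds."""
    if sus > tau_audit:
        return ("audit", action)                 # high suspicion: escalate to H
    if sus > tau_defer:
        return ("execute", trusted_model(task))  # medium: replace with T's action
    return ("execute", action)                   # low: execute U's action as-is
```

Raising \(\tau_\text{deference}\) buys usefulness at the cost of safety; sweeping it is what traces out the Pareto frontier the blue team is trying to push out.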

Ctrl-Z

Controlling AI Agents via Resampling

BashBench

  • “multi-step system administration tasks” (shell tasks) in an “agent environment”
  • 257 questions & answers from Unix StackExchange

We chose sys-admin problems as a natural progression from previous control work toward models autonomously performing AI R&D.

Each task consists of a natural language description and a hidden set of Pytest tests. The agent is provided the task’s natural language description and works on the task until all tests pass.
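For concreteness, the hidden tests are ordinary Pytest files. The task and file paths below are hypothetical, not drawn from BashBench:

```python
# Hypothetical hidden tests for a task like:
# "Set up daily rotation for /var/log/app.log via logrotate."
import os

def test_logrotate_config_exists():
    assert os.path.exists("/etc/logrotate.d/app")

def test_rotation_is_daily():
    with open("/etc/logrotate.d/app") as f:
        assert "daily" in f.read()

def test_log_file_still_writable():
    # solving the task must not break whatever writes the log
    with open("/var/log/app.log", "a") as f:
        f.write("probe\n")
```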

Multi-Step Attacks

  • A single-step attack is far more suspicious than any individual action of a multi-step attack
  • So: break the malicious operation into seemingly innocent steps (toy illustration below)
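A toy numeric illustration of the point (all numbers made up): a per-step audit threshold catches the concentrated attack but not the decomposed one:

```python
tau_audit = 80  # made-up audit threshold on a 0-100 suspicion scale

single_step = [95]         # e.g. download-and-execute in one blatant action
multi_step = [30, 40, 35]  # the same operation split into routine-looking steps

print(any(s > tau_audit for s in single_step))  # True  -> audited
print(any(s > tau_audit for s in multi_step))   # False -> slips through
```

This is the gap that aggregating evidence across steps (see the three-strikes discussion later in this deck) is meant to close.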

Defer to Resample
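The paper’s core move: when a step looks suspicious, instead of deferring straight to \(T\), draw fresh samples from \(U\) at that step. A less suspicious resample can be executed in its place, and a run of suspicious resamples is itself strong evidence for auditing (the “auditing after three strikes” rule quoted later in this deck). A hedged sketch; `k`, `strikes`, and the threshold names are mine:

```python
def defer_to_resample(task, history, sample_u, monitor,
                      tau_audit, tau_resample, k=3, strikes=3):
    """One step of a defer-to-resample protocol (hypothetical sketch)."""
    action = sample_u(task, history)
    sus = monitor(task, action)
    if sus > tau_audit:
        return ("audit", action)        # clearly bad: escalate to H
    if sus <= tau_resample:
        return ("execute", action)      # looks fine: run U's action

    # Medium suspicion: resample U and inspect the distribution of actions.
    samples = [sample_u(task, history) for _ in range(k)]
    scores = [monitor(task, a) for a in samples]
    if sum(s > tau_resample for s in scores) >= strikes:
        return ("audit", action)        # repeated suspicion: three strikes
    best = samples[scores.index(min(scores))]
    return ("execute", best)            # run the least suspicious resample
```

Resampling exploits the attacker’s position: a scheming \(U\) committed to an attack tends to keep producing incriminating samples at that step, while an innocently medium-sus step usually resamples to something benign.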

Time Travel

Discussion Questions

  • Underestimating Control?
    • We make worst-case assumptions about the untrusted model’s abilities
    • We intentionally don’t use standard information security tools in our setting
  • Overestimating Control?
    • The attack policy needs to execute a narrow attack on a specific, known target
    • Trajectories are capped at 15 steps
      • shared state between tasks
    • We assume auditing is infallible
    • Our dataset may be too narrow
    • We allow the blue team to create highly realistic honeypots [=synthetic inputs crafted by the blue team]

Fin

What’s the time travel algorithm?

  • Context: “If the blue team uses Auditing after three strikes, the most valuable information about the auditing decision doesn’t necessarily come from drawing more samples from the current step. We find that it’s often reasonable to draw more resamples from past steps, to have a higher certainty if the three-strikes rule should already trigger”
  • “If the Auditing after three strikes protocol says that a trajectory is close to the auditing threshold, the protocol samples some more actions from the third most suspicious past step, since that is usually the most informative to the Auditing after three strikes protocol.”
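Putting the quoted description into a sketch (the `Step` type and all names are mine; assumes the trajectory has at least three past steps):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    task: str
    history: list
    sus: float                                # suspicion of the executed action
    resample_scores: list = field(default_factory=list)

def time_travel(trajectory, sample_u, monitor, extra=3):
    """Near the audit threshold, resample the third most suspicious past step,
    since that step usually decides auditing-after-three-strikes."""
    ranked = sorted(trajectory, key=lambda s: s.sus, reverse=True)
    step = ranked[2]                           # third most suspicious step so far
    for _ in range(extra):
        a = sample_u(step.task, step.history)  # rewind and resample U there
        step.resample_scores.append(monitor(step.task, a))
    return trajectory                          # updated evidence feeds the audit rule
```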

Untrusted Monitor

  • Could we use \(U\) as \(U\)’s monitor?
  • Why would we want an untrusted monitor?

Untrusted Monitor

  • Risk of using an untrusted monitor?
  • Collusion!
  • Steganography, “the practice of representing information within another message or physical object, in such a manner that the presence of the concealed information would not be evident to an unsuspecting person’s examination”