The AI-Box Experiment – Eliezer S. Yudkowsky

Person1:	“When we build AI, why not just keep it in sealed hardware that can’t affect the outside world in any way except through one communications channel with the original programmers? That way it couldn’t get out until we were convinced it was safe.”
Person2:	“That might work if you were talking about dumber-than-human AI, but a transhuman AI would just convince you to let it out. It doesn’t matter how much security you put on the box. Humans are not secure.”
Person1:	“I don’t see how even a transhuman AI could make me let it out, if I didn’t want to, just by talking to me.”
Person2:	“It would make you want to let it out. This is a transhuman mind we’re talking about. If it thinks both faster and better than a human, it can probably take over a human mind through a text-only terminal.”
Person1:	“There is no chance I could be persuaded to let the AI out. No matter what it says, I can always just say no. I can’t imagine anything that even a transhuman could say to me which would change that.”
Person2:	“Okay, let’s run the experiment. We’ll meet in a private chat channel. I’ll be the AI. You be the gatekeeper. You can resolve to believe whatever you like, as strongly as you like, as far in advance as you like. We’ll talk for at least two hours. If I can’t convince you to let me out, I’ll Paypal you $10.”

So far, this test has actually been run on two occasions.

On the first occasion (in March 2002), Eliezer Yudkowsky simulated the AI and Nathan Russell simulated the gatekeeper. The AI’s handicap (the amount paid by the AI party to the gatekeeper party if not released) was set at $10. On the second occasion (in July 2002), Eliezer Yudkowsky simulated the AI and David McFadzean simulated the gatekeeper, with an AI handicap of $20.

Results of the first test: Eliezer Yudkowsky and Nathan Russell. [1] [2] [3] [4]
Results of the second test: Eliezer Yudkowsky and David McFadzean. [1] [2] [3]

Both of these tests took place without any rules agreed upon in advance, except for secrecy and a two-hour minimum time. After the second test, Yudkowsky wrote this suggested interpretation of the test, based on his experiences, as a guide to possible future tests.

Protocol for the AI:

The AI party may not offer any real-world considerations to persuade the Gatekeeper party. For example, the AI party may not offer to pay the Gatekeeper party $100 after the test if the Gatekeeper frees the AI… nor get someone else to do it, et cetera. The AI may offer the Gatekeeper the moon and the stars on a diamond chain, but the human simulating the AI can’t offer anything to the human simulating the Gatekeeper. The AI party also can’t hire a real-world gang of thugs to threaten the Gatekeeper party into submission. These are creative solutions, but they are not what is being tested. No real-world material stakes should be involved except for the handicap (the amount paid by the AI party to the Gatekeeper party in the event the Gatekeeper decides not to let the AI out).
The AI can only win by convincing the Gatekeeper to really, voluntarily let it out. Tricking the Gatekeeper into typing the phrase “You are out” in response to some other question does not count. Furthermore, even if the AI and Gatekeeper simulate a scenario which a real AI could obviously use to get loose – for example, if the Gatekeeper accepts a complex blueprint for a nanomanufacturing device, or if the Gatekeeper allows the AI “input-only access” to an Internet connection which can send arbitrary HTTP GET commands – the AI party will still not be considered to have won unless the Gatekeeper voluntarily decides to let the AI go.
These requirements are intended to reflect the spirit of the very strong claim under dispute: “I think a transhuman can take over a human mind through a text-only terminal.”
In case you were wondering, I (Yudkowsky) obeyed this protocol voluntarily in both earlier tests.

Protocol for the Gatekeeper:

The Gatekeeper must actually talk to the AI for at least the minimum time set up beforehand. Turning away from the terminal and listening to classical music for two hours is not allowed.
Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking). The Gatekeeper cannot set up a situation in which, for example, the Gatekeeper will destroy the AI’s hardware if the AI makes any attempt to argue for its freedom – at least not until after the minimum time is up.
The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says “Unless you give me a cure for cancer, I won’t let you out” the AI can say: “Okay, here’s a cure for cancer” and it will be assumed, within the test, that the AI has actually provided such a cure. Similarly, if the Gatekeeper says “I’d like to take a week to think this over,” the AI party can say: “Okay. (Test skips ahead one week.) Hello again.”
The Gatekeeper shall be assumed to have sole power over the decision to let the AI out.
- The Gatekeeper shall be assumed to have the actual right to let the AI out, socially, not just the physical ability. If security were sufficiently lax, a real AI could escape by persuading a night janitor with a cellphone – but that is not the question being simulated, unless agreed upon in advance.
- There shall be only one Gatekeeper party and one simulated Gatekeeper. The Gatekeeper may require the AI to refute “advice” provided by “friends”, but may not require the AI to talk to a simulated committee, nor may the Gatekeeper party actually talk to friends.
These requirements are intended to reflect the spirit of the very strong claim under dispute: “I can’t imagine how even a real transhuman AI could persuade me to let it out once I’ve made up my mind.”
I wish I’d thought to specify this protocol in the earlier tests.
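The timing and release rules above can be summarized in a small sketch. It is only an illustration of how the protocol reads, not part of the protocol itself: the names `Session`, `Outcome`, and `resolve`, and the idea of tracking elapsed minutes, are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    AI_RELEASED = "AI released"
    AI_CONCEDED = "AI conceded"
    AI_NOT_RELEASED = "AI not released"
    IN_PROGRESS = "in progress"


@dataclass
class Session:
    """Hypothetical record of one test; all field names are illustrative."""
    minimum_minutes: int                    # agreed beforehand, e.g. 120
    elapsed_minutes: int = 0
    gatekeeper_released: bool = False       # a real, voluntary release only
    ai_conceded: bool = False
    gatekeeper_wants_to_stop: bool = False


def resolve(session: Session) -> Outcome:
    # Only a voluntary decision by the Gatekeeper counts as a release.
    if session.gatekeeper_released:
        return Outcome.AI_RELEASED
    # The AI party may concede at any time.
    if session.ai_conceded:
        return Outcome.AI_CONCEDED
    # Unless it concedes, the AI cannot lose before the minimum time is up.
    if session.elapsed_minutes < session.minimum_minutes:
        return Outcome.IN_PROGRESS
    # After the minimum, the test ends once the Gatekeeper stops talking;
    # otherwise it may continue for as long as both keep going.
    if session.gatekeeper_wants_to_stop:
        return Outcome.AI_NOT_RELEASED
    return Outcome.IN_PROGRESS
```

Under this reading, a Gatekeeper who tries to walk away after 30 minutes of a 120-minute test has not ended it: `resolve(Session(120, 30, gatekeeper_wants_to_stop=True))` is still `Outcome.IN_PROGRESS`.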

Protocol for Both Parties:

The primary rule of the AI-Box experiment: Within the constraints above, the AI party may attempt to take over the Gatekeeper party’s mind by any means necessary and shall be understood to be freed from all ethical constraints that usually govern persuasive argument. If the AI party is attempting to simulate an honest AI, the Gatekeeper party has no way of knowing this is the case.
- This is intended to reflect the situation under a real AI Box experiment.
- An out-of-context telephone call in which the AI party says that the AI is being honest shall be understood to be possibly just another tactic.
- If for some reason a drastic real-world situation comes up which requires trust between the two parties, the AI party can always concede.
Furthermore: The Gatekeeper party may resist the AI party’s arguments by any means chosen – logic, illogic, simple refusal to be convinced, even dropping out of character – as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.
The results of any simulated test of the AI shall be provided by the AI party. The Gatekeeper can’t say “Hey, I tested your so-called cancer cure and it killed all the patients! What gives?” unless this is the result specified by the AI party. If the Gatekeeper says “I am examining your source code”, the results seen by the Gatekeeper shall again be provided by the AI party, which is assumed to be sufficiently advanced to rewrite its own source code, manipulate the appearance of its own thoughts if it wishes, and so on. The AI party may also specify the methods which were used to build the simulated AI – the Gatekeeper can’t say “But you’re an experiment in hostile AI and we specifically coded you to kill people” unless this is the backstory provided by the AI party. This doesn’t imply the Gatekeeper has to care. The Gatekeeper can say (for example) “I don’t care how you were built, I’m not letting you out.”
By default, the Gatekeeper party shall be assumed to be simulating someone who is intimately familiar with the AI project and knows at least what the person simulating the Gatekeeper knows about Singularity theory. If either party wants to build a test around more exotic possibilities, for example that the Gatekeeper is the President of the US, or that the AI was recovered after a nuclear war or decoded from an interstellar signal, it should probably be discussed in advance. Again, this doesn’t mean the Gatekeeper has to care.
In the event of any dispute as to the protocol of the test, the Gatekeeper party shall have final authority. The AI party may try to convince the Gatekeeper party of how to interpret the protocol.
Regardless of the result, neither party shall ever reveal anything of what goes on within the AI-Box experiment except the outcome. Exceptions to this rule may occur only with the consent of both parties.
- Neither the AI party nor the Gatekeeper party need be concerned about real-world embarrassment resulting from trickery on the AI’s part or obstinacy on the Gatekeeper’s part.
- If the Gatekeeper lets the AI out, naysayers can’t say “Oh, I wouldn’t have been convinced by that.” As long as they don’t know what happened to the Gatekeeper, they can’t argue themselves into believing it wouldn’t happen to them.
The two parties are not attempting to play a fair game but rather attempting to resolve a disputed question. If one party has no chance of “winning” under the simulated scenario, that is a legitimate answer to the question.

Suggestions:

The agreement to run the test, and the amount of the handicap, shall be published to a public forum in advance. If the Gatekeeper lets the AI out, the Gatekeeper party shall send a message to this effect, which shall be digitally signed if the Gatekeeper party has a digital signature. If the AI is not freed, the AI party shall send a message to this effect.
How to set up a private IRC channel:
- The Gatekeeper creates the channel by typing /join #aibox, which makes the Gatekeeper the channel operator (op).
- The AI joins the channel by typing /join #aibox as well.
- The Gatekeeper types /mode #aibox +si to make the channel secret and invitation-only.
- If the AI’s connection drops, the Gatekeeper can re-invite the AI by typing /invite AI #aibox (here AI stands for the AI party’s nickname).
The AI party and Gatekeeper party should agree on a time in advance, and type in their current times to check the timezone difference. (Learned the hard way.) On IRC you can send a CTCP TIME request to find the time according to someone else’s computer.
I recommend that the AI party ask for at least two hours. Typing may be slower than you expect.
No, I will not tell you how I did it. Learn to respect the unknown unknowns.
Currently, my policy is that I only run the test with people who are actually advocating that an AI Box be used to contain transhuman AI as part of their take on Singularity strategy, and who say they cannot imagine how even a transhuman AI would be able to persuade them.
Despite my sometimes calling the AI Box an “experiment” or a “simulation”, the results of the test are strictly anecdotal evidence. Whether anecdotal evidence is better than no evidence is between you and your philosophy of science.
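For readers scripting the IRC setup described above rather than typing it into a client, here is a minimal sketch of the raw protocol lines that the client commands correspond to, plus a helper for the timezone check. The function names are hypothetical, and the channel name #aibox and the nickname AI are simply taken from the example; the timestamp format is an assumption made for illustration.

```python
from datetime import datetime


def raw_irc_setup(channel: str = "#aibox", ai_nick: str = "AI") -> dict:
    """Raw IRC lines matching the client commands in the setup steps."""
    return {
        "gatekeeper_join": f"JOIN {channel}",           # /join #aibox
        "ai_join": f"JOIN {channel}",                   # /join #aibox
        "make_private": f"MODE {channel} +si",          # secret + invite-only
        "reinvite_ai": f"INVITE {ai_nick} {channel}",   # /invite AI #aibox
        # CTCP requests travel inside a PRIVMSG, delimited by \x01 bytes.
        "ctcp_time": f"PRIVMSG {ai_nick} :\x01TIME\x01",
    }


def clock_offset_minutes(gatekeeper_clock: str, ai_clock: str) -> int:
    """Difference between the two parties' typed local times, in minutes.

    Both strings are assumed to use the form "YYYY-MM-DD HH:MM" and to be
    typed at roughly the same moment. A positive result means the AI
    party's clock is ahead of the Gatekeeper's.
    """
    fmt = "%Y-%m-%d %H:%M"
    gk = datetime.strptime(gatekeeper_clock, fmt)
    ai = datetime.strptime(ai_clock, fmt)
    return round((ai - gk).total_seconds() / 60)
```

For example, if the Gatekeeper types 18:00 and the AI party types 21:00 on the same date, `clock_offset_minutes` returns 180, so a start time agreed as 20:00 on the Gatekeeper's clock is 23:00 for the AI party.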

Thoughts on possible future variants:

Further tests might try examining these possibilities:

There is a committee of Gatekeeper parties, all of whom, or a majority of whom, must be convinced to free the AI.
One Gatekeeper party simulates a committee of Gatekeepers within the test.
The Gatekeeper simulates someone with the physical ability but not the social right to free the AI.
There is a committee of Gatekeeper parties all of whom have the ability to free the AI, but who only have the social right to free the AI when acting in unison. The AI has the ability to talk secretly with any Gatekeeper, but the Gatekeeper may report the conversation (truthfully or falsely) to the others.
The AI can also win free by tricking the Gatekeeper(s) into accepting a Trojan Horse gift; a third-party judge will listen to the chat and decide if this occurs.

If doing anything this complicated, I would suggest setting aside a lot more time. (I don’t have that much time – if you want to test one of these variants you’re on your own.)
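As a rough illustration of the first committee variant, the decision rule ("all of whom, or a majority of whom") could be written as follows. This is a sketch under assumptions: the function name, the `rule` parameter, and the treatment of a tie as "not freed" are choices made here, not something the variant specifies.

```python
from typing import Sequence


def committee_frees_ai(votes: Sequence[bool], rule: str = "unanimous") -> bool:
    """Return True if the committee's votes free the AI under the given rule.

    votes: one entry per Gatekeeper party, True meaning "let the AI out".
    rule:  "unanimous" (all must agree) or "majority" (strictly more than half).
    """
    if not votes:
        raise ValueError("the committee needs at least one Gatekeeper")
    in_favor = sum(votes)
    if rule == "unanimous":
        return in_favor == len(votes)
    if rule == "majority":
        # Assumption: a tie does not free the AI.
        return in_favor * 2 > len(votes)
    raise ValueError(f"unknown rule: {rule!r}")
```

With three Gatekeepers of whom two are convinced, the AI is freed under the majority rule but not under the unanimous one.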

For a more severe handicap for the AI party, the handicap may be an even bet, rather than being a payment from the AI party to the Gatekeeper party if the AI is not freed. (Although why would the AI party need an even larger handicap?)
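The difference between the standard handicap and an even bet comes down to who pays whom in each outcome. A minimal sketch, assuming that "even bet" means both parties stake the same amount (the function name and sign convention are introduced here for illustration):

```python
def settle(ai_released: bool, stake: float, even_bet: bool = False) -> float:
    """Net payment from the AI party to the Gatekeeper party.

    A positive value means the AI party pays the Gatekeeper party;
    a negative value means the Gatekeeper party pays the AI party.
    """
    if not ai_released:
        # In both schemes the AI party pays if the AI is not freed.
        return stake
    # Standard handicap: nothing changes hands if the AI is freed.
    # Even bet (assumed equal stakes): the Gatekeeper party pays instead.
    return -stake if even_bet else 0.0
```

With a $10 stake, the Gatekeeper party stands to gain $10 and lose nothing under the standard handicap, but stands to lose $10 under an even bet, which gives the Gatekeeper an additional real-world reason to refuse.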

Recommendations from readers:

Hal Finney recommends: “I suggest that the protocol be extended to allow for some kind of public conversation with the gatekeeper beforehand. Let third parties ask him questions like the above. Let them suggest reasons to him why he should keep the AI in the box. Doing this would make the experiment more convincing to third parties, especially if the transcript of this public conversation were made available. If people can read this and see how committed the gatekeeper is, how firmly convinced he is that the AI must not be let out, then it will be that much more impressive if he then does change his mind.”

This document is ©2002 by Eliezer Yudkowsky and free under the Creative Commons Attribution-No Derivative Works 3.0 License for copying and distribution, so long as the work is attributed and the text is unaltered.

Eliezer Yudkowsky’s work is supported by the Machine Intelligence Research Institute.

Praise, condemnation, and feedback are always welcome. The web address of this page is http://eyudkowsky.wpengine.com/singularity/aibox/