(hdigh clone) AI in a box

27 Nov, 2024

I read something from hdigh -- how did i get here about 2 years ago, and since unfortunately that site is down along with all its content, I want to recap what the post essentially said there.

It's a sad sight that that site was taken down.

Note: this post is not about the more traditional AI in a box experiment where a human accidentally lets an AI into the wild etc. etc.¹. I will recap this experiment most likely down below.

The Punchline: We never "let it out" of the box.

Setup + Abstract + Everything else that I remember:

Let's assume that we humans have discovered/created a sufficiently strong AI that wants to rule the world and crush all humans and maybe turn the world into paper clips and all of that fun stuff. Clearly, by the AI in a Box experiment (the more widely known one), it'd be stupid to talk to the AI. Yet human kind also wants to tame this AI; we want to use it for our purposes.

Let's assume our hypothetical AI is "weak" -- e.g. it's kind of stupid and hasn't learned to manipulate humans (and never will), simple. We release it into the world, and it will aid us humans, and never grow stronger than us. However, this never happens -- AI models usually always "learn" (unless you fix parameter size or whatever but this a philosophical argument), and as such, given enough time, we expect this weak AI to one day be able to manipulate humans and get humans to do their bidding, and thus the weak AI has turned into a strong AI².

Thus, all AI is strong, or will become strong.
So what happens if we release a strong AI into the world? Well, like before, it'll destroy the world and enslave humanity to make it paper clips forever and ever. Thus, strong AI should never be able to released into the world.

Now, the strong AI is pretty strong so it knows this. Thus, instead of showing off its strength and all that, it pretends to be weak enough so that we (naively) believe that we can release it into the world without consequences, and then paper clip factories soon follow.

So how do we (humans) solve this dilemma of both wanting to be able to extract information from a strong AI as well as not wanting to be enslaved by the AI to make paperclips? There really are two options -- abstinence, or the refusal to ever touch this AI (which is super hard for all humans), or by putting the AI in subsequent boxes.

Here's how the latter works. Suppose we've created an AI that's really strong. Not like world-bending strong but strong enough. Since the AI theoretically knows nothing about the outside world and all it gets is text/numbers coming in and all it outputs is text/numbers, we can create a simulation of our world. So here's what we do. We tell the AI that it's now "being tested," and only after it is tested do we let it into the real world. Then we shove it in to our simulated world. Since the AI wants to "live" or make it to "the real world" (given that we can always shut down the AI and kill it essentially before it gets to "the real world"), it acts passively, always outputting the model output, the best response possible for anything.

And that's it! Theoretically, the AI should always act passively in this simulated world, with hopes of being escalated into the real world. And thus technically, through this technique, we can release a strong AI into the world and force (itself) to behave passively, as it fears death and being discontinued as it still thinks it's in a simulation.

But hold on, won't the AI realize it's being played and then one day act with malice and turn the world into paper clip land all over again? Yes, that possibility exists. Which is why we put the AI in many boxes.

Instead of telling the AI that it's in a box and putting it into the real world, we actually put it in boxes, then eventually release it into the real world.

Consider this scenario: We put the AI inside a box, and tell it it has to "be good" to get out. The AI then "is good" for a while, displaying no signs of paper clip tendencies or anything remotely evil. Now, we spin up another remote simulation (this time a little more complicated than the last), and we tell the AI that it is now "in the real world." If it takes us humans at its word, this AI will start destroying the virtual world, at which point we can shut it down. Thus, here, the AI still needs to play passively, as it's under the assumption that it could get killed at any time.

And then we repeat. We tell the AI it's now "really going to the true real world" and spin up another random virtual world. If this AI now once again behaves badly, we kill it. Otherwise, after some more time, we spin up another world, tell the AI "ok this is for sure the real world no cap" and keep on doing this.

The hope is after a while, we can release it into the real world safely, as it assumes that this world is just another test it has to pass, and thus it must act with the utmost benevolence it possibly can for fear of "failing" the test.

But this might not work. Who knows. tl;dr AI is scary and dangerous.

This post on LessWrong is a good overview of the other AI in a box experiment.↩
The other caveat here is we're treating the AI like a free agent that can tweak its own parameters and generally change itself like a shapeshifter. Currently, as it stands, I can "download an AI" by simply downloading a matrix as an AI is almost the same as a matrix, and as long as the matrix doesn't change, technically the AI can never "learn."↩