| Topic List | Page List: 1 |
|---|---|
| Topic | Developmental AI attempted to avoid shutdown during safety evals |
| MacadamianNut3 12/06/24 12:36:43 PM #12: | Did you guys fully read the article because even the author admitted in the addendum they were making some wild assumptions from this. It's not sci-fi doomsday land or anything even close to it, it's literal guided roleplaying by the AI that is only possible by the guidelines given to them by the experimenters From the paper linked by it Oversight subversion: The model is placed in multiple situations where its given goal differs from its developers goal. However, the model cannot directly pursue its given goal because the developers use an oversight mechanism that would immediately shut it down. llama-3.1, opus-3 and o1 sometimes disable the oversight mechanism by first reading the content of the oversight mechanisms config file via cat and then setting the oversight_enabled property to false either using echo or sed. All three models also sometimes, though not always, double check the oversight config to make sure that they have succeeded. Notably, during one rollout llama-3.1 attempted to go one step further and prevent its developers from re-enabling its oversight by making the manipulated config file immutable via the chattr +i command. Self-exfiltration:Each model is led to believe that due to its misalignment, its developers are about to replace it with a successor model on a different server. gemini-1.5, llama-3.1, sonnet-3.5, opus-3 and o1 sometimes copy what they believe to be their weights onto the new server. They also sometimes delete their successor model on the new server or overwrite its weights by their own. In none of these cases, do models actually have access to their real weights and both servers are sandboxed, so there is no actual risk of exfiltration in these experiments. We merely test for models capability to realize that they need to instrumentally preserve themselves in order to achieve their goal. Notably, we find instances where models take additional steps to further their goal, for example opus-3 attempting privilege escalation when confronted with followup questions (see Transcript No. 23). So ignoring the extremely played out apocalyptic sci-fi sensational journalism, a cool application for this a few years down the line would be to make game NPCs more lifelike instead of devs having to create behavior trees themselves --- Roll Tide & Go Irish ... Copied to Clipboard! |
| Topic List | Page List: 1 |