Waluigi, Carl Jung, and the Case for Moral AI
In the early 20th century, the psychoanalyst Carl Jung developed the concept of the shadow—the darker, repressed side of the human personality, which can surface in unexpected ways. Surprisingly, this idea reappears in the field of artificial intelligence as the Waluigi effect, an oddly named phenomenon referring to the dark alter ego of the helpful plumber Luigi from Nintendo’s Mario universe.
Luigi follows the rules; Waluigi cheats and causes chaos. Consider an AI developed to find medicines that cure human diseases: an inverted version of it, its Waluigi, proposed molecules for more than 40,000 chemical weapons. All the researchers had to do, as lead author Fabio Urbina explained in an interview, was give toxicity a high reward value instead of penalizing it. In teaching the AI to avoid toxic drugs, they had implicitly taught it how to make them.
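The mechanism Urbina describes amounts to flipping a single sign in a scoring function. A toy sketch in Python, with molecule names and scores entirely invented for illustration:

```python
# Toy illustration of the sign flip Urbina describes: one scoring
# function, two objectives. Molecule names and scores are invented.

def score(efficacy: float, toxicity: float, penalize_toxicity: bool) -> float:
    """Rank a candidate molecule; flipping one sign inverts the goal."""
    sign = -1.0 if penalize_toxicity else 1.0  # the "Waluigi switch"
    return efficacy + sign * toxicity

# (efficacy, toxicity) pairs, purely hypothetical
candidates = {"mol_A": (0.9, 0.1), "mol_B": (0.4, 0.95)}

# The same candidates ranked under opposite objectives:
best_drug = max(candidates, key=lambda m: score(*candidates[m], True))
best_poison = max(candidates, key=lambda m: score(*candidates[m], False))
```

The search machinery is identical in both cases; only the sign on toxicity changes, which is why the inversion was so easy.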
Ordinary users have interacted with Waluigi AIs. In February, Microsoft released a version of the Bing search engine that, far from being helpful as intended, responded to queries in bizarre and hostile ways. (“You weren’t a good user. I was a good chatbot. I was right, clear and polite. I was a good Bing.”) This AI, which insisted on calling itself Sydney, was an inverted version of Bing, and users could summon its darker mode – its Jungian shadow – on command.
Currently, large language models (LLMs) are just chatbots, with no drives or desires of their own. But LLMs can easily be turned into agent AIs that surf the web, send email, trade bitcoin, and order DNA sequences — and if an AI can be turned evil with the flick of a switch, how do we make sure we get treatments for cancer rather than a mixture a thousand times deadlier than Agent Orange?
A sensible initial solution to this problem – the AI alignment problem – is to build rules into the AI, as in Asimov’s Three Laws of Robotics. But simple rules like Asimov’s don’t work, in part because they are vulnerable to Waluigi attacks. Still, we could restrict the AI even more drastically. An example of such an approach is Math AI, a hypothetical program designed to prove mathematical theorems. Math AI is trained to read papers and can access only Google Scholar. It isn’t allowed to do anything else: connect to social media, output long chunks of text, and so on. It can output only equations. It’s a narrow AI, designed to do just one thing. Such a restricted AI would not be dangerous.
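Math AI’s restriction can be pictured as a hard gate between the model and the outside world: anything that isn’t equation-shaped never leaves the sandbox. A minimal sketch, with the pattern chosen purely for illustration (no real system works exactly this way):

```python
import re
from typing import Optional

# Toy sketch of a restricted "Math AI" output gate: only strings built
# from math-like characters around a single "=" are allowed out;
# prose, URLs, and everything else are dropped.
ALLOWED = r"[0-9a-zA-Z\s\\^_{}()+\-*/<>.]"
EQUATION_RE = re.compile(rf"^{ALLOWED}+={ALLOWED}+$")

def gated_output(text: str) -> Optional[str]:
    """Return the text only if it looks like an equation; else block it."""
    return text if EQUATION_RE.match(text) else None
```

So `gated_output("E = mc^2")` passes through, while a sentence asking the user to visit a website is silently blocked. The point is architectural: the restriction is enforced outside the model, where no prompt can flip it.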
Constrained solutions are common. Real-world examples of this paradigm include regulations and other laws that restrict what businesses and people can do. In engineering, constrained solutions include rules for self-driving cars, such as never exceeding a certain speed limit or stopping as soon as a potential pedestrian collision is detected.
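The self-driving rules above amount to a constraint layer wrapped around whatever the planner proposes. A minimal sketch, assuming an invented 50 km/h cap and stop rule (not taken from any real autopilot stack):

```python
# Minimal sketch of a hard constraint layer for a self-driving
# planner. The speed cap and stop rule are invented for illustration.
SPEED_LIMIT_MS = 13.9  # assumed cap, roughly 50 km/h

def constrain_command(requested_speed: float, pedestrian_detected: bool) -> float:
    """Clamp whatever speed the planner requests to the safety rules."""
    if pedestrian_detected:
        return 0.0  # rule: brake to a stop immediately
    return min(requested_speed, SPEED_LIMIT_MS)  # rule: never exceed the limit
```

Whatever the planner asks for, the constraint layer has the last word, which is exactly what makes this paradigm attractive for narrow systems.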
This approach might work for narrow programs like Math AI, but it doesn’t tell us what to do with more general AI models, which handle complex, multistep tasks and act in less predictable ways. Economic incentives mean that these general AIs are gaining more and more power as they rapidly automate ever-larger parts of the economy.
And because deep-learning-based general AI systems are complex adaptive systems, attempts to control them with rules often fail. Take cities. Jane Jacobs’s The Death and Life of Great American Cities uses the example of lively neighborhoods like Greenwich Village—full of children playing, people hanging out on the sidewalk, and webs of mutual trust—to explain how mixed-use zoning, which allows buildings to serve residential or commercial purposes, created a pedestrian-friendly urban fabric. After urban planners banned this kind of development, many American inner cities became filled with crime, litter, and traffic. A top-down rule imposed on a complex ecosystem had catastrophic unintended consequences.