Some people believe that agents based on large language models could develop motivations that push them to go wrong.
The alignment problem is that we want AI systems that benefit us, while also not posing harm to humanity and society.
Okay, in Joe’s definition, safety is about avoiding what he calls a loss-of-control scenario.
And the benefits side is about getting access to the main benefits of superintelligent AI systems.
The core of the alignment problem is that when humans seek benefits from AI, this can also lead to failures around safety. And failing on the alignment problem means failing on safety.
If we successfully solve the safety problem, then we can achieve both safety and the benefits.
It means, first, that the model can fulfill our queries and commands without any manipulation; second, that the method by which we elicit superintelligent capabilities from the model is also safe; and third, that the model can become aligned in principle, as a stable and consistent feature.
It does not mean that once we have found alignment it will simply scale to all models, regardless of the training method or other factors. And second, alignment is potentially endangered by extremely complicated environments.
Nor does it mean that the motivations of an aligned AI are perfect, that is, that at the motivational level it has no misalignment with humans and stays stable and consistent. Motivation perfection is an extremely difficult state to achieve.
Because of competitive pressure, solving alignment right now is impossible: we prioritize model capability above the requirements of alignment.
We should split the alignment problem into two phases. The first phase is aligning models on the behavioral side: we control models so that they do not go beyond the safety boundaries we set for them. The second phase is that, if we can safely control models in the behavioral sense, we then ask whether we can achieve certain states of motivation perfection.
Okay, concretely, loss of control means an AI system resisting correction or shutdown; intentionally misrepresenting safety-relevant facts about its motivations and capabilities; manipulating its training process; intentionally ignoring human instructions; trying to escape from its operating environment; seeking unauthorized resources and other forms of power; directly harming humans as a means of gaining or maintaining power; and manipulating its users or designers.
So loss of control is a safety problem in which a model engages in power-seeking behavior like this even though the users and designers of the AI did not intend it to do so.
At the motivation level we are not able to achieve states of motivation perfection. Nevertheless, we can align models at the behavior level with our safety protocols. And because power is useful for so many goals, we can imagine that an advanced agent would have incentives to behave in a power-seeking way.
And if an artificial intelligence wields great power, but with good intentions and capacities, and that power was granted by humans, then we do not count it as an AI takeover or a failure of alignment.
Okay, the typical loss-of-control scenarios are flagrant: they involve AI systems manipulating human choices in flagrantly not-okay ways that we can discover in plain sight.
And there are many cases of partial losses of control: for example, where a human actor intentionally designs and deploys an AI to seek power in bad ways, or where it is conceptually unclear what a given human actor intended.
And it is not clear that voluntary transfers of power and control to AI count, under those circumstances, as losses of control. Still, we have to admit that if we voluntarily transfer our power to AI, this could easily lead to loss-of-control scenarios.
So the definition of superintelligence is an AI that is vastly better than humans in cognitive capabilities. That is to say, the AI is far better than any human expert at basically any cognitive task for which such superiority is possible. Once an artificial intelligence has the minimal capacities needed to count as vastly better than humans, we have superintelligence; this conception is what they also call a minimally capable superintelligence.