A quick take on alignment targets, adapted from my LessWrong quick take


tldr: I'm a little confused about what labs are aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.


I think we could be very close to AGI, and it's important that whoever builds AGI thinks carefully about which properties to target when trying to create a system that is both useful and as likely as possible to be safe.


It seems that right now, most labs are targeting something like a slightly modified, more harmless version of human values, perhaps something CEV-like (coherent extrapolated volition). However, some alignment targets may be easier to hit than others. It may turn out that it is hard to instill a CEV-like target into an AGI, while it is easier to ensure properties like corrigibility or truthfulness.


One intuition for why this may be true: if you took OpenAI's weak-to-strong generalization setup and tried eliciting capabilities corresponding to different alignment targets (standard reward modeling is probably a decent analogy for Anthropic's current plan, but one could also try this with truthfulness or corrigibility), I suspect you would find that a capability like truthfulness is more natural than reward modeling and can be elicited more easily. Truthfulness may also have lower algorithmic complexity than human values.
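
As a rough illustration of the kind of comparison I have in mind, here is a minimal Python sketch of how one might score different alignment targets in a weak-to-strong setup using the performance-gap-recovered (PGR) metric from the weak-to-strong generalization paper. The task names and all numbers below are hypothetical placeholders, not real results; an actual experiment would fine-tune a strong model on weak-model labels for each target and measure held-out accuracy.

```python
# Hypothetical sketch: compare how much of the weak-to-strong gap is recovered
# for different candidate alignment targets. Numbers are made up for illustration.

def performance_gap_recovered(weak_acc: float, w2s_acc: float, ceiling_acc: float) -> float:
    """PGR: fraction of the gap between the weak supervisor's accuracy and the
    strong model's ceiling accuracy that is recovered when the strong model is
    trained only on weak labels."""
    return (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)

# Placeholder results for two candidate alignment targets.
results = {
    "truthfulness":    {"weak": 0.70, "weak_to_strong": 0.86, "ceiling": 0.90},
    "reward_modeling": {"weak": 0.65, "weak_to_strong": 0.72, "ceiling": 0.85},
}

for task, r in results.items():
    pgr = performance_gap_recovered(r["weak"], r["weak_to_strong"], r["ceiling"])
    print(f"{task}: PGR = {pgr:.2f}")
```

If truthfulness consistently recovered more of the weak-to-strong gap than reward modeling did, that would be some (weak) evidence that it is the more natural target to elicit.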


There is an inherent tradeoff between harmlessness and usefulness. Similarly, there are inherent tradeoffs between harmlessness and corrigibility, and between harmlessness and truthfulness (the Alignment Faking paper provides strong empirical evidence for the latter two, even setting aside theoretical arguments).


As seen in the Alignment Faking paper, Claude seems to align pretty well with human values and to be relatively harmless. As a tradeoff, however, it does not seem to be very corrigible or truthful.


Some people I've talked to seem to think that Anthropic does treat corrigibility as one of the main pillars of its alignment plan. If that's the case, maybe they should make their current AIs more corrigible, so that their safety testing is performed on AIs that resemble their first AGI. Or, if they haven't really thought about this question (or if individuals have thought about it, but never cohesively as an organization), they should probably consider it.
My guess is that there are designated people at Anthropic thinking about which values are important to instill, but that they are thinking about this more from a societal perspective than from an alignment perspective.


Mostly, I want to avoid a scenario where labs do the default thing and don't consider tough, high-level strategy questions until the last minute. I also think it would be valuable to do concrete empirical research now that lines up well with what we expect to see later.