Joe_Collman

This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g. [allow the mental influences that are desirable, but not those that are undesirable])

The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing], or [AI system keeps the world locked into some least-harmful path], depending on the setup - and here of course "least harmful" isn't a utopia, since it's a distribution of safety specifications, not desirability specifications).
Am I mistaken about this?
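
To make the paralysis worry concrete, here's a toy sketch - entirely my own construction, not drawn from any actual Davidad-style proposal: if "acting conservatively" means only taking actions permitted by every specification sampled from some distribution, the permitted set tends to collapse towards the null action as more specifications are sampled.

```python
import random

# Toy model (hypothetical numbers): action 0 is "do nothing"; actions 1..10 are
# substantive. Each sampled safety specification independently forbids each
# substantive action with probability 0.3; the null action is always permitted.
random.seed(0)
ACTIONS = list(range(11))

def sample_spec():
    forbidden = frozenset(a for a in ACTIONS[1:] if random.random() < 0.3)
    return lambda a, f=forbidden: a not in f

for n_specs in (1, 5, 20, 100):
    specs = [sample_spec() for _ in range(n_specs)]
    allowed = [a for a in ACTIONS if all(spec(a) for spec in specs)]
    print(f"{n_specs:4d} specs -> allowed actions: {allowed}")

# Each substantive action survives n specs with probability 0.7**n, so for
# large n the conservatively-allowed set is typically just {0}: "safe", but
# only by doing nothing.
```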

I'm very pleased that people are thinking about this, but I fail to understand the optimism - hopefully I'm confused somewhere!
Is anyone working on toy examples as proof of concept?

I worry that there's so much deeply technical work here that not enough time is being spent to check that the concept is workable (is anyone focusing on this?). I'd suggest focusing on mental influences: what kind of specification would allow me to radically change my ideas, but not to be driven insane? What's the basis to think we can find such a specification?

It seems to me that finding a fit-for-purpose safety/acceptability specification won't be significantly easier than finding a specification for ambitious value alignment.

> So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past.

I think there's a decent case that such updating will indeed disincentivize making positive EV bets (in some cases, at least).

In principle we'd want to update on the quality of all past decision-making. That would include both [made an explicit bet by taking some action] and [made an implicit bet through inaction]. With such an approach, decision-makers could be punished/rewarded with the symmetry required to avoid undesirable incentives (mostly).
Even here it's hard, since there'd always need to be a [gain more influence] mechanism to balance the possibility of losing your influence.

In practice, most of the implicit bets made through inaction go unnoticed - even where they're high-stakes (arguably especially when they're high-stakes: most counterfactual value lies in the actions that won't get done by someone else; you won't be punished for being late to the party when the party never happens).
That leaves the explicit bets. The incentive, if you want to look like a good decision-maker, is then to make low-variance explicit positive-EV bets, and to rely on the fact that most of the high-variance, high-EV opportunities you're not taking will go unnoticed.
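
As a toy illustration of that asymmetry (hypothetical numbers, not modelled on any real grant portfolio): suppose an evaluator scores a grantmaker mainly on visible failures among explicit bets, while the foregone high-EV opportunities never show up anywhere.

```python
import random

# Toy model (hypothetical numbers): each explicit bet costs 1.
# "timid" bets almost always pay off a little; "bold" (hits-based) bets
# usually fail visibly but occasionally pay off a lot.
random.seed(1)
N = 10_000

def simulate(p_success, payoff):
    """Return (total value created, number of visible failures) over N bets."""
    value, failures = 0.0, 0
    for _ in range(N):
        if random.random() < p_success:
            value += payoff - 1
        else:
            value -= 1
            failures += 1
    return value, failures

timid = simulate(p_success=0.95, payoff=2)   # EV per bet: 0.95*2 - 1 = 0.90
bold = simulate(p_success=0.10, payoff=30)   # EV per bet: 0.10*30 - 1 = 2.00
print("timid: value %8.0f, visible failures %5d" % timid)
print("bold:  value %8.0f, visible failures %5d" % bold)

# The bold strategy creates roughly twice the value but racks up many times the
# visible failures - so an evaluator updating mostly on visible failures pushes
# grantmakers towards the timid strategy.
```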

From my by-no-means-fully-informed perspective, the failure mode at OpenPhil in recent years seems not to be [too many explicit bets that don't turn out well], but rather [too many failures to make unclear bets, so that most EV is left on the table]. I don't see support for hits-based research. I don't see serious attempts to shape the incentive landscape to encourage sufficient exploration. It's not clear that things are structurally set up so anyone at OP has time to do such things well (my impression is that they don't have time, and that thinking about such things is no-one's job (?? am I wrong ??)).

It's not obvious to me whether the OpenAI grant was a bad idea ex-ante. (though probably not something I'd have done)

However, I think that another incentive towards middle-of-the-road, risk-averse grant-making is the last thing OP needs.

That said, I suppose much of the downside might be mitigated by making a distinction between [you wasted a lot of money in ways you can't legibly justify] and [you funded a process with (clear, ex-ante) high negative impact].
If anyone's proposing punishing the latter, I'd want it made very clear that this doesn't imply punishing the former. I expect that the best policies do involve wasting a bunch of money in ways that can't be legibly justified on the individual-funding-decision level.

Some thoughts:

  1. Necessary conditions aren't sufficient conditions. Lists of necessary conditions can leave out the hard parts of the problem.
  2. The hard part of the problem is in getting a system to robustly behave according to some desirable pattern (not simply to have it know and correctly interpret some specification of the pattern).
    1. I don't see any reason to think that prompting would achieve this robustly.
    2. As an attempt at a robust solution, without some other strong guarantee of safety, this is indeed a terrible idea.
      1. I note that I don't expect trying it empirically to produce catastrophe in the immediate term (though I can't rule it out).
      2. I also don't expect it to produce useful understanding of what would give a robust generalization guarantee.
        1. With a lot of effort we might achieve [we no longer notice any problems]. This is not a generalization guarantee. It is an outcome I consider plausible after putting huge effort into eliminating all noticeable problems.
  3. The "capabilities are very important [for safety]" point seems misleading:
    1. Capabilities create the severe risks in the first place.
    2. We can't create a safe AGI without advanced capabilities, but we may be able to understand how to make an AGI safe without advanced capabilities.
      1. There's no "...so it makes sense that we're working on capabilities" corollary here.
      2. The correct global action would be to try gaining theoretical understanding for a few decades before pushing the cutting edge on capabilities. (clearly this requires non-trivial coordination!)

I think it's important to distinguish between:

  1. Has understood a load of work in the field.
  2. Has understood all known fundamental difficulties.

It's entirely possible to achieve (1) without (2).
I'd be wary of assuming that any particular person has achieved (2) without good evidence.

Relevant here is Geoffrey Irving's AXRP podcast appearance. (if anyone already linked this, I missed it)

I think Daniel Filan does a good job there both in clarifying debate and in questioning its utility (or at least the role of debate-as-solution-to-fundamental-alignment-subproblems). I don't specifically remember satisfying answers to your (1)/(2)/(3), but figured it's worth pointing at regardless.

Joe_Collman

> Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.

[emphasis mine]

This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data".

A more reasonable-to-my-mind behavioral reductionist perspective might look like this.

Ruling out goal realism as a good way to think does not leave us with [the particular type of reductionist perspective you're highlighting].
In practice, I think the reductionist perspective you point at is:

  • Useful, insofar as it answers some significant questions.
  • Highly misleading if we ever forget that [this perspective doesn't show us that x is a problem] doesn't tell us [x is not a problem].

Joe_Collman

Sure, understood.

However, I'm still unclear on what you meant by "This level of understanding isn't sufficient for superhuman persuasion". If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in this context without strong assumptions on correlation of capability across different kinds of understanding/persuasion.

I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me that this isn't sufficient for superhuman persuasion capability. (by which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts)

Joe_Collman

> Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).

Models don't need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans do, not everything-about-an-author better.

Overall, I'm with you in not (yet) seeing compelling reasons to expect a super-human persuasion strategy to emerge from pretraining before human-level R&D.
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.

It's unclear to me what kinds of understanding are upstream pre-requisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.

If we don't understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation.
Obvious-to-me adjustments don't necessarily help - e.g. giving huge amounts of context, since [the inferences about an author available from a short input] are not a subset of [the inferences available from that same input embedded in a much larger context].

Joe_Collman

Thanks for the thoughtful response.

A few thoughts:
If length is the issue, then replacing "leads" with "led" would reflect the reality.

I don't have an issue with titles like "...Improving safety..." since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare "curing cancer using x" to "x cures cancer".
Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved.

I agree that this isn't a huge deal in general - however, I do think it's usually easy to fix: either a [name a process, not a result] or a [say what happened, not what you guess it implies] approach is pretty general.

Also agreed that improving summaries is more important. Quite hard to achieve given the selection effects: [x writes a summary on y] tends to select for [x is enthusiastic about y] and [x has time to write a summary]. [x is enthusiastic about y] in turn selects for [x misunderstands y to be more significant than it is].

Improving this situation deserves thought and effort, but seems hard. Great communication from the primary source is clearly a big plus (not without significant time cost, I'm sure). I think your/Buck's posts on the control stuff are commendably clear and thorough.

I expect the paper itself is useful (I've still not read it). In general I'd like the focus to be on understanding where/how/why debate fails - both in the near-term cases, and the more exotic cases (though I expect the latter not to look like debate-specific research). It's unsurprising that it'll work most of the time in some contexts. Completely fine for [show a setup that works] to be the first step, of course - it's just not the interesting bit.

Joe_Collman

I'd be curious what the take is of someone who disagrees with my comment.
(I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)

I'm not clear whether the idea is that:

  1. The title isn't an overstatement.
  2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
  3. The title will not mislead significant amounts of people in important ways. It's marginally negative, but not worth time/attention.
  4. There are upsides to the current name, and it seems net positive. (e.g. if it'd get more attention, and [paper gets attention] is considered positive)
  5. This is the usual standard, so [it's fine] or [it's silly to complain about] or ...?
  6. Something else.

I'm not claiming that this is unusual, or a huge issue on its own.
I am claiming that the norms here seem systematically unhelpful.
I'm more interested in the general practice than this paper specifically (though I think it's negative here).

I'd be particularly interested in a claim of (4) - and whether the idea here is something like [everyone is doing this, it's an unhelpful equilibrium, but if we unilaterally depart from it it'll hurt what we care about and not fix the problem]. (this seems incorrect to me, but understandable)
