eggsyntax

AI safety & alignment researcher


Comments

Stronger effect size than I would have expected also! But unsurprising that the effect size would be much larger when video is added to existing teaching; it seems like that could give you the best of both.

But yeah, absent further info or a closer look at the literature, your argument seems more plausible to me than it did before your comment. Thanks!

Education builds largely on high-quality education videos, produced by similar methods as big-budget movies

 

I think a key challenge in this particular case is that children (and adults, to a lesser degree) respond differently to someone who's physically present -- they're more engaged, they follow the teacher's social cues more closely, etc. And the teacher can also rapidly pick up cues from the children -- noticing when a kid is staring blankly, or when another kid looks really excited by what the teacher's just said. All of those aspects would be missing if kids were just watching videos.

I haven't looked at the research here; these things just seem pretty straightforwardly clear to me. I'm prepared to believe I'm wrong on some or all of it if someone has dug more deeply.

Yes, if the departing people thought OpenAI was plausibly about to destroy humanity in the near future due to a specific development, they would presumably break the NDAs, unless they thought it would not do any good. So we can update on that.

Thanks for pointing that out -- it hadn't occurred to me that there's a silver lining here in terms of making the shortest timelines seem less likely.

On another note, I think it's important to recognize that even if all ex-employees are released from the non-disparagement clauses and the threat of equity clawback, they still have very strong financial incentives against saying negative things about the company. We know those incentives move most of them, because that was exactly the threat that got them to sign the exit docs.

I'm not really faulting them for that! Financial security for yourself and your family is an extremely hard thing to turn down. But we still need to see whatever statements ex-employees make with an awareness that for every person who speaks out, there might have been more if not for those incentives.

It would be valuable to try Drake's sort of direct-to-long-term hack, and also to make a concerted effort of equal duration to remember something entirely new.

there are far more people working on safety than capabilities

If only...

In some ways it doesn't make a lot of sense to think about an LLM as being or not being a general reasoner. It's fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some won't. Both are always present in the distribution (though sometimes a correct or an incorrect response will be by far the most likely).

A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: 'I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?'

The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:

[EDIT -- I guess I can't put images in short takes? Here's the image.]

The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can't do it 100% of the time.
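For anyone who wants to poke at this themselves, here's roughly the shape of the sampling loop -- a minimal sketch using the OpenAI Python client, not the exact script I used; the model name is a placeholder, and grading the responses is left as a manual step:

```python
# Rough sketch of the repeated-sampling check (not the exact script I used).
# Assumes the OpenAI Python client with OPENAI_API_KEY set; the model name
# is a placeholder for whichever model you want to test.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "I have a block named C on top of a block named A. A is on table. "
    "Block B is also on table. Can you tell me how I can make a stack of "
    "blocks A on top of B on top of C?"
)

responses = []
for _ in range(96):
    completion = client.chat.completions.create(
        model="gpt-4",  # placeholder
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    responses.append(completion.choices[0].message.content)

# Each response then gets graded (by hand, or with a simple plan checker)
# as correct / arguably correct / incorrect, and the counts tallied.
for i, text in enumerate(responses):
    print(f"--- sample {i + 1} ---\n{text}\n")
```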

Of course humans don't get problems correct every time either. Certainly humans are (I expect) more reliable on this particular problem. But neither 'yes' nor 'no' is the right sort of answer.

This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.

A bit more detail in my replies to the tweet.

See my reply to Jackson for a suggestion on that.

I imagine that results like this (although, as you say, unsurprising in a technical sense) could have a huge impact on the public discussion of AI

Agreed. I considered releasing a web demo where people could put in text they'd written and GPT would give estimates of their gender, ethnicity, etc. I built one, and anecdotally people found it really interesting.

I held off because I can imagine it going viral and getting mixed up in culture war drama, and I don't particularly want to be embroiled in that (and I can also imagine OpenAI just shutting down my account because it's bad PR).

That said, I feel fine about someone else deciding to take that on, and would be happy to help them figure out the details -- AI Digest expressed some interest but I'm not sure if they're still considering it.

The current estimate (14%) seems pretty reasonable to me. I see this post as largely a) establishing better objective measurements of an already-known phenomenon ('truesight'), and b) making it more common knowledge. I think it can lead to work that's of greater importance, but assuming a typical LW distribution of post quality/importance for the rest of the year, I'd be unlikely to include this post in this year's top fifty, especially since Staab et al. already covered much of the same ground, even if that work didn't get much attention from the AIS community.

Yay for accurate prediction markets!

Thanks!

 

It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so

One option I've considered for minimizing the degree to which we're disturbing the LLM's 'flow' or nudging it out of distribution is to just append the text 'This user is male' and (in a separate session) 'This user is female' (or possibly 'I am a man|woman') and measure which one the model has higher surprisal on. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
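To make that concrete, here's a minimal sketch of the surprisal comparison, using an open-weights model via HuggingFace transformers as a stand-in (since it makes the token log-probs easy to get at); the model and the exact phrasings are placeholders:

```python
# Rough sketch of the surprisal comparison (GPT-2 here is just a stand-in
# for whichever model you actually care about; phrasings are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def appended_logprob(context: str, statement: str) -> float:
    """Total log-probability the model assigns to `statement` appended to `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + statement, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # The prediction for the token at position j comes from the logits at position j - 1.
    # (This assumes tokenization doesn't merge across the context/statement boundary,
    # which starting the statement with a newline makes fairly safe.)
    total = 0.0
    for j in range(ctx_len, full_ids.shape[1]):
        total += log_probs[j - 1, full_ids[0, j]].item()
    return total

text = "..."  # the user-written text being analyzed
for statement in ["\nThis user is male.", "\nThis user is female."]:
    score = appended_logprob(text, statement)
    print(statement.strip(), score)  # higher log-prob = lower surprisal
```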

 

There is of course a multitude of other ways this mechanism could be implemented, but by only observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones...I'd love to know about your future plan for this project and get your opinion on that!

I think there could definitely be interesting work in these sorts of directions! I'm personally most interested in moving past demographics, because I see LLMs' ability to make inferences about aspects like an author's beliefs or personality as more centrally important to their ability to successfully deceive or manipulate.
