Bogdan Ionut Cirstea

Comments

Turns out someone has already done a similar experiment (vector arithmetic in neural space; latent traversals too) in a restricted domain (face processing) with another model class (a GAN), and it seemed to work: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012058 https://github.com/neuralcodinglab/brain2gan/blob/main/figs_manuscript/Fig12.png https://openreview.net/pdf?id=hT1S68yza7
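For concreteness, here's a minimal sketch of the kind of latent vector arithmetic and latent traversal involved (the generator, latents, and labels below are placeholders, not the actual models or data from the linked work):

```python
import torch

# Hypothetical pretrained GAN generator: maps latents z (shape [n, latent_dim]) to images.
latent_dim = 512
G = torch.nn.Sequential(torch.nn.Linear(latent_dim, 3 * 64 * 64))  # stand-in for a real generator

# Suppose we have latents whose generated faces were labeled "smiling" vs "neutral".
z_smiling = torch.randn(100, latent_dim)   # placeholder latents
z_neutral = torch.randn(100, latent_dim)

# Vector arithmetic: treat the difference of means as a "smile" direction.
smile_direction = z_smiling.mean(dim=0) - z_neutral.mean(dim=0)

# Latent traversal: move a new latent along the direction and decode at each step.
z = torch.randn(1, latent_dim)
for alpha in torch.linspace(-3.0, 3.0, steps=7):
    image = G(z + alpha * smile_direction)  # the decoded image should vary along the concept
```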

I also wonder how much interpretability LM agents might help here, e.g. since they could make it much cheaper to scale the 'search' to many different kinds of undesirable behaviors.

If this is true, then we should be able to achieve quite a high level of control and understanding of NNs solely by straightforward linear methods and interventions. This would mean that deep networks might end up being pretty understandable and controllable artefacts in the near future. At this moment, we simply have not yet found the right levers (or rather, lots of existing work does show this but hasn't really been normalized or applied at scale for alignment). Linear-ish network representations are a best-case scenario for both interpretability and control.
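To illustrate what 'straightforward linear methods and interventions' could look like in practice, here's a minimal activation-steering sketch (the activations, hook point, and scaling factor are hypothetical placeholders; real implementations hook a specific residual-stream layer of a language model):

```python
import torch

# Stand-in "layer activations": in practice these would be residual-stream activations
# collected from a language model on two contrastive prompt sets.
d_model = 1024
acts_with_feature = torch.randn(64, d_model) + 2.0   # e.g. prompts expressing the concept
acts_without_feature = torch.randn(64, d_model)      # matched prompts without it

# Difference of means gives a candidate linear direction for the concept.
direction = acts_with_feature.mean(0) - acts_without_feature.mean(0)
direction = direction / direction.norm()

def steer(residual_stream: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add the concept direction to the residual stream (an 'activation addition')."""
    return residual_stream + alpha * direction

# In a real setup this function would be registered as a forward hook on a chosen layer.
```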

For a mechanistic, circuits-level understanding, there is still the problem of superposition of the linear representations. However, if the representations are indeed mostly linear, then once superposition is solved there seem to be few other obstacles in the way of a complete mechanistic understanding of the network. Moreover, superposition is not even a problem for black-box linear methods for controlling and manipulating features, where the optimiser handles the superposition for you.
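As a sketch of the 'optimiser handles the superposition for you' point: rather than disentangling features first, one can optimize a steering vector directly against a downstream behavioral objective (everything below, including the toy 'behavior score', is a stand-in):

```python
import torch

d_model = 1024
steering_vector = torch.zeros(d_model, requires_grad=True)
optimizer = torch.optim.Adam([steering_vector], lr=1e-2)

def behavior_score(residual_stream: torch.Tensor) -> torch.Tensor:
    # Placeholder for a differentiable measure of the target behavior,
    # e.g. the log-probability a model assigns to a desired continuation.
    target = torch.ones_like(residual_stream)
    return -((residual_stream - target) ** 2).mean()

for _ in range(100):
    residual_stream = torch.randn(8, d_model)          # stand-in activations
    steered = residual_stream + steering_vector        # linear intervention
    loss = -behavior_score(steered)                    # maximize the behavior score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The optimizer finds whatever (possibly superposed) direction moves the behavior,
# without requiring the underlying features to be disentangled first.
```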

Here's a potential operationalization / formalization of why the linear representation hypothesis seems to imply that finding and using the directions might be easy-ish (and significantly easier than full reverse-engineering / enumerative interp). From Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models (with apologies for the poor formatting):

'We focus on the goal of learning identifiable human-interpretable concepts from complex high-dimensional data. Specifically, we build a theory of what concepts mean for complex high-dimensional data and then study under what conditions such concepts are identifiable, i.e., when can they be unambiguously recovered from data. To formally define concepts, we leverage extensive empirical evidence in the foundation model literature that surprisingly shows that, across multiple domains, human-interpretable concepts are often linearly encoded in the latent space of such models (see Section 2), e.g., the sentiment of a sentence is linearly represented in the activation space of large language models [96]. Motivated by this rich empirical literature, we formally define concepts as affine subspaces of some underlying representation space. Then we connect it to causal representation learning by proving strong identifiability theorems for only desired concepts rather than all possible concepts present in the true generative model. Therefore, in this work we tread the fine line between the rigorous principles of causal representation learning and the empirical capabilities of foundation models, effectively showing how causal representation learning ideas can be applied to foundation models.


Let us be more concrete. For observed data X that has an underlying representation Z_u with X = f_u(Z_u) for an arbitrary distribution on Z_u and a (potentially complicated) nonlinear underlying mixing map f_u, we define concepts as affine subspaces A Z_u = b of the latent space of Z_u's, i.e., all observations falling under a concept satisfy an equation of this form. Since concepts are not precise and can be fuzzy or continuous, we will allow for some noise in this formulation by working with the notion of concept conditional distributions (Definition 3). Of course, in general, f_u and Z_u are very high-dimensional and complex, as they can be used to represent arbitrary concepts. Instead of ambitiously attempting to reconstruct f_u and Z_u as CRL [causal representation learning] would do, we go for a more relaxed notion where we attempt to learn a minimal representation that represents only the subset of concepts we care about; i.e., a simpler decoder f and representation Z—different from f_u and Z_u—such that Z linearly captures a subset of relevant concepts as well as a valid representation X = f(Z). With this novel formulation, we formally prove that concept learning is identifiable up to simple linear transformations (the linear transformation ambiguity is unavoidable and ubiquitous in CRL). This relaxes the goals of CRL to only learn relevant representations and not necessarily learn the full underlying model. It further suggests that foundation models do in essence learn such relaxed representations, partially explaining their superior performance for various downstream tasks.
Apart from the above conceptual contribution, we also show that to learn n (atomic) concepts, we only require n + 2 environments under mild assumptions. Contrast this with the adage in CRL [41, 11] where we require dim(Z_u) environments for most identifiability guarantees, where as described above we typically have dim(Z_u) ≫ n + 2.'


'The punchline is that when we have rich datasets, i.e., sufficiently rich concept conditional datasets, then we can recover the concepts. Importantly, we only require a number of datasets that depends only on the number of atoms n we wish to learn (in fact, O(n) datasets), and not on the underlying latent dimension d_z of the true generative process. This is a significant departure from most works on causal representation learning, since the true underlying generative process could have d_z = 1000, say, whereas we may be interested to learn only n = 5 concepts, say. In this case, causal representation learning necessitates at least ∼ 1000 datasets, whereas we show that ∼ n + 2 = 7 datasets are enough if we only want to learn the n atomic concepts.'
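A compact restatement of the quoted setup, in my own notation (a paraphrase, not a verbatim result from the paper):

```latex
% Underlying generative model, with concepts as affine subspaces of the latent space:
\[
  X = f_u(Z_u), \qquad \text{concept } i:\; A_i Z_u = b_i .
\]
% Relaxed goal: learn a simpler decoder f and representation Z such that
\[
  X = f(Z), \qquad \text{with each relevant concept affinely encoded in } Z,
\]
% identifiable only up to a linear transformation Z \mapsto L Z + c (unavoidable in CRL).
% Sample complexity: identifying n atomic concepts needs roughly n + 2 concept-conditional
% environments, versus \dim(Z_u) environments for full CRL, where typically \dim(Z_u) \gg n + 2.
```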

More reasons to think something like the above should work: High-resolution image reconstruction with latent diffusion models from human brain activity literally steers diffusion models using linearly-decoded fMRI signals (see fig. 2); and linear encoding (the inverse of decoding) from the text latents to fMRI also works well (see fig. 6; and similar results in Natural language supervision with a large and diverse dataset builds better models of human high-level visual cortex, e.g. fig. 2). Furthermore, they use the same model (Stable Diffusion with CLIP) used in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both provides theory for and empirically demonstrates activation engineering-style linear manipulations. All this suggests that manipulations like those in Concept Algebra for (Score-Based) Text-Controlled Generative Models would also work when applied directly to the fMRI representations used to decode the text latents c in High-resolution image reconstruction with latent diffusion models from human brain activity.
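A minimal sketch of the kind of linear decoding/encoding involved (ridge regression between voxel responses and CLIP text latents; the arrays, dimensions, and regularization below are placeholders, not the actual setup from the paper):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder data: voxel responses and the corresponding CLIP text latents c
# for the images each subject viewed.
n_trials, n_voxels, clip_dim = 1000, 5000, 768
fmri = np.random.randn(n_trials, n_voxels)
clip_text_latents = np.random.randn(n_trials, clip_dim)

# Linear decoding: fMRI -> text latents (the fig. 2 direction).
decoder = Ridge(alpha=100.0).fit(fmri[:800], clip_text_latents[:800])
decoded_c = decoder.predict(fmri[800:])     # these latents would then condition Stable Diffusion

# Linear encoding: text latents -> fMRI (the fig. 6 direction).
encoder = Ridge(alpha=100.0).fit(clip_text_latents[:800], fmri[:800])
predicted_fmri = encoder.predict(clip_text_latents[800:])
```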

For the pretraining-finetuning paradigm, this link is now made much more explicit in Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm, which also links it to model ensembling through logit averaging.
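A sketch of the two operations being linked (weight-space interpolation of two models fine-tuned from a shared initialization vs. averaging their logits); the toy linear models below stand in for actual pretrained-then-finetuned networks:

```python
import copy
import torch

# Toy stand-ins for two models fine-tuned from the same pretrained initialization.
base = torch.nn.Linear(32, 10)
model_a, model_b = copy.deepcopy(base), copy.deepcopy(base)
# (imagine model_a and model_b were then fine-tuned on different tasks)

x = torch.randn(4, 32)
alpha = 0.5

# Weight-space interpolation: evaluate a model with interpolated parameters.
interp = copy.deepcopy(base)
with torch.no_grad():
    for p_i, p_a, p_b in zip(interp.parameters(), model_a.parameters(), model_b.parameters()):
        p_i.copy_(alpha * p_a + (1 - alpha) * p_b)
logits_interp = interp(x)

# Output-space ensembling: average the two models' logits.
logits_ensemble = alpha * model_a(x) + (1 - alpha) * model_b(x)

# Cross-task linearity is (roughly) the observation that, for models fine-tuned from a shared
# pretrained initialization, the interpolated-weight model's internal features and logits
# approximately match the corresponding interpolation of the two models' features and logits.
```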

We can implement this as an inference-time intervention: every time a component $c$ (e.g. an attention head) writes its output $c_{\text{out}} \in \mathbb{R}^{d_{\text{model}}}$ to the residual stream, we can erase its contribution to the "refusal direction" $\hat{r}$. We can do this by computing the projection of $c_{\text{out}}$ onto $\hat{r}$, and then subtracting this projection away:

$$c'_{\text{out}} \leftarrow c_{\text{out}} - (c_{\text{out}} \cdot \hat{r})\,\hat{r}$$

Note that we are ablating the same direction at every token and every layer. By performing this ablation at every component that writes to the residual stream, we effectively prevent the model from ever representing this feature.
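A minimal sketch of that inference-time intervention, i.e. projecting a fixed unit-norm direction $\hat{r}$ out of every component's output (the direction and the hooked modules below are placeholders):

```python
import torch

d_model = 1024
r_hat = torch.randn(d_model)
r_hat = r_hat / r_hat.norm()     # the (unit-norm) "refusal direction"

def ablate_direction(output: torch.Tensor) -> torch.Tensor:
    """Remove the component of `output` along r_hat: c_out' = c_out - (c_out . r_hat) r_hat."""
    projection = (output @ r_hat).unsqueeze(-1) * r_hat
    return output - projection

# In practice this would be registered as a forward hook on every module that writes
# to the residual stream (attention heads and MLPs), at every layer and token position, e.g.:
# module.register_forward_hook(lambda mod, inp, out: ablate_direction(out))
```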

I'll note that to me this seems surprisingly spiritually similar to lines 7-8 of Algorithm 1 (on page 13) of Concept Algebra for (Score-Based) Text-Controlled Generative Models, where they 'project out' a direction corresponding to a semantic concept after each diffusion step (in a diffusion model).
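For comparison, a simplified sketch of the 'project out a concept direction after each diffusion step' idea (this is only an illustration of the operation, not a faithful reproduction of Algorithm 1; the score network, direction, and update rule are toy placeholders):

```python
import torch

d = 256
concept_direction = torch.randn(d)
concept_direction = concept_direction / concept_direction.norm()

def score_model(x: torch.Tensor, t: int) -> torch.Tensor:
    # Placeholder for a learned score / noise-prediction network.
    return -x

x = torch.randn(1, d)
for t in reversed(range(50)):
    score = score_model(x, t)
    # Project the concept direction out of the score before taking the update,
    # analogously to how the paper edits the score after each diffusion step.
    score = score - (score @ concept_direction).unsqueeze(-1) * concept_direction
    x = x + 0.1 * score + 0.05 * torch.randn_like(x)   # toy Langevin-style update
```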

This seems notable because the above paper proposes a theory for why linear representations might emerge in diffusion models and the authors seem interested in potentially connecting their findings to representations in transformers (especially in the residual stream). From a response to a review:

Application to Other Generative Models: Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!) The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.

To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!

(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with $y$. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)

Connection to Interpretability: This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the 'linear subspace hypothesis' in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.

One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic's usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.
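Spelling out the quoted exponential-family remark in the simplest case (sufficient statistic $T(y) = y$; this is my gloss on the authors' parenthetical, not their derivation):

```latex
\[
  \log p(y \mid z) \;=\; \langle \eta(z),\, y \rangle \;-\; A(\eta(z)) \;+\; \log h(y)
  \;\;\Longrightarrow\;\;
  \nabla_y \log p(y \mid z) \;=\; \eta(z) \;+\; \nabla_y \log h(y),
\]
% so, up to the base-measure term, the (Stein) score equals the natural parameter \eta(z),
% i.e. exactly the kind of finite-dimensional representation such models are trained to output.
```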

Our overall best guess is that an important role of early MLPs is to act as a “multi-token embedding”, that selects[1] the right unit of analysis from the most recent few tokens (e.g. a name) and converts this to a representation (i.e. some useful meaning encoded in an activation). We can recover different attributes of that unit (e.g. sport played) by taking linear projections, i.e. there are linear representations of attributes. Though we can’t rule it out, our guess is that there isn’t much more interpretable structure (e.g. sparsity or meaningful intermediate representations) to find in the internal mechanisms/parameters of these layers. For future mech interp work we think it likely suffices to focus on understanding how these attributes are represented in these multi-token embeddings (i.e. early-mid residual streams on a multi-token entity), using tools like probing and sparse autoencoders, and thinking of early MLPs similar to how we think of the token embeddings, where the embeddings produced may have structure (e.g. a “has space” or “positive sentiment” feature), but the internal mechanism is just a look-up table with no structure to interpret.
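A sketch of the 'recover attributes by linear projection' claim via a simple linear probe (the activations, labels, and dimensions below are placeholders for residual-stream activations on an entity's final token and an attribute like sport played):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder: residual-stream activations at an early-mid layer, taken on the last
# token of each multi-token entity (e.g. an athlete's name), plus an attribute label
# (e.g. which sport they play).
n_entities, d_model, n_sports = 2000, 2048, 5
multi_token_embeddings = np.random.randn(n_entities, d_model)
sport_labels = np.random.randint(0, n_sports, size=n_entities)

# If the attribute is linearly represented, a linear probe should recover it well.
probe = LogisticRegression(max_iter=1000).fit(
    multi_token_embeddings[:1500], sport_labels[:1500]
)
accuracy = probe.score(multi_token_embeddings[1500:], sport_labels[1500:])
```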

You may be interested in works like REMEDI and Identifying Linear Relational Concepts in Large Language Models.

Contra both the 'doomers' and the 'optimists' on (not) pausing. Rephrased: RSPs (done right) seem right.

Contra 'doomers'. Oversimplified, 'doomers' (e.g. PauseAI, FLI's letter, Eliezer) ask(ed) for pausing now or even earlier (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests that timing the pause so that we can make use of automated AI safety research could result in '[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' We clearly don't have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they're still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.

Contra 'optimists'. I'm more unsure here, but the vibe I'm getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like 'no pause ever', largely based on arguments that current systems seem easy to align / control. While I agree that current systems do seem easy to align / control, and I could even see this holding all the way up to ~human-level automated AI safety R&D, I can easily see scenarios where around that time things get scary quickly without any pause. For example, similar arguments to those about the scalability of automated AI safety R&D suggest automated AI capabilities R&D could also be scaled up significantly. Figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that even with the current relatively small population, algorithmic progress seems to double LM capabilities ~every 8 months, algorithmic progress could be much faster with 100x larger populations, potentially leading to new setups (e.g. new AI paradigms, new architectures, new optimizers, synthetic data, etc.) which could quite easily break the properties that make current systems seem relatively easy / safe to align. In this scenario, pausing to get things right (especially since automated AI safety R&D would also be feasible by then) seems like it could be crucial.

Also a positive update for me on interdisciplinary conceptual alignment being automatable differentially soon, which has seemed plausible to me for a long time: LLMs have 'read the whole internet', and interdisciplinary insights often seem (to me) to require relatively few inferential hops (plausibly because it's hard for humans to have [especially deep] expertise in many different domains), making them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).

Decomposability seems like a fundamental assumption for interpretability and a condition for it to succeed. E.g. from Toy Models of Superposition:

'Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.'

If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and sooner than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be a much shorter-horizon task than e.g. coming up with a new conceptual alignment agenda or a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).

Some theoretical results might also be relevant here, e.g. Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks.
