ICA vs SAEs ⭐

Thinking about how Sparse Auto-encoders (SAEs) aim to learn a sparse over-complete basis (where you are trying to triangulate a larger number of sources than you have signals; e.g. you only have 8 ...

May 12, 2025 statistics

Thoughts on Hidden Structure in MLP Space

After deep-diving into why SAEs succeed at retrieving superposed features, what their limitations are, and closely inspecting the hidden technical implementations of the sae_lens library, I just wa...

Apr 27, 2025 mechanistic interpretability

Feature Splitting & Feature Absorption

My previous blogpost gives a very clear visualization of what the latents of simple ReLU networks look like, how to interpret them, and a good description of optimization pressures that force them i...

Mar 14, 2025 mechanistic interpretability

Optimization Failure ⭐

In my previous post, “Superposition - An Actual Image of Latent Spaces”, I illustrate how the parameters of a toy ReLU auto-encoder ($W$, $b$, and $\text{ReLU}$) work together to allow models repr...

Nov 3, 2024 mechanistic interpretability

Superposition - An Actual View of Latent Spaces ⭐ ⭐

This post is the prequel to the next post, “Optimization Failure”, where I investigate how, even in cases where perfectly symmetric, ideal weight configurations exist, ReLU toy models (following An...

Nov 3, 2024 mechanistic interpretability

High Dimension Computing

I recently came across an old lecture on High-Dimensional (HD) computing, in the form of this Quanta Magazine article and this Stanford CS lecture by Pentti Kanerva, and thought i...

Jul 4, 2024 statistics, hd-computing

Transformers are LSTMs v2

So I recently got into a bit of an argument with 2 friends while driving in the car to get dinner. It went a little bit like this: Me: I think LSTMs are definitely the precursor to transformers, m...

Dec 11, 2023 research, ML, time-series

Automatic Reinforcement Unlearning

This past year has really got me deep-diving into mechanistic interpretability research. I think it makes so much sense as a computational science, is very fundamental and generalizab...

Dec 5, 2023 research, statistics, interpretability

Common Spatial Pattern: Discriminator based on PCA

Principal Component Analysis (PCA) is a fundamental dimension-reduction technique that we all know, used to identify the top $k$ components of a set of $d$-dimensional data points, where $k < d$. In ...

May 14, 2023 statistics

Human Writing is as Uniform as Machine Writing

Can we build a zero-shot Large Language Model (LLM) generated text detector without knowing which LLM potentially generated a given piece of text? A Stanford NLP research project (CS 224N) done i...

Mar 29, 2023 research, NLP, ML, statistics