From SMILES to graph transformers — designing a teaching curriculum

There is a real gap between the cheminformatics community and the graph machine learning community. Chemists with RDKit fluency often haven’t written a message-passing layer; ML engineers with PyTorch Geometric fluency often can’t tell a thiazole from a triazole. Each side underestimates the other’s basics, and most teaching material is written for one audience or the other rather than for the crossing.

chemical-graph-series is an open-source attempt at a single coherent path through both. Seven Jupyter notebooks, roughly nine to eleven hours end to end, starting at “what is a molecule, as a graph” and finishing at solubility prediction on ESOL and FreeSolv with scaffold-aware splits. This note is the pedagogical argument behind the lesson order, written for anyone thinking of using the series or building something similar.

The two-audience problem

A curriculum that tries to serve both audiences has to deliberately supply each side’s missing context without losing the other side. In practice that means:

For chemists, the linear-algebra and deep-learning prerequisites are introduced in chemistry terms first. The Laplacian appears as “the matrix whose eigenvectors tell you about graph diffusion”, attention appears as “weighted message passing where the weights are learned”, and so on.
For ML engineers, every architectural idea is grounded on a molecular example before it generalises. The first time GAT shows up it is computing attention coefficients over the bonds of a single small molecule, not over an abstract toy graph.

The prerequisite set is intentionally short: basic Python plus an undergraduate-level grasp of molecular structure. RDKit, graph theory, spectral methods, attention, transformers, and equivariance are all introduced from scratch. The cost of that decision is a slow first notebook; the payoff is that the later notebooks don’t quietly assume content the reader was supposed to pick up elsewhere.

The seven-lesson arc

Lesson	Title	Key concepts
01	Building graphs	SMILES parsing, RDKit, Mol-to-graph conversion, feature extraction
02	Positional encoding	Laplacian eigenvectors, random-walk PE, spectral analysis
03	GAT model	Graph Attention Networks, message passing, multi-head attention
04	Sparse attention	Efficient attention for graphs, virtual edges, locality
05	Full graph transformer	Global self-attention, edge features, deep architectures
06	Advanced graph models	GraphGPS, E(3)-equivariant networks, hybrid architectures
07	Modelling & predictions	ESOL and FreeSolv property prediction, training pipelines

The interesting design questions are in the order, not the topics. Most curricula in this area pick a similar bag of topics; the ones that work are the ones that order them so each lesson explains why the next one is needed.

Why featurisation comes first

Lesson 01 is deliberately heavier than a “setup” notebook tends to be. Featurisation is where most chemistry-flavoured ML projects either succeed or fail, and it’s the one part of the pipeline where domain knowledge is genuinely indispensable. By the end of lesson 01 a reader has chosen atom features (element, hybridisation, formal charge, aromaticity), bond features (order, conjugation, stereochemistry), and seen how each choice affects the downstream graph. Every later lesson calls back to those choices.
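To make those choices concrete, here is a minimal sketch of one-hot atom and bond featurisation. It assumes the per-atom and per-bond properties have already been read out of a toolkit such as RDKit. The vocabularies, the dictionary layout, and the function names are hypothetical, chosen for illustration, and are not taken from the series. Stereochemistry is left out to keep the sketch short.

```python
# Hypothetical vocabularies; a real project would derive these from its dataset.
ELEMENTS = ["C", "N", "O", "S", "other"]
HYBRIDISATIONS = ["sp", "sp2", "sp3", "other"]
BOND_ORDERS = [1.0, 1.5, 2.0, 3.0]  # 1.5 stands for an aromatic bond


def one_hot(value, vocabulary):
    """One-hot encode `value`; unknown values use a trailing 'other' slot if present."""
    if value not in vocabulary:
        if vocabulary[-1] != "other":
            raise ValueError(f"{value!r} not in vocabulary {vocabulary}")
        value = "other"
    return [1.0 if value == entry else 0.0 for entry in vocabulary]


def atom_features(atom):
    return (
        one_hot(atom["element"], ELEMENTS)
        + one_hot(atom["hybridisation"], HYBRIDISATIONS)
        + [float(atom["formal_charge"]), float(atom["aromatic"])]
    )


def bond_features(bond):
    return one_hot(bond["order"], BOND_ORDERS) + [float(bond["conjugated"])]


def mol_to_graph(atoms, bonds):
    """Return node features, a directed edge list, and per-edge features."""
    x = [atom_features(a) for a in atoms]
    edge_index, edge_attr = [], []
    for bond in bonds:
        i, j = bond["begin"], bond["end"]
        feats = bond_features(bond)
        # Store each bond in both directions so messages can flow either way.
        edge_index += [(i, j), (j, i)]
        edge_attr += [feats, feats]
    return x, edge_index, edge_attr


# Heavy atoms of ethanol (CCO), written out by hand for the example.
ethanol_atoms = [
    {"element": "C", "hybridisation": "sp3", "formal_charge": 0, "aromatic": False},
    {"element": "C", "hybridisation": "sp3", "formal_charge": 0, "aromatic": False},
    {"element": "O", "hybridisation": "sp3", "formal_charge": 0, "aromatic": False},
]
ethanol_bonds = [
    {"begin": 0, "end": 1, "order": 1.0, "conjugated": False},
    {"begin": 1, "end": 2, "order": 1.0, "conjugated": False},
]
x, edge_index, edge_attr = mol_to_graph(ethanol_atoms, ethanol_bonds)
```

Changing a vocabulary changes the width of every node vector downstream, which is one reason the featurisation decisions are worth making deliberately and early.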

It also forces the right mental model early: you cannot teach attention before the reader knows precisely what the model is attending over.

Why positional encoding comes before any GNN

This is the order most curricula get wrong. Graphs have no canonical ordering, so anything that pretends to be “positional” in the transformer sense has to be constructed from the graph itself. Laplacian eigenvectors and random-walk-based encodings are not optional details; they are the difference between a graph transformer that can tell atoms apart by where they sit in the molecule and one that treats all atoms with the same local features as interchangeable.

Putting positional encoding in lesson 02 — before any attention model — means the reader has the right vocabulary in hand by the time GAT shows up. It also gives a natural reason to spend time on the normalised Laplacian and spectral graph theory: the eigenvectors of $L = I - D^{-1/2} A D^{-1/2}$ are not abstract; they’re literally the positional encoding the reader is about to feed into a transformer.
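As a small worked illustration of the two encodings named above, the sketch below builds the normalised Laplacian $L = I - D^{-1/2} A D^{-1/2}$ and a random-walk encoding (the return probabilities on the diagonal of $P^k$, with $P = D^{-1} A$) for a three-atom chain. It is pure Python with hypothetical function names, and it assumes a graph with no isolated atoms. The eigenvectors of the Laplacian itself would come from a numerical eigensolver such as `numpy.linalg.eigh`, which is not shown here.

```python
def normalised_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has at least one neighbour."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [
        [(1.0 if i == j else 0.0) - adj[i][j] / (deg[i] * deg[j]) ** 0.5
         for j in range(n)]
        for i in range(n)
    ]


def random_walk_pe(adj, steps):
    """For each node, the probability of returning to itself after 1..steps hops."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    p = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]  # P = D^{-1} A
    power = [row[:] for row in p]
    pe = [[] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            pe[i].append(power[i][i])
        # Advance to the next power of the transition matrix.
        power = [
            [sum(power[i][k] * p[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return pe


# Carbon skeleton of propane: C0 - C1 - C2
propane = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
lap = normalised_laplacian(propane)
rwpe = random_walk_pe(propane, steps=3)
print(rwpe)  # [[0.0, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 0.0]]
```

The central carbon gets a different encoding from the two terminal carbons, while the terminal carbons, which are related by symmetry, get the same one. That is the behaviour a structural encoding should have.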

Why GAT before the full transformer

Local attention is the right thing to teach first because it makes the inductive bias visible. GAT only attends along bonds; the model can only know about a chemical environment by aggregating local information across layers. That constraint is the point of message passing as an inductive bias, and you can’t appreciate the trade-offs of a global graph transformer without first feeling what local attention can and can’t do.
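The locality constraint can be seen in a few lines. The sketch below computes GAT-style attention coefficients, $e_{ij} = \mathrm{LeakyReLU}(a^\top [h_i \Vert h_j])$ followed by a softmax over the neighbours of $i$, for the heavy atoms of ethanol. It is a simplification under stated assumptions: the learned linear map of a real GAT layer is omitted, the two-dimensional features and the attention vector are made-up numbers rather than trained weights, and the function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_attention(features, neighbours, attn_vector):
    """Return alpha[i][j] for j in neighbours[i] only; each row sums to 1."""
    alpha = {}
    for i, nbrs in neighbours.items():
        scores = {
            j: leaky_relu(
                sum(a * v for a, v in zip(attn_vector, features[i] + features[j]))
            )
            for j in nbrs
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        alpha[i] = {j: e / total for j, e in exps.items()}
    return alpha


# Ethanol heavy atoms C0 - C1 - O2, features = [is_carbon, is_oxygen].
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
# Bonded neighbours plus a self-loop for each atom.
neighbours = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
attn_vector = [0.1, 0.3, 0.2, 0.9]  # made-up values, not learned

alpha = gat_attention(features, neighbours, attn_vector)
```

Atom 0 has no coefficient for atom 2 at all: the oxygen can only influence the far carbon through a second layer, via atom 1.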

By the time the reader hits lesson 05, “what would change if every atom could attend to every other atom?” is no longer an abstract question.

Why sparse comes before full

Lesson 04’s sparse-attention pattern (virtual edges, locality constraints) lives between local GAT and full graph transformer for two reasons:

The scaling story is honest. Full $\mathcal{O}(n^2)$ attention is fine for a 30-atom drug-like molecule and fatal for a protein fragment. Sparse patterns aren’t a hack; they’re how graph transformers actually run on anything non-trivial.
It gives a chance to talk about virtual edges explicitly: the trick where a global token, or a small set of virtual nodes, carries long-range information without paying full-attention cost. That idea reappears in the lesson 06 hybrid architectures, so it’s worth introducing in its own right rather than smuggling it in.
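The scaling argument is easy to check by counting attention pairs. The sketch below compares three patterns: attention along bonds only, bonds plus a single virtual node connected to every atom, and full attention. An unbranched chain is used as a stand-in for a real molecule, self-pairs are ignored in the sparse patterns, and all function names are hypothetical.

```python
def local_pairs(bonds):
    """Directed attention pairs when atoms attend only along bonds."""
    return {(i, j) for a, b in bonds for i, j in ((a, b), (b, a))}


def with_virtual_node(n_atoms, bonds):
    """Local pairs plus one virtual node (index n_atoms) linked to every atom."""
    pairs = local_pairs(bonds)
    virtual = n_atoms
    for i in range(n_atoms):
        pairs.add((i, virtual))
        pairs.add((virtual, i))
    return pairs


def full_pair_count(n_atoms):
    """Number of pairs when every atom attends to every atom, itself included."""
    return n_atoms * n_atoms


def chain(n_atoms):
    """Bonds of an unbranched chain of n_atoms atoms."""
    return [(i, i + 1) for i in range(n_atoms - 1)]


for n in (30, 3000):
    bonds = chain(n)
    print(n, len(local_pairs(bonds)), len(with_virtual_node(n, bonds)), full_pair_count(n))
# 30 58 118 900
# 3000 5998 11998 9000000
```

With the virtual node, any two atoms are at most two hops apart, yet the pair count still grows linearly with the number of atoms rather than quadratically.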

Why equivariance comes late

E(3)-equivariant networks are the most mathematically dense topic in the series, and they only justify their cost once the reader has felt the absence of equivariance in earlier models. A network that consumes raw 3D coordinates without any symmetry built in will generally give different predictions for the same molecule after a rotation, because the coordinates change while the molecule does not. Seeing that empirically in lessons 05 and 06 is the right motivation for the equivariant machinery in lesson 06.
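The rotation problem can be reproduced without any neural network. In the sketch below, a fixed linear readout over raw coordinates stands in for a model with no symmetry built in, and a sum of pairwise distances stands in for a rotation-invariant one. Both "models", their weights, and the three-atom geometry are made up for illustration. The example shows invariance of a scalar prediction; equivariance proper concerns vector-valued outputs that rotate together with the input.

```python
import math


def rotate_z(coords, angle):
    """Rotate every atom about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def raw_coordinate_model(coords, weights=(0.7, -0.2, 0.4)):
    """Toy stand-in for a network fed raw xyz: a fixed linear readout."""
    return sum(w * v for atom in coords for w, v in zip(weights, atom))


def distance_model(coords):
    """Toy rotation-invariant readout: the sum of all pairwise distances."""
    return sum(
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    )


# Made-up coordinates for a bent three-atom molecule (not a real geometry).
molecule = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
rotated = rotate_z(molecule, math.pi / 3)

raw_shift = abs(raw_coordinate_model(molecule) - raw_coordinate_model(rotated))
distance_shift = abs(distance_model(molecule) - distance_model(rotated))
print(raw_shift > 0.1, distance_shift < 1e-9)  # True True
```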

The lesson does not claim equivariance is always the right choice. It spells out the trade-off (parameter efficiency and sample efficiency versus implementation complexity and runtime) and points at situations where standard GraphGPS is the better engineering call.

Why real datasets only at the end

Lesson 07 uses ESOL (aqueous solubility) and FreeSolv (free energy of hydration), with scaffold-aware splitting rather than random splitting. That’s the lesson that finally makes the point that has been hovering in the background since lesson 01: molecular ML papers that report random-split test performance are usually overstating their generalisation, because the test set contains scaffolds the model has already seen during training. Scaffold splits are harder and more honest, and they make the comparison between architectures actually informative.
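A minimal sketch of a scaffold-aware split follows. It assumes each molecule already carries a scaffold key; in practice that would usually be a Bemis-Murcko scaffold computed with a cheminformatics toolkit (RDKit provides `MurckoScaffold.MurckoScaffoldSmiles` for this), whereas here the keys are assigned by hand. The function name, the greedy largest-group-first policy, and the 80/20 target are illustrative assumptions, not necessarily what the notebooks do.

```python
from collections import defaultdict


def scaffold_split(records, train_fraction=0.8):
    """Split (smiles, scaffold) records so that no scaffold appears in both sets.

    Scaffold groups are visited largest first. A group joins the training set
    if it fits within the training budget; otherwise the whole group goes to test.
    """
    groups = defaultdict(list)
    for smiles, scaffold in records:
        groups[scaffold].append(smiles)

    budget = train_fraction * len(records)
    train, test = [], []
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for scaffold, members in ordered:
        if len(train) + len(members) <= budget:
            train += members
        else:
            test += members
    return train, test


# Scaffold keys assigned by hand for illustration.
records = [
    ("Cc1ccccc1", "c1ccccc1"),   # toluene
    ("Oc1ccccc1", "c1ccccc1"),   # phenol
    ("Nc1ccccc1", "c1ccccc1"),   # aniline
    ("Cc1ccncc1", "c1ccncc1"),   # 4-methylpyridine
    ("Oc1ccncc1", "c1ccncc1"),   # 4-hydroxypyridine
    ("OC1CCCCC1", "C1CCCCC1"),   # cyclohexanol
]
train, test = scaffold_split(records)
```

In this toy set both pyridines land in the test split together, so the model is evaluated on a ring system it never saw during training. A random split would very likely have put one of them in the training set.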

Saving the real datasets for the end also means the reader’s first encounter with a benchmark happens after they understand the model architectures they’re comparing — rather than the reverse, where benchmark numbers get presented before the reader has any intuition for what they mean.

What the series deliberately does not cover

A short list, because scope is a teaching decision too:

Generative models for molecules — diffusion, flow-matching, autoregressive SMILES generation. These are interesting and worth a separate series; they would double the length of this one.
Reaction prediction and retrosynthesis. Different data, different task framing, different evaluation.
Large-scale pretraining. The compute story for big molecular models is its own subject. The series stops at “you can train a competent property predictor on a single GPU”.
Quantum-mechanical features beyond what RDKit provides. Force fields, DFT-derived descriptors, and conformer ensembles are a deep rabbit hole that would derail the graph-learning narrative.

Scoping decisions like these are part of why the series fits in nine to eleven hours rather than ninety. A curriculum that tries to cover everything teaches nothing well.

How I’d use it

The notebooks run top-to-bottom — they are a sequence, not a reference. The most common ways I’d expect someone to use them:

As a self-study course. Block out a few evenings, run the notebooks in order, edit and re-run the cells rather than just reading them. Most of the learning happens when something breaks and you have to fix it.
As reading-group material. Each notebook is short enough for a 90-minute discussion. The later ones (05, 06) pair well with the original GraphGPS and E(3)-equivariant network papers.
As a featurisation reference. Lesson 01 is the one I link to most often outside the series — it’s the kind of thing every molecular ML project rebuilds badly from scratch.

Issues, pull requests, and corrections are welcome at github.com/CodeHalwell/chemical-graph-series. If something in a lesson is confusing or wrong, the most useful thing you can do is open an issue — teaching material gets better the more real readers it survives.