From mass spectra to molecular graphs with spectral diffusion
Mass spectrometry is one of the workhorses of analytical chemistry, but turning a tandem MS/MS spectrum back into a molecular graph is a famously under-determined problem. Many distinct structures fragment to give very similar peaks, software has been wrestling with this for decades, and the honest framing is that any single spectrum is consistent with a set of plausible candidates rather than a unique answer.
Spec2Graph — the open-source project this note is about — takes a generative angle on that problem. Instead of trying to score candidate structures from a fixed library, it learns to generate the spectral embedding of the molecule’s Laplacian directly, conditioned on the spectrum, and then reconstructs connectivity from that embedding. The code lives at github.com/CodeHalwell/Spec2Graph and this write-up explains the design.
What the model targets
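For a molecular graph with $n$ atoms, adjacency matrix $A$ and degree matrix $D$, the normalised Laplacian is

$$L = I - D^{-1/2} A D^{-1/2},$$

and the spectral embedding is the matrix $V_k \in \mathbb{R}^{n \times k}$ whose columns are the eigenvectors associated with the $k$ smallest eigenvalues of $L$. Each atom is represented by its row of $V_k$, a point on a continuous “shape manifold” of the molecule.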
This is a much friendlier target than a one-hot adjacency matrix:
- It’s continuous, so diffusion has somewhere to live.
- It’s low-rank ($k \ll n$), so the network only has to predict a few numbers per atom.
- It carries genuine structural information: distances in spectral space approximate graph diffusion distances, so spectrally close atoms are usually bonded or near-bonded.
The trade is that a spectral embedding does not let you read off bonds directly. You need a separate decoder step to turn $V_k$ back into an adjacency, and that decoder has to cope with the ambiguities of spectra, which is where most of the design effort goes.
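As a concrete illustration of the target, the sketch below builds the normalised Laplacian and its first $k$ eigenvectors for the carbon skeleton of n-butane (a four-atom path). It uses plain NumPy rather than the project's `SpectralDataProcessor`; the function names are hypothetical, and it assumes a connected graph in which every atom has at least one bond.

```python
import numpy as np


def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every atom has at least one bond."""
    adj = np.asarray(adj, dtype=float)
    inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
    return np.eye(len(adj)) - inv_sqrt_deg[:, None] * adj * inv_sqrt_deg[None, :]


def spectral_embedding(adj, k):
    """Return the k smallest eigenvalues and their eigenvectors (as columns)."""
    eigvals, eigvecs = np.linalg.eigh(normalized_laplacian(adj))  # ascending order
    return eigvals[:k], eigvecs[:, :k]


# Carbon skeleton of n-butane: C0-C1-C2-C3
butane = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
eigvals, V_k = spectral_embedding(butane, k=2)
print(V_k.shape)  # (4, 2): one row of spectral coordinates per atom
```

Padded or isolated atoms have zero degree and would need separate handling before the inverse square root is taken.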
Why eigenvectors are cursed targets
There are three independent identifiability problems stacked on top of each other:
1. Sign and rotation ambiguity. For non-degenerate eigenvalues, $V_k$ is defined only up to a sign flip on each column. For degenerate eigenvalues (very common in molecules with symmetry) the indeterminacy widens to an arbitrary orthogonal rotation within the eigenspace. A network that tries to predict raw eigenvectors is being asked to choose, consistently, one element of an equivalence class, which it cannot.
2. Cospectral graphs. Distinct graphs can share the same Laplacian spectrum. A perfect spectral predictor is therefore not a perfect structure predictor; there’s an irreducible information loss going from the graph to its Laplacian spectrum.
3. MS isomer ambiguity. The conditioning input has its own identifiability ceiling: isomeric molecules routinely produce nearly indistinguishable MS/MS spectra. The supervision signal is therefore compatible with a small set of candidate graphs, not a unique one.
A model trained to regress raw eigenvectors gets gradient noise from problem (1) on every batch, blind spots from problem (2) on a small fraction of molecules, and unavoidable conditional entropy from problem (3). The first is addressable in the loss; the second is a limitation of the representation; the third is intrinsic to the data.
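The first problem is easy to reproduce numerically. The sketch below (NumPy, illustrative only) takes the star-shaped carbon skeleton of isobutane, whose normalised Laplacian has a doubly degenerate eigenvalue, and checks that a sign-flipped or rotated basis is an equally valid set of eigenvectors while sitting far from the original in mean squared error.

```python
import numpy as np

# Carbon skeleton of isobutane: central carbon 0 bonded to carbons 1, 2, 3.
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
lap = np.eye(4) - inv_sqrt_deg[:, None] * adj * inv_sqrt_deg[None, :]

eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues 0, 1, 1, 2 (ascending)
V = eigvecs[:, 1:3]                      # a basis of the degenerate eigenspace (eigenvalue 1)

theta = 0.7                              # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V_rot = V @ Q                            # rotated basis of the same eigenspace
V_flip = -V                              # sign-flipped basis

# Both alternatives still satisfy L v = 1 * v ...
still_eigvecs = np.allclose(lap @ V_rot, V_rot) and np.allclose(lap @ V_flip, V_flip)
# ... but a naive regression loss treats them as wrong answers.
mse_rot = float(np.mean((V - V_rot) ** 2))
mse_flip = float(np.mean((V - V_flip) ** 2))
print(still_eigvecs, mse_rot, mse_flip)
```

Which of these bases an eigensolver returns is an implementation detail, so a regression target built from raw eigenvectors inherits that arbitrariness.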
The fix: train on the subspace, not the basis
The standard escape from sign/rotation ambiguity in graph spectral learning is to drop the basis altogether and supervise the projection matrix of the subspace instead:

$$P_k = V_k V_k^\top.$$

$P_k$ is invariant to any orthogonal change of basis within the eigenspace, since $(V_k Q)(V_k Q)^\top = V_k Q Q^\top V_k^\top = P_k$ for any orthogonal $Q$. That is exactly the symmetry the raw eigenvectors don’t respect.
Spec2Graph supervises $P_k$ with a projection-aware loss in `DiffusionTrainer`, backed by a `projection_matrix` helper in `SpectralDataProcessor`.
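A minimal sketch of what a projection-aware loss can look like is given below. It is not the project's `DiffusionTrainer` code: `projection_matrix` here is a hypothetical stand-in for the helper of the same name, `projection_loss` is an invented name, and the loss is assumed to be a mean squared error between projection matrices.

```python
import numpy as np


def projection_matrix(V):
    """P = V V^T for a basis V of shape [n_atoms, k] (hypothetical helper)."""
    return V @ V.T


def projection_loss(V_pred, V_true):
    """MSE between projections; unchanged by an orthogonal re-basing of either input."""
    diff = projection_matrix(V_pred) - projection_matrix(V_true)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
V_true, _ = np.linalg.qr(rng.normal(size=(6, 3)))   # random orthonormal basis, n=6, k=3
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # random orthogonal 3x3 matrix
V_rebased = V_true @ Q                              # same subspace, different basis

raw_mse = float(np.mean((V_rebased - V_true) ** 2))  # penalises the basis change
proj = projection_loss(V_rebased, V_true)            # does not
print(f"raw MSE = {raw_mse:.3f}, projection loss = {proj:.2e}")
```

The projection is an $n \times n$ matrix, so this loss costs more memory than the $n \times k$ eigenvector MSE; for the molecule sizes in the configuration further down that is unlikely to matter.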
Two practical pieces sit on top of that:
- Orthonormality regulariser. The diffusion model still produces a basis (you need columns, not just a projection, to feed downstream components). A soft penalty on $V_k^\top V_k - I$ keeps the predicted columns well-conditioned.
- Sign canonicalisation. During data preparation, each eigenvector column is multiplied by $\pm 1$ to enforce a fixed convention (e.g. the first non-zero coordinate is positive). This stabilises the marginal statistics the diffusion model sees during training, even though the loss itself is sign-invariant.
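Both pieces fit in a few lines. The sketch below is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, the penalty is taken to be a squared Frobenius norm, and "non-zero" is decided with an arbitrary tolerance.

```python
import numpy as np


def canonicalise_signs(V, tol=1e-8):
    """Flip each column so its first entry with |value| > tol is positive."""
    V = np.array(V, dtype=float)          # copy so the input is left untouched
    for j in range(V.shape[1]):
        nonzero = np.flatnonzero(np.abs(V[:, j]) > tol)
        if nonzero.size and V[nonzero[0], j] < 0:
            V[:, j] *= -1.0
    return V


def orthonormality_penalty(V):
    """Squared Frobenius norm of V^T V - I; zero iff the columns are orthonormal."""
    k = V.shape[1]
    return float(np.sum((V.T @ V - np.eye(k)) ** 2))


V = np.array([[0.0, -0.6],
              [-1.0, 0.0],
              [0.0, 0.8]])
V_canon = canonicalise_signs(V)
print(V_canon[:, 0], V_canon[:, 1])       # both columns now lead with a positive entry
print(orthonormality_penalty(V_canon))    # approximately zero: columns are orthonormal
```

Sign canonicalisation only removes the per-column sign freedom; it does nothing about rotations inside a degenerate eigenspace, which is why the subspace loss is still needed.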
Problem (2) — cospectrality — is left as an explicit limitation of the approach. Spec2Graph is not trying to be a unique-structure oracle; it’s trying to produce a plausible spectral embedding consistent with the observed spectrum.
Architecture
The pipeline has four components, each doing one job:
| Component | Role |
|---|---|
| `SpectralDataProcessor` | SMILES → Laplacian → top-$k$ eigenvectors with sign canonicalisation. |
| `FourierMzEmbedding` | Encodes peak positions using Fourier features. |
| `Spec2GraphDiffusion` | Transformer encoder–decoder denoiser, conditioned on the spectrum. |
| `DiffusionTrainer` | DDPM training loop with projection-aware loss and ancestral sampling. |
The transformer pattern is the familiar one. The encoder ingests the
peaks (m/z, intensity) with Fourier embeddings and produces a
spectrum context. The decoder receives the current noisy
eigenvector matrix $x_t$ plus a sinusoidal timestep embedding,
cross-attends to the spectrum context, and predicts the noise (DDPM $\epsilon$-prediction).
```python
config = Spec2GraphDiffusionConfig(
    d_model=256,
    nhead=8,
    num_encoder_layers=4,
    num_decoder_layers=4,
    dim_feedforward=1024,
    k=8,  # number of eigenvectors / spectral coordinates per atom
    max_atoms=64,
    max_peaks=100,
    dropout=0.1,
)
model = Spec2GraphDiffusion(config)
```
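For readers unfamiliar with Fourier features, the following sketch shows one common way to encode peak positions before they reach the encoder. It is an assumption about the general technique, not a description of `FourierMzEmbedding`: the function name, the log-spaced periods and the example m/z values are all made up for illustration.

```python
import numpy as np


def fourier_mz_features(mz, num_freqs=8, min_period=0.01, max_period=1000.0):
    """Map m/z values of shape [n_peaks] to [n_peaks, 2 * num_freqs] sin/cos features."""
    mz = np.asarray(mz, dtype=float)
    # Log-spaced periods (in m/z units) so both fine and coarse mass differences show up.
    periods = np.geomspace(min_period, max_period, num_freqs)
    angles = 2.0 * np.pi * mz[:, None] / periods[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


peaks_mz = [91.0542, 119.0491, 147.0441]   # illustrative values only
features = fourier_mz_features(peaks_mz)
print(features.shape)  # (3, 16)
```

In a model, features like these would typically be passed through a linear layer to reach `d_model` and combined with an intensity embedding.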
Two auxiliary heads run alongside the main eigenvector denoising:
- A Morgan fingerprint head: supervises a coarse chemical descriptor of the molecule, anchoring the spectrum encoder to some chemistry even when the eigenvector loss is dominated by the noise-schedule curriculum.
- An atom-count head: predicts $n$ from the spectrum, used both as a weak auxiliary loss and at inference time to size the generated embedding correctly.
Masks (`atom_mask`, `spectrum_mask`) carry `True` for valid entries and
are validated at runtime, so a batch item with no valid atoms or no
valid peaks fails loudly rather than silently producing garbage
attention.
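A runtime check of this kind might look like the sketch below. The function name and error messages are hypothetical, and NumPy arrays stand in for the tensors the model actually receives; only the `True` = valid convention is taken from the text above.

```python
import numpy as np


def validate_masks(atom_mask, spectrum_mask):
    """Raise if any batch item has no valid atoms or no valid peaks (True = valid)."""
    atom_mask = np.asarray(atom_mask)
    spectrum_mask = np.asarray(spectrum_mask)
    if atom_mask.dtype != bool or spectrum_mask.dtype != bool:
        raise TypeError("masks must be boolean, with True marking valid entries")
    if atom_mask.shape[0] != spectrum_mask.shape[0]:
        raise ValueError("atom_mask and spectrum_mask disagree on batch size")
    empty_atoms = np.flatnonzero(~atom_mask.any(axis=1))
    empty_peaks = np.flatnonzero(~spectrum_mask.any(axis=1))
    if empty_atoms.size:
        raise ValueError(f"batch items with no valid atoms: {empty_atoms.tolist()}")
    if empty_peaks.size:
        raise ValueError(f"batch items with no valid peaks: {empty_peaks.tolist()}")


atom_mask = np.array([[True, True, False], [True, False, False]])
good_peaks = np.array([[True, True], [True, False]])
bad_peaks = np.array([[True, True], [False, False]])

validate_masks(atom_mask, good_peaks)      # passes silently
try:
    validate_masks(atom_mask, bad_peaks)
except ValueError as err:
    print(err)                             # batch items with no valid peaks: [1]
```

The dtype check matters because attention APIs differ on whether `True` means "keep" or "ignore"; an integer mask passed by mistake is a common source of inverted padding.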
The diffusion process
Standard DDPM, applied to the eigenvector tensor:
- Forward: gradually add Gaussian noise to $V_k$ over `n_timesteps` (default `1000`) with a linear schedule from `beta_start = 1e-4` to `beta_end = 0.02`.
- Reverse: the transformer predicts the noise added at each step, conditioned on the spectrum context.
- Loss: a weighted sum of (a) the standard $\epsilon$-prediction MSE on the noisy eigenvectors, (b) the projection-aware loss on $P_k$, (c) the orthonormality penalty, and (d) the two auxiliary heads.
```python
config = TrainerConfig(
    n_timesteps=1000,
    beta_start=1e-4,
    beta_end=2e-2,
    projection_loss_weight=1.0,  # subspace-invariant P_k supervision
)
trainer = DiffusionTrainer(model=model, config=config, device="cuda")

# Single training step
loss = trainer.train_step(optimizer, x_0=V_k, mz=mz, intensity=intensity)
```
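What the forward process does to the eigenvector matrix can be shown without the trainer. The sketch below is generic DDPM in NumPy using the schedule values quoted above; the function names are hypothetical, the input is random data standing in for a real $[n, k]$ eigenvector matrix, and the zero "prediction" is only a placeholder for the denoiser's output.

```python
import numpy as np


def linear_beta_schedule(n_timesteps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule and the cumulative products alpha_bar_t used by DDPM."""
    betas = np.linspace(beta_start, beta_end, n_timesteps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def q_sample(x_0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; returns the noisy sample and the noise used."""
    eps = rng.standard_normal(x_0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x_0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


rng = np.random.default_rng(0)
betas, alpha_bars = linear_beta_schedule()
x_0 = rng.standard_normal((5, 2))            # stand-in for a [n_atoms, k] eigenvector matrix
x_t, eps = q_sample(x_0, t=500, alpha_bars=alpha_bars, rng=rng)

eps_pred = np.zeros_like(eps)                # placeholder for the denoiser's output
eps_mse = float(np.mean((eps_pred - eps) ** 2))   # the epsilon-prediction term of the loss
print(alpha_bars[0], alpha_bars[-1])         # close to 1 at the first step, close to 0 at the last
```

Because `alpha_bars` shrinks towards zero, the final noisy sample is almost pure Gaussian noise, which is what lets sampling start from a standard normal draw.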
At sampling time the trainer runs ancestral DDPM in reverse: start from Gaussian noise $x_T \sim \mathcal{N}(0, I)$ at the correct atom count, denoise step-by-step under spectrum conditioning, and return $\hat{V}_k$ ready for downstream reconstruction.
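The reverse loop itself is short. The following is a generic ancestral sampler in NumPy, not the trainer's own sampling method: `ancestral_sample` and `dummy_eps_model` are hypothetical names, the dummy denoiser ignores the spectrum entirely, and the variance choice $\sigma_t^2 = \beta_t$ is one of the standard DDPM options assumed here.

```python
import numpy as np


def ancestral_sample(eps_model, shape, betas, rng):
    """DDPM ancestral sampling: start from N(0, I) and apply the reverse step T times."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # x_T
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)                        # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0   # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise             # sigma_t^2 = beta_t variant
    return x


def dummy_eps_model(x, t):
    """Placeholder denoiser that ignores the spectrum; a trained model would go here."""
    return np.zeros_like(x)


rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 50)                      # short schedule to keep the demo fast
n_atoms, k = 6, 2                                        # n_atoms would come from the atom-count head
V_hat = ancestral_sample(dummy_eps_model, (n_atoms, k), betas, rng)
print(V_hat.shape)  # (6, 2)
```

With the placeholder denoiser the output is meaningless noise; the point is only the control flow, including the fact that the atom count has to be fixed before sampling starts.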
From spectral embedding to bonds
A predicted $V_k$ is not yet a graph; it’s coordinates. The current implementation uses a transformer decoder to produce bond logits from the eigenvectors, which is the simplest thing that works and which inherits the encoder–decoder attention machinery already in the model.
A planned alternative — a Spectral Graph Neural Operator (SGNO) — treats the eigenvectors as coordinates on a continuous manifold and learns a kernel that turns pairs of coordinates into a bond potential:
```python
import torch
from torch import nn


class SpectralKernel(nn.Module):
    """Learnable kernel over spectral coordinates."""

    def __init__(self, spectral_dim, hidden_dim=64):
        super().__init__()
        self.kernel_mlp = nn.Sequential(
            nn.Linear(2 * spectral_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, E):
        # E: [batch, n_atoms, spectral_dim]
        n_atoms = E.shape[1]
        # Broadcast to every (i, j) atom pair: [batch, n_atoms, n_atoms, spectral_dim]
        E_i = E.unsqueeze(2).expand(-1, n_atoms, n_atoms, -1)
        E_j = E.unsqueeze(1).expand(-1, n_atoms, n_atoms, -1)
        grid = torch.cat([E_i, E_j], dim=-1)
        pot = self.kernel_mlp(grid).squeeze(-1)  # [batch, n_atoms, n_atoms]
        return (pot + pot.transpose(1, 2)) / 2  # symmetric bond potential
```
Three properties make the SGNO interesting:
- Smoothness in spectral space. Atoms with similar spectral coordinates get similar predicted connectivity, which is exactly the inductive bias the Laplacian embedding was constructed for.
- Global context for free. The kernel sees long-range structure through the spectral coordinates, without needing explicit message-passing depth to propagate it.
- Resolution invariance. The kernel is a function of pairwise spectral coordinates, not of a fixed atom-count grid, so the same decoder applies to molecules of any size.
The cost is the dense $O(n^2)$ expansion of the kernel grid. For
larger systems, chunking the rows or switching to a distance-based
kernel (e.g. `torch.cdist` on the spectral coordinates) keeps memory
manageable. That tradeoff sits in the SGNO module rather than the
diffusion model itself, so the two stages can evolve independently.
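To make the chunking idea concrete, here is a small NumPy sketch of a distance-based potential evaluated a few rows at a time. It is a stand-in under several assumptions: a fixed Gaussian kernel replaces the learnable one, NumPy broadcasting replaces `torch.cdist`, and the function name, chunk size and length scale are all hypothetical.

```python
import numpy as np


def chunked_distance_potential(E, chunk_size=2, length_scale=0.5):
    """Gaussian bond potential exp(-d^2 / (2 * length_scale^2)) computed in row chunks.

    E has shape [n_atoms, spectral_dim]. Only a [chunk_size, n_atoms, spectral_dim]
    difference tensor is alive at any time, instead of the full n x n x d grid.
    """
    n_atoms = E.shape[0]
    pot = np.empty((n_atoms, n_atoms))
    for start in range(0, n_atoms, chunk_size):
        rows = E[start:start + chunk_size]                                  # [c, d]
        sq_dist = ((rows[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)    # [c, n]
        pot[start:start + chunk_size] = np.exp(-sq_dist / (2.0 * length_scale ** 2))
    return pot


rng = np.random.default_rng(0)
E = rng.standard_normal((5, 3))
pot = chunked_distance_potential(E)
print(pot.shape, np.allclose(pot, pot.T))
```

The output matrix is still $n \times n$; chunking only bounds the size of the intermediate tensor, which is where the concatenated-pair grid in the learnable kernel becomes expensive.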
What I keep an eye on
A short list of failure modes and open questions, in roughly the order they tend to bite:
- Projection loss vs. eigenvector loss balance. Too much weight on the eigenvector MSE and the model starts trying to fit a single basis representative, fighting the sign/rotation symmetry; too little and the columns drift away from orthonormality. The current default of `projection_loss_weight = 1.0` is a working starting point but not obviously optimal across atom counts.
- Cospectral mass. What fraction of the training set lies on cospectral collisions, and does the model degrade gracefully on those? Cospectrality is fundamental, but it’s worth knowing how often it actually bites.
- MS isomer plateaus. The conditional distribution of structures given a spectrum has support on multiple isomers. The metric I want is calibrated coverage of that set: a top-$k$ similarity over reverse-diffusion samples is a better fit than a single-point accuracy.
- Padding pathologies. Mask conventions are checked at runtime, but a small bug in mask construction can still silently corrupt cross-attention. Padding tests stay near the top of the test suite for that reason.
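One way to read the top-$k$ metric from the isomer item is sketched below in plain Python. It assumes that "similarity" means Tanimoto similarity between fingerprints represented as sets of on-bit indices; the function names and toy fingerprints are hypothetical, and real fingerprints would come from structures decoded out of the sampled embeddings.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def top_k_max_similarity(sample_fps, reference_fp, k):
    """Best Tanimoto similarity to the reference among the first k sampled candidates."""
    return max(tanimoto(fp, reference_fp) for fp in sample_fps[:k])


# Toy fingerprints (sets of on-bit indices), one per reverse-diffusion sample.
reference = {1, 4, 7, 9}
samples = [{1, 2, 3}, {1, 4, 7}, {1, 4, 7, 9}]

for k in (1, 2, 3):
    print(k, top_k_max_similarity(samples, reference, k))
```

Averaged over a test set, the curve of this score against the number of samples shows how many draws are needed before the true structure, or a close isomer, appears among the hypotheses.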
Where this is going
The next concrete milestones are honest evaluation on a public dataset
and a real comparison between the transformer decoder and the SGNO
kernel. The recently merged work integrates the MassSpecGym benchmark
(dataset, eval harness, and SGNO training), so the next set of
results will be on that, rather than on the synthetic padded batches
in `spectral_diffusion.py`.
Spectral diffusion is not going to “solve” MS-to-structure on its own; the limits are partly information-theoretic and partly chemistry-shaped. What it can plausibly do is produce a calibrated set of structure hypotheses for a given spectrum, in a representation that downstream chemists and other models can both read.