CoDHe Labs

From mass spectra to molecular graphs with spectral diffusion

Mass spectrometry is one of the workhorses of analytical chemistry, but turning a tandem MS/MS spectrum back into a molecular graph is a famously under-determined problem. Many distinct structures fragment to give very similar peaks, software has been wrestling with this for decades, and the honest framing is that any single spectrum is consistent with a set of plausible candidates rather than a unique answer.

Spec2Graph — the open-source project this note is about — takes a generative angle on that problem. Instead of trying to score candidate structures from a fixed library, it learns to generate the spectral embedding of the molecule’s Laplacian directly, conditioned on the spectrum, and then reconstructs connectivity from that embedding. The code lives at github.com/CodeHalwell/Spec2Graph and this write-up explains the design.

What the model targets

For a molecular graph $G$ with normalised Laplacian

$$L = I - D^{-1/2} A \, D^{-1/2}$$

(where $A$ is the adjacency matrix and $D$ the diagonal degree matrix), the spectral embedding is the matrix $V_k \in \mathbb{R}^{n \times k}$ whose columns are the eigenvectors associated with the $k$ smallest eigenvalues of $L$. Each atom is represented by its row of $V_k$: a point on a continuous “shape manifold” of the molecule.

This is a much friendlier target than a one-hot adjacency matrix:

  • It’s continuous, so diffusion has somewhere to live.
  • It’s low-rank (k ≪ n), so the network only has to predict a few numbers per atom.
  • It carries genuine structural information — distances in spectral space approximate graph diffusion distances, so spectrally close atoms are usually bonded or near-bonded.

The trade is that a spectral embedding does not let you read off bonds directly. You need a separate decoder step to turn $V_k$ back into an adjacency, and that decoder has to cope with the ambiguities of spectra, which is where most of the design effort goes.
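
To make the target concrete, here is a minimal NumPy sketch of the embedding defined above. It is not the project's SpectralDataProcessor: the function names are illustrative, and the input is a hand-written adjacency matrix (the heavy-atom skeleton of n-butane) rather than a parsed SMILES string.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    # Guard isolated atoms (degree 0) against division by zero.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return np.eye(A.shape[0]) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def spectral_embedding(A, k):
    """Return the k smallest eigenvalues of L and the matching eigenvectors V_k."""
    evals, evecs = np.linalg.eigh(normalized_laplacian(A))  # ascending eigenvalues
    return evals[:k], evecs[:, :k]

# Hypothetical input: a 4-atom path C-C-C-C (n-butane without hydrogens).
A_butane = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

evals, V_k = spectral_embedding(A_butane, k=2)
# Row i of V_k is the 2-dimensional spectral coordinate of atom i.
```

`np.linalg.eigh` is used because $L$ is symmetric; it returns eigenvalues in ascending order, so the first $k$ columns are the ones the text calls $V_k$.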

Why eigenvectors are cursed targets

There are three independent identifiability problems stacked on top of each other:

  1. Sign and rotation ambiguity. For non-degenerate eigenvalues, each column of $V_k$ is defined only up to a sign flip. For degenerate eigenvalues (very common in molecules with symmetry) the indeterminacy widens to an arbitrary orthogonal rotation within the eigenspace. A network that tries to predict raw eigenvectors is being asked to choose, consistently, one element of an equivalence class, which it cannot do.
  2. Cospectral graphs. Distinct graphs can share the same Laplacian spectrum. A perfect spectral predictor is therefore not a perfect structure predictor; there’s an irreducible information loss going from $G$ to $\{\lambda_i, V_k\}$.
  3. MS isomer ambiguity. The conditioning input has its own identifiability ceiling: isomeric molecules routinely produce nearly indistinguishable MS/MS spectra. The supervision signal is therefore compatible with a small set of candidate graphs, not a unique one.

A model trained to regress raw eigenvectors gets gradient noise from problem (1) on every batch, blind spots from problem (2) on a small fraction of molecules, and unavoidable conditional entropy from problem (3). Only the first is addressable in the loss; the second is a limit of the spectral representation itself, and the third is intrinsic to the data.
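
Problem (2) is easy to see on a toy pair. The sketch below, which is an illustration and not taken from the repository, compares the star graph on four atoms (an isobutane-like skeleton) with the 4-cycle (a cyclobutane-like skeleton). The two have different bond counts, so mass alone would separate them in practice; the point is only that the normalised Laplacian eigenvalues do not.

```python
import numpy as np

def normalized_laplacian_spectrum(A):
    """Eigenvalues of I - D^{-1/2} A D^{-1/2}; assumes no isolated atoms."""
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.linalg.eigvalsh(L)

# Star: one central atom bonded to three others.
A_star = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Ring: four atoms in a cycle.
A_ring = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

spec_star = normalized_laplacian_spectrum(A_star)
spec_ring = normalized_laplacian_spectrum(A_ring)
same_spectrum = np.allclose(spec_star, spec_ring)  # the eigenvalues coincide
```

The eigenvectors of the two graphs still differ, so a model that predicts $V_k$ sees more than the eigenvalues do, but only as much as the $k$ retained columns can carry.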

The fix: train on the subspace, not the basis

The standard escape from sign/rotation ambiguity in graph spectral learning is to drop the basis altogether and supervise the projection matrix of the subspace instead:

$$P_k = V_k V_k^{\top} \in \mathbb{R}^{n \times n}.$$

$P_k$ is invariant to any orthogonal change of basis within the eigenspace, which is exactly the symmetry the raw eigenvectors don’t respect. Spec2Graph supervises with a projection-aware loss in DiffusionTrainer, backed by a projection_matrix helper in SpectralDataProcessor.
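
The invariance can be checked numerically. The following is a sketch under stated assumptions: the `projection_matrix` and `projection_loss` functions here are stand-ins written for this note, not the repository's implementations, and the mean-squared form of the loss is an assumption. The test graph is a 6-ring, whose normalised Laplacian has a degenerate eigenvalue pair.

```python
import numpy as np

def projection_matrix(V):
    """P = V V^T, the orthogonal projector onto the column space of V."""
    return V @ V.T

def projection_loss(V_pred, V_true):
    """Mean squared difference between projectors (assumed loss form)."""
    diff = projection_matrix(V_pred) - projection_matrix(V_true)
    return float(np.mean(diff ** 2))

# 6-ring: every atom has degree 2, so L = I - A / 2.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.eye(n) - A / 2.0
evals, evecs = np.linalg.eigh(L)
V = evecs[:, :3]  # eigenvalue 0 plus the degenerate pair at 0.5

# An equally valid basis: flip the sign of column 0 and rotate
# the two degenerate columns by an arbitrary angle.
theta = 0.7
c, s = np.cos(theta), np.sin(theta)
R = np.array([
    [-1.0, 0.0, 0.0],
    [0.0, c, -s],
    [0.0, s, c],
])
V_alt = V @ R

basis_mse = float(np.mean((V_alt - V) ** 2))   # large: the bases disagree
subspace_loss = projection_loss(V_alt, V)      # ~0: the subspaces agree
```

A plain MSE on the eigenvectors would penalise `V_alt` even though it is a correct answer; the projector-based loss does not.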

Two practical pieces sit on top of that:

  • Orthonormality regulariser. The diffusion model still produces a basis (you need columns, not just a projection, to feed downstream components). A soft penalty on $\hat V_k^{\top} \hat V_k - I$ keeps the predicted columns well-conditioned.
  • Sign canonicalisation. During data preparation, each eigenvector column is multiplied by $\pm 1$ to enforce a fixed convention (e.g. the first non-zero coordinate is positive). This stabilises the marginal statistics the diffusion model sees during training, even though the loss itself is sign-invariant.
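
Both pieces are short enough to sketch. The code below is illustrative only: the function names, the tolerance, and the squared-Frobenius form of the penalty are assumptions, and the convention used is the "first non-zero coordinate is positive" example from the text.

```python
import numpy as np

def canonicalise_signs(V, tol=1e-8):
    """Flip each column so its first entry with |value| > tol is positive."""
    V = V.copy()
    for j in range(V.shape[1]):
        nonzero = np.flatnonzero(np.abs(V[:, j]) > tol)
        if nonzero.size and V[nonzero[0], j] < 0:
            V[:, j] = -V[:, j]
    return V

def orthonormality_penalty(V_hat):
    """Squared Frobenius norm of V^T V - I (assumed penalty form)."""
    k = V_hat.shape[1]
    gram = V_hat.T @ V_hat
    return float(np.sum((gram - np.eye(k)) ** 2))

# Two orthonormal columns with "wrong" signs.
V = np.array([
    [0.0, -0.6],
    [-1.0, 0.0],
    [0.0, 0.8],
])
V_canon = canonicalise_signs(V)

penalty_exact = orthonormality_penalty(V_canon)         # orthonormal basis
penalty_scaled = orthonormality_penalty(1.1 * V_canon)  # slightly mis-scaled basis
```

Sign canonicalisation fixes only the per-column sign; it does nothing about rotations inside a degenerate eigenspace, which is why the projection-based loss is still needed.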

Problem (2) — cospectrality — is left as an explicit limitation of the approach. Spec2Graph is not trying to be a unique-structure oracle; it’s trying to produce a plausible spectral embedding consistent with the observed spectrum.

Architecture

The pipeline has four components, each doing one job:

| Component | Role |
| --- | --- |
| SpectralDataProcessor | SMILES → Laplacian → the $k$ lowest-eigenvalue eigenvectors, with sign canonicalisation. |
| FourierMzEmbedding | Encodes $m/z$ peak positions using Fourier features. |
| Spec2GraphDiffusion | Transformer encoder–decoder denoiser, conditioned on the spectrum. |
| DiffusionTrainer | DDPM training loop with projection-aware loss and ancestral sampling. |

The transformer pattern is the familiar one. The encoder ingests peaks ($m/z$, intensity) with Fourier $m/z$ embeddings; it produces a spectrum context $Z$. The decoder receives the current noisy eigenvector matrix $\hat V_k^{(t)}$ plus a sinusoidal timestep embedding, cross-attends to $Z$, and predicts the noise (DDPM $\epsilon$-prediction).

config = Spec2GraphDiffusionConfig(
    d_model=256,
    nhead=8,
    num_encoder_layers=4,
    num_decoder_layers=4,
    dim_feedforward=1024,
    k=8,            # number of eigenvectors / spectral coordinates per atom
    max_atoms=64,
    max_peaks=100,
    dropout=0.1,
)
model = Spec2GraphDiffusion(config)
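
For readers unfamiliar with Fourier features, the idea behind the $m/z$ encoding can be sketched as follows. This is not the repository's FourierMzEmbedding: the function name, the number of frequencies, and the geometrically spaced periods are all assumptions made for illustration.

```python
import numpy as np

def fourier_mz_features(mz, num_freqs=8, min_period=0.01, max_period=1000.0):
    """Map each m/z value to sin/cos features at several length scales.

    mz: array of shape [batch, n_peaks]; returns [batch, n_peaks, 2 * num_freqs].
    """
    mz = np.asarray(mz, dtype=float)
    # Geometrically spaced periods cover fine mass differences and coarse ranges.
    periods = np.geomspace(min_period, max_period, num_freqs)
    angles = 2.0 * np.pi * mz[..., None] / periods
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Hypothetical spectrum: two peaks plus one zero-padded slot.
mz = np.array([[91.054, 119.049, 0.0]])
feats = fourier_mz_features(mz)
```

Note that the padded slot at $m/z = 0$ still maps to a non-zero feature vector (all cosines equal 1), so padding has to be handled by the spectrum mask rather than by the embedding.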

Two auxiliary heads run alongside the main eigenvector denoising:

  • A Morgan fingerprint head: supervises a coarse chemical descriptor of the molecule, anchoring the spectrum encoder to some chemistry even when the eigenvector loss is dominated by the noise-schedule curriculum.
  • An atom-count head: predicts $n$ from the spectrum, used both as a weak auxiliary loss and at inference time to size the generated embedding correctly.

Masks (atom_mask, spectrum_mask) carry True for valid entries and are validated at runtime, so a batch item with no valid atoms or no valid peaks fails loudly rather than silently producing garbage attention.
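
A runtime check of this kind might look like the sketch below. It is a NumPy stand-in written for this note, not the project's validation code; the function name and error messages are hypothetical, and only the True-means-valid convention comes from the text.

```python
import numpy as np

def validate_masks(atom_mask, spectrum_mask):
    """Raise if any batch item has no valid atoms or no valid peaks.

    Both masks are boolean arrays of shape [batch, length], True = valid entry.
    """
    atom_mask = np.asarray(atom_mask)
    spectrum_mask = np.asarray(spectrum_mask)
    for name, mask in (("atom_mask", atom_mask), ("spectrum_mask", spectrum_mask)):
        if mask.dtype != np.bool_:
            raise TypeError(f"{name} must be boolean, got {mask.dtype}")
        if mask.ndim != 2:
            raise ValueError(f"{name} must have shape [batch, length]")
        empty = np.flatnonzero(~mask.any(axis=1))
        if empty.size:
            raise ValueError(f"{name} has no valid entries for batch items {empty.tolist()}")
    if atom_mask.shape[0] != spectrum_mask.shape[0]:
        raise ValueError("atom_mask and spectrum_mask disagree on batch size")

atom_mask = np.array([[True, True, False], [True, False, False]])
spectrum_mask = np.array([[True, True], [True, False]])
validate_masks(atom_mask, spectrum_mask)  # passes silently
```

A fully masked row matters because softmax over an all-masked attention row is undefined, which typically surfaces as NaNs several layers later rather than at the source.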

The diffusion process

Standard DDPM, applied to the eigenvector tensor:

  • Forward: gradually add Gaussian noise to $V_k$ over n_timesteps (default 1000) with a linear $\beta$ schedule from beta_start = 1e-4 to beta_end = 0.02.
  • Reverse: the transformer predicts the noise added at each step, conditioned on the spectrum context.
  • Loss: a weighted sum of (a) the standard $\epsilon$-prediction MSE on the noisy eigenvectors, (b) the projection-aware loss on $\hat P_k - P_k$, (c) the orthonormality penalty, and (d) the two auxiliary heads.

config = TrainerConfig(
    n_timesteps=1000,
    beta_start=1e-4,
    beta_end=2e-2,
    projection_loss_weight=1.0,  # subspace-invariant P_k supervision
)
trainer = DiffusionTrainer(model=model, config=config, device="cuda")

# Single training step
loss = trainer.train_step(optimizer, x_0=V_k, mz=mz, intensity=intensity)
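
What the forward process does to $V_k$ can be written out in a few lines. This is a generic DDPM sketch in NumPy using the schedule values quoted above, not the internals of DiffusionTrainer; `q_sample` is an illustrative name and the orthonormal matrix stands in for a real spectral embedding.

```python
import numpy as np

n_timesteps, beta_start, beta_end = 1000, 1e-4, 2e-2
betas = np.linspace(beta_start, beta_end, n_timesteps)  # linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)                # \bar{alpha}_t

def q_sample(x_0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form and return the noise used."""
    eps = rng.standard_normal(x_0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x_0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
# Stand-in for V_k: a random 12 x 8 matrix with orthonormal columns.
V_k = np.linalg.qr(rng.standard_normal((12, 8)))[0]
x_t, eps = q_sample(V_k, t=500, rng=rng)

# With epsilon-prediction, the denoiser is trained so that its output
# matches `eps`; a placeholder prediction of zeros gives this MSE term:
eps_hat = np.zeros_like(eps)
eps_mse = float(np.mean((eps_hat - eps) ** 2))
```

By the last timestep `alphas_cumprod` is close to zero, so $x_T$ is essentially pure Gaussian noise, which is what lets sampling start from $\mathcal{N}(0, I)$.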

At sampling time the trainer runs ancestral DDPM in reverse: start from $\mathcal{N}(0, I)$ at the correct atom count, denoise step-by-step under spectrum conditioning, and return $\hat V_k$ ready for downstream reconstruction.
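
The reverse loop has the standard DDPM shape. The sketch below is a generic version with the variance choice $\sigma_t^2 = \beta_t$, which is an assumption; the trainer's actual sampler may differ. Spectrum conditioning is hidden inside the `eps_model` callable, and an "oracle" denoiser that knows the clean target is used only to check the update rule.

```python
import numpy as np

def make_schedule(n_timesteps, beta_start=1e-4, beta_end=2e-2):
    betas = np.linspace(beta_start, beta_end, n_timesteps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def ancestral_sample(eps_model, n_atoms, k, rng, n_timesteps):
    """Run x_T -> x_0 using a noise-prediction model eps_model(x_t, t)."""
    betas, alphas, a_bar = make_schedule(n_timesteps)
    x = rng.standard_normal((n_atoms, k))  # start from N(0, I)
    for t in range(n_timesteps - 1, -1, -1):
        eps_hat = eps_model(x, t)
        # Posterior mean implied by the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - a_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean  # no noise is added on the final step
    return x

def oracle_eps_model(x_0, a_bar):
    """Test-only denoiser that returns the exact noise for a known x_0."""
    def eps_model(x_t, t):
        return (x_t - np.sqrt(a_bar[t]) * x_0) / np.sqrt(1.0 - a_bar[t])
    return eps_model

rng = np.random.default_rng(0)
n_timesteps = 50  # shortened for the example
x_0 = np.linalg.qr(rng.standard_normal((5, 8)))[0][:, :4]  # 5 atoms, k = 4
_, _, a_bar = make_schedule(n_timesteps)
sample = ancestral_sample(oracle_eps_model(x_0, a_bar), 5, 4, rng, n_timesteps)
```

With a perfect noise predictor the final step lands on the clean target, which is a convenient unit test for the update formula before a learned model is plugged in.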

From spectral embedding to bonds

A predicted $\hat V_k$ is not yet a graph; it’s coordinates. The current implementation uses a transformer decoder to produce bond logits from the eigenvectors, which is the simplest thing that works and which inherits the encoder–decoder attention machinery already in the model.

A planned alternative — a Spectral Graph Neural Operator (SGNO) — treats the eigenvectors as coordinates on a continuous manifold and learns a kernel that turns pairs of coordinates into a bond potential:

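import torch
import torch.nn as nn


class SpectralKernel(nn.Module):
    """Learnable kernel over spectral coordinates."""

    def __init__(self, spectral_dim, hidden_dim=64):
        super().__init__()
        # MLP that scores an ordered pair (E_i, E_j) of concatenated coordinates.
        self.kernel_mlp = nn.Sequential(
            nn.Linear(2 * spectral_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, E):
        # E: [batch, n_atoms, spectral_dim]
        n_atoms = E.shape[1]
        # Broadcast to every (i, j) pair: [batch, n_atoms, n_atoms, spectral_dim].
        E_i = E.unsqueeze(2).expand(-1, n_atoms, n_atoms, -1)
        E_j = E.unsqueeze(1).expand(-1, n_atoms, n_atoms, -1)
        grid = torch.cat([E_i, E_j], dim=-1)      # [batch, n_atoms, n_atoms, 2 * spectral_dim]
        pot = self.kernel_mlp(grid).squeeze(-1)   # [batch, n_atoms, n_atoms]
        # Average with the transpose so that pot[i, j] == pot[j, i].
        return (pot + pot.transpose(1, 2)) / 2    # symmetric bond potential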

Three properties make the SGNO interesting:

  • Smoothness in spectral space. Atoms with similar spectral coordinates get similar predicted connectivity, which is exactly the inductive bias the Laplacian embedding was constructed for.
  • Global context for free. The kernel sees long-range structure through the spectral coordinates, without needing explicit message-passing depth to propagate it.
  • Resolution invariance. The kernel is a function of pairwise spectral coordinates, not of a fixed atom-count grid, so the same decoder applies to molecules of any size.

The cost is the dense $n \times n$ expansion of the kernel grid. For larger systems, chunking the rows or switching to a distance-based kernel (e.g. torch.cdist on the spectral coordinates) keeps memory manageable. That tradeoff sits in the SGNO module rather than the diffusion model itself, so the two stages can evolve independently.
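
A distance-based variant with row chunking could look like the sketch below. It is written in NumPy for a single molecule and uses a fixed Gaussian kernel with a hypothetical `length_scale`, so it is a stand-in for the idea rather than the planned SGNO module; in PyTorch the pairwise distances would come from torch.cdist.

```python
import numpy as np

def pairwise_sq_dists(E_rows, E_all):
    """Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b."""
    sq = (
        (E_rows ** 2).sum(axis=-1)[:, None]
        + (E_all ** 2).sum(axis=-1)[None, :]
        - 2.0 * E_rows @ E_all.T
    )
    return np.maximum(sq, 0.0)  # clip tiny negatives from rounding

def chunked_bond_potential(E, length_scale=0.5, chunk_size=256):
    """Gaussian kernel on spectral coordinates, computed a block of rows at a time.

    E: [n_atoms, spectral_dim]; returns a symmetric [n_atoms, n_atoms] matrix.
    """
    n = E.shape[0]
    pot = np.empty((n, n))
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sq = pairwise_sq_dists(E[start:stop], E)
        pot[start:stop] = np.exp(-sq / (2.0 * length_scale ** 2))
    return pot

rng = np.random.default_rng(0)
E = rng.standard_normal((10, 8))            # stand-in spectral coordinates
pot_chunked = chunked_bond_potential(E, chunk_size=3)
pot_full = chunked_bond_potential(E, chunk_size=10)
```

The output is still $n \times n$; what the chunking avoids is materialising the much larger pairwise feature grid with $2k$ channels per atom pair in one go.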

What I keep an eye on

A short list of failure modes and open questions, in roughly the order they tend to bite:

  • Projection loss vs. eigenvector loss balance. Too much weight on the eigenvector MSE and the model starts trying to fit a single basis representative, fighting the sign/rotation symmetry; too little and the columns drift away from orthonormality. The current default of projection_loss_weight = 1.0 is a working starting point but not obviously optimal across atom counts.
  • Cospectral mass. What fraction of the training set lies on cospectral collisions, and does the model degrade gracefully on those? Cospectrality is fundamental, but it’s worth knowing how often it actually bites.
  • MS isomer plateaus. The conditional distribution of structures given a spectrum has support on multiple isomers. The metric I want is calibrated coverage of that set: a top-$k$ similarity over reverse-diffusion samples (with $k$ here counting samples, not eigenvectors) is a better fit than a single-point accuracy.
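  • Padding pathologies. Mask conventions are checked at runtime, but those checks only catch gross failures such as empty masks; a subtler bug in mask construction (for example an inverted True/False convention) can still silently corrupt cross-attention. Padding tests stay near the top of the test suite for that reason.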
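
As one possible reading of the top-$k$ similarity idea from the list above, the sketch below scores a set of sampled candidates by the best Tanimoto similarity among the first $k$. The function names and the toy bit vectors are hypothetical, the fingerprints are plain binary arrays rather than real Morgan fingerprints, and the sketch covers only the top-$k$ part, not calibration.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return float(np.logical_and(a, b).sum() / union)

def top_k_max_similarity(sample_fps, true_fp, k):
    """Best similarity to the reference among the first k sampled candidates."""
    return max(tanimoto(fp, true_fp) for fp in sample_fps[:k])

# Toy example: one reference fingerprint and three sampled candidates.
true_fp = [1, 1, 0, 0]
sample_fps = [
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
scores = [top_k_max_similarity(sample_fps, true_fp, k) for k in (1, 2, 3)]
```

Reporting the curve over $k$ rather than a single number shows how many samples it takes before a near-match appears, which is closer to how a set of isomer hypotheses would actually be used.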

Where this is going

The next concrete milestones are honest evaluation on a public dataset and a real comparison between the transformer decoder and the SGNO kernel. The recently merged work integrates the MassSpecGym benchmark — dataset, eval harness, and SGNO training — so the next set of results will be on that, rather than on the synthetic padded batches in spectral_diffusion.py.

Spectral diffusion is not going to “solve” MS-to-structure on its own; the limits are partly information-theoretic and partly chemistry-shaped. What it can plausibly do is produce a calibrated set of structure hypotheses for a given spectrum, in a representation that downstream chemists and other models can both read.