Dossier · Technology · N 58.5° / E 19.0°

Inside the
detection
engine.

Model class

PU-XGBoost

Calibration

Elkan & Noto

Attribution

TreeSHAP

01 The problem

A vessel does not announce itself.

Most ships are not shadow ships.
Of the ones that are, none of them say so.

Detection is not pattern-matching on a single ship. Any one vessel can spend a quiet weekend at sea. Transponders fail. Crews sleep. Routes divert. Looked at in isolation, almost nothing about a hull is self-incriminating.

What is incriminating is the joint shape of a hundred small choices — which meridian a transponder goes dark at, how long it stays dark, what the deck looks like on either side of the gap, where the vessel surfaces, and whether anyone it likes to meet mid-sea has surfaced at a similar place at a similar time.

These patterns are too quiet for rules and too contingent for averages. They live in the joint distribution of dozens of features — exactly the kind of thing a learned model is built to recover.

02 Labels

We know which ships are guilty.

We do not know which ships are innocent.

A normal classifier wants two clean piles: positives and negatives. Cat photos and not-cat photos. Spam and not-spam.

Maritime intelligence does not deliver that. We have a curated list of vessels that have been confirmed to belong to the shadow fleet — by sanctions designations, by visual confirmation, by named-and-shamed reporting. We have an enormous pool of every other vessel afloat. We do not have a list of vessels that have been confirmed clean. Most of them are clean. Some of them are shadow vessels that no one has named yet.

This is the positive-unlabelled problem. The model trains on positive labels and a sea of unlabelled evidence — and must learn to score the unlabelled without pretending it is all negative.

Supervised — what we wish we had

Every vessel carries a verified label.

axes: feature₁ · feature₂

Positive-unlabelled — what we actually have

A few confirmed shadow vessels. Everything else is ambiguous.

axes: feature₁ · feature₂ · marked points: hidden positive

// Elkan & Noto, 2008

Widely cited in the positive-unlabelled learning literature, the Elkan & Noto trick estimates a single constant, the probability that a true positive in the wild has been labelled, and divides the model's output by it. It rests on one assumption: that the labelled positives are a random sample of all positives. The score we display has been through that correction.

// What this changes in practice

A vessel scoring 0.4 under a naive classifier might be a 0.65 under PU-corrected scoring. The ranking is essentially unchanged, because dividing by a positive constant preserves order; the absolute numbers, and with them the threshold an operator triages against, do change.
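
The correction can be sketched in a few lines. This is a minimal illustration of the Elkan & Noto estimator, not the production code: the function names and the held-out scores are hypothetical, and the raw score is assumed to come from a classifier trained to separate labelled from unlabelled vessels. With these made-up numbers the constant comes out near 0.62, so a raw 0.40 lands near 0.65.

```python
from statistics import mean


def estimate_label_frequency(scores_on_held_out_positives):
    """Estimate c = p(labelled | truly positive).

    Elkan & Noto's simplest estimator: the mean raw score that the
    labelled-vs-unlabelled classifier assigns to held-out labelled positives.
    """
    return mean(scores_on_held_out_positives)


def pu_correct(raw_score, c):
    """Divide the raw score by c, capped at 1.0 so it stays a valid probability."""
    return min(1.0, raw_score / c)


# Hypothetical raw scores for held-out vessels already confirmed as positives.
held_out_positive_scores = [0.70, 0.55, 0.60, 0.65, 0.58]
c = estimate_label_frequency(held_out_positive_scores)

for raw in (0.10, 0.40, 0.90):
    print(f"raw={raw:.2f}  corrected={pu_correct(raw, c):.2f}")
```

The cap at 1.0 is the only place where the ordering can collapse: every raw score above c maps to the same corrected value.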

03 The model

One tree is a guess. A forest of corrections is a model.

XGBoost, conceptually:
an argument that gets sharper with every tree.

XGBoost stands for extreme gradient boosting, which is more colourful than it sounds. Stripped of the engineering, it is built from two ideas: decision trees and boosting.

A decision tree is the simplest kind of classifier — a flow chart. At each fork it asks one question of one feature, sends the case down a branch, and at the leaf hands back a number. Is the AIS gap longer than 47 minutes? Yes → that way. Has the vessel changed flag twice in two years? No → this way. A single tree is a coarse instrument.

Boosting is the trick that turns coarse instruments into a precise one. You fit a tree to the data — it gets some answers right and some wrong — and then you fit a second tree, but only to the mistakes the first tree made. Then a third tree, only to what the first two still got wrong. Each new tree is small, biased, and specialised on the residue of all the trees before it. Add them up and the ensemble converges on something almost no single tree could express.

Gradient is just the mathematical recipe for "fit the next tree to the mistake": it picks the direction in which each new tree should pull the prediction. Extreme refers to the engineering: sparsity-aware splits, regularisation against overfitting, and a tree-construction algorithm that runs in parallel and handles out-of-core data gracefully. The conceptual story, though, is the cascade.
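
The cascade can be sketched in plain Python. This is a deliberately simplified illustration, not XGBoost itself: it uses one-split trees (stumps) and squared error, for which the negative gradient is exactly the residual. The toy AIS-gap data, the function names and the learning rate of 0.5 are all assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_trees=5, learning_rate=0.5):
    """Fit each new stump to what the ensemble so far still gets wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, losses


def predict(base, stumps, x, learning_rate=0.5):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data: AIS gap in minutes, and 1 if the vessel was later confirmed.
gap_minutes = [5, 12, 20, 35, 47, 62, 80, 120]
confirmed = [0, 0, 0, 0, 1, 1, 0, 1]

base, stumps, losses = boost(gap_minutes, confirmed)
for number, loss in enumerate(losses, start=1):
    print(f"after tree {number}: mean squared error = {loss:.4f}")
```

Each printed error is no larger than the one before it, which is the "corrections get smaller" behaviour in miniature. A real boosted model swaps squared error for a classification loss and grows deeper trees.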

fig.07 · gradient boosted trees · residual cascade
tree 01: +0.55, -0.15, +0.40, -0.05
tree 02: +0.22, +0.18, -0.10, +0.05
tree 03: +0.12, -0.05, +0.08, +0.00
tree 04: +0.05, +0.03, -0.02, +0.02
tree 05: +0.02, +0.01, -0.01, +0.00
calibrated score: 0.65 · p(shadow | x)
legend: decision split · leaf correction · residual to next tree

// Tree one

A blunt first cut. Decides on a single feature (say, the duration of an AIS silence) and produces a coarse first estimate.

// Tree N

Each subsequent tree is grown to fit only the residual: what the previous trees still got wrong. The corrections get smaller, the ensemble gets sharper.

// Convergence

Hundreds of small trees, summed and calibrated, produce a probability we can act on. No tree alone is wise; together they are.

Live model · 5 of ~340 trees shown · Calibrated p = 0.72 on this example · SHAP attribution per node

04 Why this model

A deliberate choice.

We chose a tree ensemble over a neural network.
Three reasons.

Reason 1

Heterogeneous features

AIS speed in knots. Encounter counts. Categorical flags. Hours-since-last-port. Tree ensembles handle messy, mixed, missing-value tabular data without scaling tricks or imputation rituals. Networks demand homogeneity we cannot honestly manufacture.

Reason 2

Tractable explanations

Every prediction can be split exactly into per-feature contributions via the TreeSHAP algorithm, in milliseconds, with the same fidelity as the model itself. There is no neural equivalent that survives an analyst's questions.

Reason 3

Modest data, sharp signal

We train on tens of thousands of frames, not tens of millions. In that regime gradient-boosted trees outperform networks on every public maritime-classification benchmark we have run, and on our private ones.

05 Calibration

From a score to a probability.

A model that says 0.74 had better mean it.

Raw model output is a number, not a probability. By default a boosted tree is over-confident — it pushes scores out to the edges where it has the least information. A "0.95" might really mean "the model is 70% certain"; a "0.20" might mean "the model genuinely doesn't know".

Calibration is the discipline of bending the model's output so that scores match empirical frequencies. If we look at a hundred vessels the model scored "0.80" and verify them, we should find that close to 80 of them are in fact shadow vessels. The Elkan-Noto calibration we use handles the additional twist of PU learning — that the unlabelled pile contains hidden positives.

The curve in fig.08 is the function we apply. The dashed line is what the model says raw. The solid line is what we display.

fig.08 · score → probability
x-axis: raw boosted score (log-odds), -4 to 4 · y-axis: probability of shadow, 0.00 to 1.00
curves: calibrated (Elkan-Noto) · raw boosted score
labelled points: 0.08, 0.43, 1.00, 1.00, 1.00
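
The "scores match empirical frequencies" test can be made concrete with a reliability table. The sketch below is an assumption-laden illustration: `reliability_table` is a hypothetical helper, and the scores and verified outcomes are invented. In a positive-unlabelled setting, getting verified outcomes at all is the hard part.

```python
def reliability_table(scores, outcomes, n_bins=5):
    """Bin scores into equal-width bins; compare mean score with observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for score, outcome in zip(scores, outcomes):
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((score, outcome))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than report a meaningless frequency
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(o for _, o in members) / len(members)
        rows.append((index / n_bins, (index + 1) / n_bins, len(members), mean_score, observed))
    return rows


# Invented example: displayed scores and whether verification confirmed the vessel.
scores = [0.81, 0.83, 0.85, 0.87, 0.89, 0.11, 0.13, 0.15, 0.17, 0.19]
verified = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

rows = reliability_table(scores, verified)
for low, high, count, mean_score, observed in rows:
    gap = abs(mean_score - observed)
    print(f"[{low:.1f}, {high:.1f})  n={count}  mean score={mean_score:.2f}  "
          f"observed={observed:.2f}  gap={gap:.2f}")
```

A well-calibrated model keeps the gap column small in every populated bin; a large gap in the top bin is the over-confidence described above.
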

06 Attribution

Every score, fully decomposable.

No prediction lands without showing
its receipts.

The complaint that haunts every machine-learning system in defence work is the same: but why? An analyst will not act on a number they cannot interrogate. So every score the model produces ships with a complete decomposition into per-feature contributions, via the TreeSHAP algorithm — exact, fast, and mathematically faithful to the model that produced the score.

Read the waterfall below as a sentence. We begin at the population base rate. Each row adds or subtracts a measured amount. The final number at the bottom is the displayed probability. Nothing else is in the picture; nothing in the picture is hidden.

fig.09 · SHAP decomposition · MMSI 273•••••• · window 04:00–06:00 UTC
scale: 0.00 to 1.00
base rate: 0.18
AIS silence (62 min): +0.18
Speed variance: +0.11
Course delta near choke: +0.09
Encounter w/ flagged vessel: +0.07
Flag history (5 changes): +0.05
Tonnage class: -0.03
Build year (1998): +0.02
Recent port (clean): -0.04
final score: 0.63
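
The "read it as a sentence" claim is an additivity property, and it can be checked by hand. The sketch below replays the numbers from fig.09; it does not compute SHAP values, it only confirms that the base rate plus the listed contributions reproduces the final score.

```python
# Values copied from fig.09; the variable names are illustrative.
base_rate = 0.18
contributions = [
    ("AIS silence (62 min)", +0.18),
    ("Speed variance", +0.11),
    ("Course delta near choke", +0.09),
    ("Encounter w/ flagged vessel", +0.07),
    ("Flag history (5 changes)", +0.05),
    ("Tonnage class", -0.03),
    ("Build year (1998)", +0.02),
    ("Recent port (clean)", -0.04),
]

running = base_rate
print(f"{'base rate':<30}         {running:.2f}")
for name, delta in contributions:
    running += delta  # each row moves the running total up or down
    print(f"{name:<30} {delta:+.2f} -> {running:.2f}")

final_score = running
print(f"{'final score':<30}         {final_score:.2f}")
```
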

07 Closed loop

The model is not finished.

Every confirmed score from an operator
comes back into the training set.

The console isn't a one-way pane. Operator dispositions — confirm, reject, defer — are written back into the ground-truth label store, with provenance. The model is retrained on the expanded set when we decide it's worth it — manually, from the CLI. Between runs the deployed predictor is frozen; nothing retrains in the background.

That means the system grows into the conditions of its specific deployment. The vocabulary of shadow behaviour in the Gulf of Finland is not identical to the vocabulary in the Kerch Strait — the model that watches each can specialise without retraining the others.
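
One way to picture the write-back is an append-only store where every disposition carries its provenance. Everything in this sketch is hypothetical: the class names, the fields, and the choice to expose only confirmed vessels as new positives. How rejected and deferred vessels are used in training is not specified here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Disposition(Enum):
    CONFIRM = "confirm"
    REJECT = "reject"
    DEFER = "defer"


@dataclass(frozen=True)
class LabelRecord:
    """One operator decision, with enough provenance to audit it later."""
    vessel_id: str
    disposition: Disposition
    operator: str
    model_version: str
    score_at_review: float
    recorded_at: datetime


@dataclass
class LabelStore:
    """Append-only: records are added, never edited or removed."""
    records: list = field(default_factory=list)

    def append(self, record: LabelRecord) -> None:
        self.records.append(record)

    def confirmed_positives(self) -> list:
        """Vessel ids an operator has confirmed, for the next manual retraining run."""
        return sorted({r.vessel_id for r in self.records
                       if r.disposition is Disposition.CONFIRM})


store = LabelStore()
now = datetime.now(timezone.utc)
store.append(LabelRecord("vessel-A", Disposition.CONFIRM, "op-1", "model-v1", 0.63, now))
store.append(LabelRecord("vessel-B", Disposition.REJECT, "op-1", "model-v1", 0.41, now))
store.append(LabelRecord("vessel-C", Disposition.DEFER, "op-2", "model-v1", 0.52, now))

print(store.confirmed_positives())
```

Keeping the model version and the score at review time on each record makes it possible to tell later which predictor an operator was reacting to.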

fig.10 · closed-loop calibration
model scores: p(shadow | x)
triage feed: ranked by p, gated by SHAP
operator review: confirm / reject / unsure
label store: appended to training set
the model keeps learning: recalibrates every operator week

A scored vessel is the
start of the work.