Dossier · Technology · N 58.5° / E 19.0° · Conceptual brief

Inside the detection engine.

Open the console

Model class: PU-XGBoost

Calibration: Elkan & Noto

Attribution: TreeSHAP

01 The problem

A vessel does not announce itself.

Most ships are not shadow ships.
Of the ones that are, none of them say so.

Detection is not pattern-matching on a single ship. Any one vessel can spend a quiet weekend at sea. Transponders fail. Crews sleep. Routes divert. Looked at in isolation, almost nothing about a hull is self-incriminating.

What is incriminating is the joint shape of a hundred small choices — which meridian a transponder goes dark at, how long it stays dark, what the deck looks like on either side of the gap, where the vessel surfaces, and whether anyone it likes to meet mid-sea has surfaced at a similar place at a similar time.

These patterns are too quiet for rules and too contingent for averages. They live in the joint distribution of dozens of features — exactly the kind of thing a learned model is built to recover.

02 Labels

We know which ships are guilty.

We do not know which ships are innocent.

A normal classifier wants two clean piles: positives and negatives. Cat photos and not-cat photos. Spam and not-spam.

Maritime intelligence does not deliver that. We have a curated list of vessels that have been confirmed to belong to the shadow fleet — by sanctions designations, by visual confirmation, by named-and-shamed reporting. We have an enormous pool of every other vessel afloat. We do not have a list of vessels that have been confirmed clean. Most of them are clean. Some of them are shadow vessels that no one has named yet.

This is the positive-unlabelled problem. The model trains on positive labels and a sea of unlabelled evidence — and must learn to score the unlabelled without pretending it is all negative.

Supervised — what we wish we had

Every vessel carries a verified label.

Positive-unlabelled — what we actually have

A few confirmed shadow vessels. Everything else is ambiguous.

// Elkan & Noto, 2008

A standard reference in positive-unlabelled learning, the Elkan & Noto trick estimates a single constant, the probability that a true positive in the wild has been labelled, and divides the model's output by it. The score we display has been through that correction.
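A minimal sketch of that correction, assuming the constant is estimated as the average raw score on a held-out set of confirmed vessels (one of the estimators Elkan & Noto propose). The function names and all numbers are hypothetical, not taken from the deployed system.

```python
import numpy as np


def estimate_label_frequency(scores_on_known_positives):
    """Estimate c = P(labelled | truly positive).

    Assumption: c is the mean raw score that the labelled-vs-unlabelled
    classifier assigns to held-out confirmed positives.
    """
    scores = np.asarray(scores_on_known_positives, dtype=float)
    if scores.size == 0:
        raise ValueError("need at least one held-out confirmed positive")
    return float(scores.mean())


def pu_correct(raw_scores, c):
    """Divide raw scores by c, capping at 1 so the result stays a probability."""
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    return np.clip(np.asarray(raw_scores, dtype=float) / c, 0.0, 1.0)


# Hypothetical raw scores on held-out confirmed vessels.
holdout_positive_scores = [0.70, 0.55, 0.60, 0.65, 0.50]
c = estimate_label_frequency(holdout_positive_scores)

# Hypothetical raw scores on unlabelled vessels, corrected for display.
corrected = pu_correct([0.12, 0.30, 0.45, 0.90], c)
```

The cap at 1 is a design choice of this sketch: a noisy estimate of the constant can push a corrected score above 1, and clipping keeps the displayed value interpretable.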

// What this changes in practice

A vessel scoring 0.4 under a naive classifier might be a 0.65 under PU-corrected scoring. Dividing by a constant leaves the order of the ranking intact; the absolute numbers, which set the threshold an operator triages against, are what change.
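The arithmetic behind that example can be written out under the Elkan & Noto assumption that confirmed vessels are a random sample of all true shadow vessels. The symbols are ours: s = 1 means a vessel is on the confirmed list and y = 1 means it truly is a shadow vessel. The value 0.615 is only what the illustrative 0.4 and 0.65 would imply, not a measured constant.

```latex
\[
p(y = 1 \mid x) \;=\; \frac{p(s = 1 \mid x)}{c},
\qquad c = p(s = 1 \mid y = 1)
\]
\[
0.65 \approx \frac{0.40}{c}
\;\Longrightarrow\;
c \approx \frac{0.40}{0.65} \approx 0.615
\]
```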

03 The model

One tree is a guess. A forest of corrections is a model.

XGBoost, conceptually:
an argument that gets sharper with every tree.

XGBoost stands for extreme gradient boosting, which is more colourful than it sounds. Stripped of the engineering, it is built from two ideas: decision trees and boosting.

A decision tree is the simplest kind of classifier — a flow chart. At each fork it asks one question of one feature, sends the case down a branch, and at the leaf hands back a number. Is the AIS gap longer than 47 minutes? Yes → that way. Has the vessel changed flag twice in two years? No → this way. A single tree is a coarse instrument.
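The flow chart can be written down directly. This is a hand-built toy, not a learned tree: the two questions echo the examples above, while the function name and the leaf values are invented for illustration.

```python
def toy_tree(ais_gap_minutes, flag_changes_last_2y):
    """One tree: each fork asks one question of one feature, each leaf returns a number.

    The leaf values below are made up; a trained tree would learn them from data.
    """
    if ais_gap_minutes > 47:               # Is the AIS gap longer than 47 minutes?
        if flag_changes_last_2y >= 2:      # Has the vessel changed flag twice in two years?
            return 0.8
        return 0.4
    if flag_changes_last_2y >= 2:
        return 0.3
    return 0.05


long_gap_reflagged = toy_tree(ais_gap_minutes=120, flag_changes_last_2y=2)
quiet_vessel = toy_tree(ais_gap_minutes=10, flag_changes_last_2y=0)
```

Four leaves can only ever produce four distinct scores, which is why a single tree is a coarse instrument.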

Boosting is the trick that turns coarse instruments into a precise one. You fit a tree to the data — it gets some answers right and some wrong — and then you fit a second tree, but only to the mistakes the first tree made. Then a third tree, only to what the first two still got wrong. Each new tree is small, biased, and specialised on the residue of all the trees before it. Add them up and the ensemble converges on something almost no single tree could express.

Gradient is just the mathematical recipe for "fit the next tree to the mistake" — it picks the direction in which each new tree should pull the prediction. Extreme refers to the engineering: sparsity-aware splits, regularisation against overfit, and a tree-construction algorithm that runs in parallel and treats out-of-core data gracefully. The conceptual story, though, is the cascade.
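The cascade can be sketched in a few lines. This is a simplified illustration, not XGBoost itself: it uses one-split trees (stumps), a squared-error loss whose negative gradient is just the residual, and no regularisation. The data and all names are hypothetical.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the one threshold on x that best explains the residual (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:    # skip the max so both sides are non-empty
        left = x <= threshold
        left_value = residual[left].mean()
        right_value = residual[~left].mean()
        fitted = np.where(left, left_value, right_value)
        error = ((residual - fitted) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = float(y.mean())
    prediction = np.full(y.shape, base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - prediction          # negative gradient of squared error
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Hypothetical toy data: AIS gap in minutes and a 0/1 label.
gap_minutes = np.array([5, 20, 40, 60, 90, 180, 300, 600], dtype=float)
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

base, stumps, fitted = boost(gap_minutes, labels)
initial_error = float(np.mean((labels - base) ** 2))
final_error = float(np.mean((labels - fitted) ** 2))
```

The learning rate keeps each correction small, so no single stump dominates the sum; the training error falls as stumps accumulate.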

// Tree one

A blunt first cut. Decides on a single feature, say the duration of an AIS silence, and produces a coarse first estimate.

// Tree N

Each subsequent tree is grown to fit only the residual: what the previous trees still got wrong. The corrections get smaller, the ensemble gets sharper.

// Convergence

Hundreds of small trees, summed and calibrated, produce a probability we can act on. No tree alone is wise; together they are.

Live model · 5 of ~340 trees shown · Calibrated p = 0.72 on this example · SHAP attribution per node

04 Why this model

A deliberate choice.

We chose a tree ensemble over a neural network.
Three reasons.

Reason 1

Heterogeneous features

AIS speed in knots. Encounter counts. Categorical flags. Hours-since-last-port. Tree ensembles handle messy, mixed, missing-value tabular data without scaling tricks or imputation rituals. Networks demand homogeneity we cannot honestly manufacture.

Reason 2

Tractable explanations

Every prediction can be split exactly into per-feature contributions via the TreeSHAP algorithm, in milliseconds, with the same fidelity as the model itself. There is no neural equivalent that survives an analyst's questions.

Reason 3

Modest data, sharp signal

We train on tens of thousands of frames, not tens of millions. In that regime gradient-boosted trees outperform networks on every public maritime-classification benchmark we have run, and on our private ones.

05 Calibration

From a score to a probability.

A model that says 0.74 had better mean it.

Raw model output is a number, not a probability. By default a boosted tree is over-confident — it pushes scores out to the edges where it has the least information. A "0.95" might really mean "the model is 70% certain"; a "0.20" might mean "the model genuinely doesn't know".

Calibration is the discipline of bending the model's output so that scores match empirical frequencies. If we look at a hundred vessels the model scored "0.80" and verify them, we should find that close to 80 of them are in fact shadow vessels. The Elkan-Noto calibration we use handles the additional twist of PU learning — that the unlabelled pile contains hidden positives.
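The "hundred vessels scored 0.80" check generalises to a reliability table: bin the scores, then compare the average score in each bin with the fraction of verified shadow vessels in it. The sketch below assumes a set of vessels whose status has been verified after scoring; the scores, outcomes, and function name are hypothetical.

```python
import numpy as np


def reliability_table(scores, outcomes, n_bins=5):
    """Return (bin_low, bin_high, count, mean_score, observed_rate) per non-empty bin."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitising against the interior edges yields bin indices 0 .. n_bins - 1.
    bin_index = np.digitize(scores, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            rows.append((
                float(edges[b]),
                float(edges[b + 1]),
                int(in_bin.sum()),
                float(scores[in_bin].mean()),
                float(outcomes[in_bin].mean()),
            ))
    return rows


# Hypothetical displayed scores and later-verified outcomes (1 = shadow vessel).
scores = [0.10, 0.15, 0.50, 0.55, 0.85, 0.90, 0.95, 0.82, 0.88]
outcomes = [0, 0, 1, 0, 1, 1, 1, 0, 1]

rows = reliability_table(scores, outcomes)
```

A well-calibrated model gives rows in which the mean score and the observed rate sit close together; a large gap in any bin shows where the calibration curve still needs bending.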

The curve to the right is the function we apply. The dashed line is what the model says raw. The solid line is what we display.

06 Attribution

Every score, fully decomposable.

No prediction lands without showing
its receipts.

The complaint that haunts every machine-learning system in defence work is the same: but why? An analyst will not act on a number they cannot interrogate. So every score the model produces ships with a complete decomposition into per-feature contributions, via the TreeSHAP algorithm — exact, fast, and mathematically faithful to the model that produced the score.

Read the waterfall below as a sentence. We begin at the population base rate. Each row adds or subtracts a measured amount. The final number at the bottom is the displayed probability. Nothing else is in the picture; nothing in the picture is hidden.
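The waterfall's bookkeeping can be checked on a toy example. The sketch below is not TreeSHAP: it computes Shapley values by brute force over every feature subset, which is only feasible for a handful of features, whereas TreeSHAP exploits tree structure to get contributions of this kind quickly. The stand-in model, feature names, and background vessels are all hypothetical.

```python
from itertools import combinations
from math import factorial


def model(gap_minutes, flag_changes, encounters):
    """Hypothetical stand-in scorer with one interaction between features."""
    score = 0.05
    if gap_minutes > 47:
        score += 0.30
        if encounters >= 1:
            score += 0.25
    if flag_changes >= 2:
        score += 0.15
    return score


def coalition_value(known, vessel, background):
    """Average output when only the features in `known` take the vessel's values."""
    total = 0.0
    for row in background:
        mixed = [vessel[i] if i in known else row[i] for i in range(len(vessel))]
        total += model(*mixed)
    return total / len(background)


def shapley_values(vessel, background):
    """Exact Shapley value of each feature, by enumerating all coalitions."""
    n = len(vessel)
    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                known = set(subset)
                gain = (coalition_value(known | {i}, vessel, background)
                        - coalition_value(known, vessel, background))
                phi += weight * gain
        contributions.append(phi)
    return contributions


background = [(10, 0, 0), (30, 1, 0), (90, 0, 0), (15, 0, 1)]  # reference vessels
vessel = (400, 2, 3)                                           # vessel being explained

base_rate = coalition_value(set(), vessel, background)  # top of the waterfall
contributions = shapley_values(vessel, background)      # one row per feature
final = base_rate + sum(contributions)                  # bottom of the waterfall
```

The base rate plus the per-feature rows reproduces the model's score for the vessel exactly, which is the property that lets the waterfall be read as a complete sentence.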

07 Closed loop

The model is not finished.

Every confirmed score from an operator
comes back into the training set.

The console isn't a one-way pane. Operator dispositions — confirm, reject, defer — are written back into the ground-truth label store, with provenance. The model is retrained on the expanded set when we decide it's worth it — manually, from the CLI. Between runs the deployed predictor is frozen; nothing retrains in the background.
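One way to picture the write-back is an append-only record per disposition, each carrying its provenance. This is a sketch under our own assumptions, not the console's actual schema: the class names, the fields, and the rule that only a vessel's latest disposition counts are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Disposition(Enum):
    CONFIRM = "confirm"
    REJECT = "reject"
    DEFER = "defer"


@dataclass(frozen=True)
class LabelRecord:
    vessel_id: str
    disposition: Disposition
    operator: str          # provenance: who made the call
    model_version: str     # provenance: which frozen predictor produced the score
    recorded_at: datetime  # provenance: when the call was made


@dataclass
class LabelStore:
    records: list = field(default_factory=list)

    def write_back(self, vessel_id, disposition, operator, model_version):
        """Append a disposition; earlier records are never overwritten."""
        record = LabelRecord(vessel_id, disposition, operator, model_version,
                             datetime.now(timezone.utc))
        self.records.append(record)
        return record

    def confirmed_positives(self):
        """Vessels whose most recent disposition is a confirmation.

        Assumption of this sketch: a later reject or defer supersedes an
        earlier confirm for the purpose of building the next training set.
        """
        latest = {}
        for record in self.records:    # records are appended in time order
            latest[record.vessel_id] = record.disposition
        return {v for v, d in latest.items() if d is Disposition.CONFIRM}


store = LabelStore()
store.write_back("vessel-A", Disposition.CONFIRM, "analyst-1", "v1")
store.write_back("vessel-B", Disposition.CONFIRM, "analyst-1", "v1")
store.write_back("vessel-B", Disposition.REJECT, "analyst-2", "v1")
positives_for_next_run = store.confirmed_positives()
```

Nothing in this sketch triggers training; a manual retraining run would read the store when an operator decides to launch it.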

That means the system grows into the conditions of its specific deployment. The vocabulary of shadow behaviour in the Gulf of Finland is not identical to the vocabulary in the Kerch Strait — the model that watches each can specialise without retraining the others.

A scored vessel is the
start of the work.
