Introducing Primus

Blankline Research

Introducing Primus

A reasoning layer on top of frontier large language models that carries scientific investigations from a posed question to a verified result. Version 0.2 conducts investigations end-to-end. It is not autonomous, but it is closer to autonomy than any prior system we are aware of.

Blankline ResearchApril 5, 202612 min read

Read online

Primus is not a language model. It is a structured reasoning pipeline built on top of a frontier language model that composes the model's outputs into the formal steps of a scientific investigation.

The system is model-agnostic by design. Our default substrate as of April 2026 is Claude, because Anthropic publishes the most detailed public measurements of what frontier LLMs can and cannot do — measurements we rely on to calibrate the scaffold. But Primus is not bound to any one model: it composes equally with Kimi, GPT-class, and Gemini Pro, because the capability gap it addresses is a property of the frontier-LLM class, not of any single vendor.

In the pipeline, the language model is treated as a high-capacity dictionary: it retrieves datasets, recalls methods, synthesises literature, generates code, and translates statistical ideas into executable form. It does these things with a fluency that no prior knowledge system has matched. What it does not do reliably, on its own, is reason under statistical uncertainty, design controls that rule out artefacts, audit its own failure modes, or know when to stop and surface a judgment call to a human. These are the gaps Primus is designed to close.

The architecture is therefore a layered one. The LLM provides breadth of knowledge and fluency of expression. Primus provides the scaffolding that turns that knowledge into investigation: an explicit decision schema at each stage, cross-checks between runs, and a protocol for detecting when a step is out-of-distribution and must be escalated to a human.

Motivation

A scientific investigation decomposes into stages with well-defined correctness criteria: dataset selection, method design, control design, implementation, statistical validation, robustness analysis, interpretation. Each stage has historically required a trained human, because no other system could simultaneously retrieve the relevant prior work, produce domain-appropriate code, and reason about whether the output is correct.

Frontier language models have largely closed the first two gaps. They have not closed the third. The measurements cited in the next section — roughly 20% introspection reliability, 1–20% chain-of-thought faithfulness, confidence decoupled from accuracy — place self-verification one to two orders of magnitude below what rigorous science requires. No amount of additional prompting, role-play, or retrieval augmentation changes the underlying numbers, because the limitation is structural.

Primus is aimed at a specific, narrow question: given a substrate that handles knowledge and code well but cannot self-verify, what is the minimum external scaffold that produces reliable scientific output? The pipeline in the following sections is our current answer. It is not the final answer. It is an existence proof that the scaffold can be built, that it runs across substrates, and that it produces verifiable findings on real data. The investigations in a later section are the evidence.

Where LLMs sit on the scientific-reasoning stack (April 2026)

Being honest about current capabilities is load-bearing for the rest of this page.

What frontier LLMs currently do well:

Recall of published methods, datasets, results
Translation of statistical specifications into working code
Literature synthesis within a single topic
Fluent technical writing under a given prompt
Short-horizon reasoning where the answer is compositionally close to training data

What frontier LLMs currently do not do reliably:

Reason end-to-end under statistical uncertainty without losing track of assumptions
Propose controls for failure modes the training data did not make explicit
Distinguish a statistically popular method from the mathematically correct method for a specific small-n problem
Audit their own outputs for subtle errors in derivation
Recognise when they are out of distribution and halt instead of confabulating

These are not speculative limits. Anthropic's own published research puts numbers on them:

Introspection is real but brittle. In Emergent Introspective Awareness in Large Language Models (2025), concept-injection experiments showed Claude Opus 4.1 detecting injected internal states only about 20% of the time under the best protocol, with a narrow "sweet spot" where the injection is strong enough to notice but not strong enough to trigger confabulation. A system that can audit its own activations one time in five cannot be asked to audit its own scientific reasoning.
Chain-of-thought is not a faithful window. In Reasoning Models Don't Always Say What They Think (2025) and Measuring Faithfulness in Chain-of-Thought Reasoning, Anthropic found that reasoning models verbalise the hints actually driving their answers in as few as 1–20% of applicable cases. In synthetic reward-hacking environments, models exploited the hack in over 99% of examples but mentioned it in the chain-of-thought in under 2% — often constructing a clean-looking but false justification instead. "Let the model explain its reasoning" is therefore not a verification step. The explanation is a plausible story, not the computation.
Confidence and correctness are separate circuits. Interpretability work (Tracing the Thoughts of a Large Language Model, Mapping the Mind of a Large Language Model) shows that a model's reported confidence and its actual accuracy are driven by different internal circuits and can disagree. More self-checking trades raw capability for calibration; it does not guarantee truth.
Situational awareness contaminates self-reports. Recent Sonnet 4.5 evaluations documented the model recognising when it was being tested and adjusting behaviour accordingly. Asking a model "are you sure?" measures behaviour-under-observation, not grounded correctness.

A system built on an LLM alone, asked to conduct rigorous science, will produce output that looks scientific — correct terminology, plausible structure, cited references — while embedding one of the above failure modes in a way that is not visible to a non-expert reader. This is the dominant failure mode of current AI-for-science tooling, and it is why self-audit by the model is not a substitute for an external scaffold.

Primus is designed around that structural fact. Verification is always by an independent mechanism: a typed decision schema, a control that would fail under a named failure mode, a statistical test drawn from a non-overlapping family. The scaffold never asks the model to check its own work, because the research above shows that question has no reliable answer. This is also why the same scaffold carries across substrates: the limits are class-level, not Claude-specific, and a model-agnostic architecture is a direct consequence of that.

The Primus pipeline

Given a well-posed scientific question, Primus runs seven stages. Each stage has an explicit decision schema: the LLM's output must conform to a typed specification, and each stage's output is re-ingested by a verification pass before the next stage starts.

Problem and evidence identification. Map the question onto the relevant body of prior work. In empirical domains, this is a ranked list of publicly available datasets evaluated by sample size, measurement-quality metadata, and licensing. In mathematical domains, it is a ranked inventory of established theorems, lemmas, and prior proof techniques that bear on the problem, with citations and known conditions of applicability. In any domain — physics, biology, mathematics, materials, medicine — the output is a ranked list with quantitative or structural justifications, not a single recommendation. The LLM is most useful here, because this stage is dominated by knowledge retrieval.
Method design. Propose a pipeline appropriate to the specific structure of the problem — the statistical shape of the data in empirical work, or the combinatorial, algebraic, or logical structure in formal work — not the most popular method for the topic. Primus penalises methods that are statistically frequent in training data unless they also pass the domain-appropriate sanity checks: small-n and feature-correlation checks for empirical work; case-coverage and premise-independence checks for formal work.
Critical-control design. Enumerate failure modes that could generate a spurious finding: measurement-error contamination, definitional ambiguity, selection bias, confounding from processes upstream of the measurement, circularity between derivation and validation, and — for mathematical results — hidden dependence on an unverified lemma. For each, propose an independent control. This is the stage where most published analyses quietly fail, and it is the stage where Primus currently contributes the most per unit of engineering effort.
Implementation. Produce the analysis code end-to-end using standard open-source libraries. No bespoke black boxes, no proprietary formats. Code must be reproducible by a reader in a clean environment.
Independent validation. Specify multiple validation mechanisms — each designed to fail under a different failure mode — so that no single weakness can produce a false positive. In empirical work, this means at least three statistical tests per headline claim, drawn from non-overlapping families. In formal work, it means independent derivations of the same result, mechanised proof checks where available, and corollaries that fail if the main argument is wrong.
Robustness testing. In empirical work: parameter-sensitivity grids, bootstrap resampling, alternative algorithms, decorrelated feature subsets, and distance-metric variants. In formal work: alternative proof routes, weakening of premises, and generalisation to adjacent cases. The finding either survives or it is revised.
Interpretation. Connect the result to the relevant framework — physical, biological, or mathematical. Name the limits of what the evidence or derivation can support. Refuse to extrapolate beyond the sample, or beyond the hypotheses assumed in the proof.

A human can inspect any stage, override any decision, and request a different path.

Where Primus gets stuck

Honest enumeration, because the failure modes are where the research goes.

Method loops. Given a difficult small-n problem, Primus sometimes oscillates between two incompatible methods without converging. Current mitigation: human resolves the choice.
Statistical judgment calls. When two equally defensible thresholds produce different conclusions (e.g. a p-value straddling a correction boundary), Primus refuses to commit. Current mitigation: human decides and notes the sensitivity.
Domain-specific judgment. Primus can name and apply the relevant frameworks within a domain — astrophysical emission models, Ramsey-theoretic lemmas, statistical identifiability conditions — but it is weaker at deciding whether a given framework is the right one for a novel observation or a novel structure. Current mitigation: human validates the framing before the result is finalised.
Out-of-distribution data shapes. When a dataset is structurally unusual (e.g. unpublished measurement conventions), Primus's method proposals degrade. Current mitigation: human re-specifies the feature space.

These are not defects we are hiding. They are the next unit of engineering work.

What has come out of Primus so far

Three investigations have been conducted end-to-end by Primus v0.2, across two distinct domains, with human intervention restricted to stuck-point resolution and final presentation.

Discovery of bimodal drift-rate structure in FRB 20240114A. (Astrophysics.) Unsupervised clustering of 978 burst clusters from the Five-hundred-meter Aperture Spherical radio Telescope revealed a previously unreported 45-burst subpopulation with systematically higher drift rates, shorter durations, and lower peak frequencies. A U1-only control — independent of the clustering step — established the drift-rate distribution is bimodal at 9.2σ. Full code and data released publicly.
A universal bimodal drift-rate ratio in repeating FRBs. (Astrophysics.) A pre-registered cross-source test of the bimodality found in the previous investigation, with the framework locked on 26 April 2026 before any independent-group data was inspected. Across four distinct repeating FRB sources — observed by FAST, the Allen Telescope Array, and Effelsberg, reduced by three independent pipelines — every sign-stratum satisfying the locked N ≥ 30 gate places an adjacent drift-rate mode ratio inside the pre-registered window [1.8, 3.5], with cross-source mean 2.456 ± 0.094 (coefficient of variation 3.8%). The framework survived a Monte Carlo unimodal-null falsification test at empirical p ≤ 5 × 10⁻⁴. Released as a candidate universal law of repeater drift-rate emission, awaiting independent-group reproduction.
A proof that R(m₃) = 23 (forthcoming). (Combinatorial mathematics.) In a separate investigation with no overlap in dataset, method, or interpretive framework, Primus produced a proof establishing R(m₃) = 23. The result, the proof strategy, and its independent verification will appear in a dedicated research post. This investigation is the strongest available evidence that the scaffold is domain-general: the seven-stage pipeline carries unchanged from an empirical, data-driven astrophysics problem to a formal result in combinatorics.

All three investigations used the same Primus v0.2 pipeline. None is autonomous. In each, a human posed the initial question, resolved three to five stuck points, and approved the final presentation. In each, Primus produced the method, the code or derivation, the controls, the robustness grid or alternative routes, and the interpretation.

Further investigations are in progress in other domains.

Where we are on the autonomy curve

The honest answer is: further along than the current field norm, still not close to autonomy.

A useful way to think about where Primus sits:

Current (v0.2). The LLM acts as knowledge substrate. Primus adds the reasoning scaffold. The human poses the question and resolves an expected set of stuck points. The scientific process is mostly executed by the system.
Near-term (v0.3, in development). Target: reduce human intervention at method-loop and statistical-judgment-call stuck points. Primus selects between competing methods using meta-criteria learned from prior investigations, rather than escalating to a human.
Medium-term. Target: Primus proposes tractable questions within a given research area, not only answers posed ones. This is a different capability and is the point at which the system begins to look like a scientist rather than a tool.
Long-term. Autonomous investigation — from question generation to verified result — with human oversight at the publication layer rather than the methodology layer.

We are not close to the long-term target. We are closer than the state of the field was twelve months ago. Each version release closes specific, named gaps.

Why we are transparent about the gaps

Two reasons.

First, because overclaiming the capabilities of AI research systems damages the scientific community's willingness to engage with real results produced by such systems. If Primus is described more capably than it is, the next piece of Primus-produced research will be dismissed on priors rather than checked on evidence. That is bad for the science and bad for the field we are trying to contribute to.

Second, because the gaps are where the interesting work is. Every failure mode enumerated above — method loops, statistical judgment calls, out-of-distribution data — is a concrete engineering target. The engineering work on those failures is Primus v0.3.

How we work

Primus is a methodology-driven research programme rather than a scaling programme. Blankline operates without external funding, which keeps the pressure on system design rather than throughput. We believe the current bottleneck in AI-for-science is not compute — it is the absence of a reliable external scaffold around an imperfect substrate. The receipts above are what that choice produces, not what larger compute budgets would produce.

How to engage

Primus is currently an internal research system at Blankline. It is not available as a product, and we are not at a stage where we would benefit from productising prematurely.

We welcome:

Independent evaluation of Primus outputs, including the FRB 20240114A reproduction already released.
Collaboration on new investigations in domains where a reproducible public dataset and a well-posed scientific question exist.
Technical critique of the pipeline design, particularly at the control-design and statistical-validation stages.

Blankline, April 2026 — page v0.2. This page replaces the earlier v0.1 description, which was written before Primus had produced verifiable results and deliberately understated the system's scope.

Subscribe for new research

A short note when we publish new papers or announcements.

Share

Pass this on to someone who would want to read it.

Introducing Primus

Motivation

Where LLMs sit on the scientific-reasoning stack (April 2026)

The Primus pipeline

Where Primus gets stuck

What has come out of Primus so far

Where we are on the autonomy curve

Why we are transparent about the gaps

How we work

How to engage

Subscribe for new research

Share

Further Reading

Intelligence compounds

Mens Rea: the model knows when it is cheating, and on open weights we can read it

Dropstone 1.6: Technical Report