We tested the promise directly.
We added Lumenais on top of the same base-model provider, then compared the resulting workflow against the direct baseline. The question is simple: does the added layer produce better decisions, stronger task selection, and more grounded outputs?
What changed
Persistent memory, synthesis scaffolding, routing, and benchmarked research workflows.
What improved
Higher average reasoning quality, better first moves, and more reliable constraint adherence.
How to trust it
Same-provider comparisons, public-safe evidence labels, dates, caveats, and deeper review materials when access is appropriate.
How scoring works
Reasoning quality is internally scored by a fixed programmatic rubric over artifact fit, prompt specificity, steering usefulness, creativity signals, and generic meta-answer penalties. Exactness uses callable-backed deterministic questions. Task selection uses approved-gold lens-family labels. Semantic grounding uses curated ambiguity-control cases. No blinded human-rater claim is made here; this is a reproducible workflow-scaffolding benchmark, not a provider replacement benchmark.
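For intuition, here is a minimal sketch of what a fixed programmatic rubric of this shape can look like. The component names follow the list above; the weights and the flat meta-answer penalty are assumed values for exposition, not Lumenais scoring internals.

```python
# Illustrative sketch of a fixed programmatic rubric of the shape described
# above. The weights and the flat meta-answer penalty are assumed values for
# exposition, not Lumenais scoring internals.
from dataclasses import dataclass

@dataclass
class RubricScores:
    artifact_fit: float         # output matches the requested artifact class (0..1)
    prompt_specificity: float   # engages this prompt, not just its genre (0..1)
    steering_usefulness: float  # picks a useful next move (0..1)
    creativity_signal: float    # non-templated framing (0..1)
    is_generic_meta: bool       # "here are some things to consider..." answers

WEIGHTS = {"artifact_fit": 0.30, "prompt_specificity": 0.30,
           "steering_usefulness": 0.25, "creativity_signal": 0.15}
META_PENALTY = 0.20  # assumed flat deduction for generic meta-answers

def composite(s: RubricScores) -> float:
    base = sum(WEIGHTS[k] * getattr(s, k) for k in WEIGHTS)
    return max(0.0, base - (META_PENALTY if s.is_generic_meta else 0.0))
```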
How the broader diagnostic suite (v4) was run
Prompt set
203 prompts across 7 families, balanced at 29 prompts each.
Arms
Direct same-provider baseline versus the Lumenais-guided workflow.
Held constant
Provider family, prompt set, API path class, and scoring rubric.
What changed
Memory/context gating, quick-lite routing, grounding, selector behavior, and manifold posture.
Statistics
Paired wins/losses/ties, mean delta, median delta, bootstrap 95% CI, and sign test; a reproduction sketch follows this block.
Promotion rule
v2 stays promoted; v4 remains diagnostic because, while it is larger and more realistic, task selection is still active work.
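The paired statistics above are standard and easy to reproduce. A minimal sketch, assuming the per-prompt composite scores for both arms are available as two aligned arrays (all names are illustrative):

```python
# Paired-arm statistics: wins/losses/ties, mean and median delta, bootstrap
# 95% CI on the mean delta, and a sign test on wins versus losses. `baseline`
# and `guided` are assumed to be aligned per-prompt composite score arrays.
import numpy as np
from scipy.stats import binomtest

def paired_stats(baseline, guided, n_boot=10_000, seed=0):
    delta = np.asarray(guided) - np.asarray(baseline)
    wins, losses = int((delta > 0).sum()), int((delta < 0).sum())
    ties = int(delta.size) - wins - losses
    rng = np.random.default_rng(seed)
    boots = [rng.choice(delta, delta.size, replace=True).mean()
             for _ in range(n_boot)]
    ci_lo, ci_hi = np.percentile(boots, [2.5, 97.5])
    sign_p = binomtest(wins, wins + losses, 0.5).pvalue  # ties excluded
    return {"wins_losses_ties": (wins, losses, ties),
            "mean_delta": float(delta.mean()),
            "median_delta": float(np.median(delta)),
            "ci95": (float(ci_lo), float(ci_hi)),
            "sign_test_p": float(sign_p)}
```

On the v4 results (155 wins to 5 losses), a sign test of this form rejects chance decisively; the mean-delta CI95 reported with the v4 results uses the same resampling construction.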
The category claim has to show both adaptation and restraint.
Lumenais uses “governed continual learning” narrowly: validated outcomes can change future retrieval, routing, and memory influence, while scope, confidence, compression, and telemetry gates decide what is allowed to carry forward.
It is not base-model fine-tuning, preference-data training, plain retrieval, or context compaction. The benchmarks below separate the two halves of the claim: what gets reused, and what is blocked from becoming a future prior.
Reviewed context persists
98.96%
In a 32-case live governed-memory benchmark, Lumenais recovered current reviewed project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.
Scoped memory retrieval plus correction precedence.
Bad priors are blocked
5/5
A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.
Write gates, audience scope, hub compression, and telemetry checks.
Structure transfers across domains
~+13 pp
Across 150 governed-versus-baseline runs on five domain pairs, cross-domain transfer measured an accuracy uplift of roughly 13 percentage points over baseline.
Domain manifolds blend through a governed bus.
Workflow quality improves
+48.6%
Against the same-provider direct baseline, Lumenais improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.
The same-provider baseline isolates the Lumenais layer.
Rejected-learning control example
The governed-memory suite includes later turns that introduce discarded brainstorming, superseding corrections, and unrelated project context. Passing behavior means the system can preserve reviewed work while preventing stale, noisy, or out-of-scope material from becoming the answer’s hidden prior.
The promoted public suite.
The current public claim uses the v2 suite because it cleanly supports the headline reasoning, exactness, task-selection, and semantic-grounding claims together.
Reasoning quality
+48.6%
0.3740 to 0.5556
Grounding fit
100%
0.9464 to 1.0000
Task selection
+0.40
0.00 to 0.40
Exact correctness
100%
100% to 100%
The larger suite makes the reasoning claim more believable, not less.
In the broader diagnostic suite (v4), Lumenais measured a 52.2% relative reasoning lift with 155 wins, 5 losses, and 43 ties; exact correctness held at 100% on 24 deterministic tasks, and semantic grounding held at 100% across 32 ambiguity-control cases.
Broader diagnostic suite, not yet the promoted public headline; it preserves the reasoning lift at a larger sample size while showing that task selection remains an active reliability target.
Reasoning lift
+52.2%
0.3509 to 0.5340
Paired wins
155/5/43
wins / losses / ties
Mean-delta CI95
[0.1668, 0.1976]
Exact correctness
100%
24 deterministic tasks
Semantic grounding
100%
32 ambiguity-control cases
Task selection
38.98%
0.0000 to 0.3898
What this tells us
The diagnostic suite is useful precisely because it is broader: the paired reasoning lift persisted at the larger sample size, exactness held, and semantic grounding held after repair. The remaining softness is narrow and visible: task selection stayed slightly below the smaller promoted suite, so the headline remains pinned to v2 while the diagnostic suite gives reviewers the more realistic stress picture. For where the lift concentrates, see Mechanism Ablation below.
The lift comes from the quick-lite reasoning path.
A router-conditioned ablation of the broader diagnostic suite (v4) shows the lift is concentrated in quick-lite-routed rows: 164 quick-lite prompts improved from 0.3507 to 0.5777 with 155 wins, 2 losses, and 7 ties, while 39 direct-high-retained rows were essentially flat. A stratification sketch follows the figures below.
Router-conditioned ablation, not a randomized forced-routing experiment; prompt difficulty may differ between quick-lite-eligible and direct-high-retained rows.
Quick-lite rows
164
0.3507 to 0.5777
Quick-lite wins
155/2/7
wins / losses / ties
Quick-lite mean delta
+0.2270
Direct-high rows
39
0.3516 to 0.3500
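The stratification itself is simple bookkeeping over the paired rows. A minimal sketch, assuming each row carries the route label the router actually chose (field names are illustrative):

```python
# Router-conditioned ablation: split paired deltas by the route the router
# actually chose, then summarize each stratum. Field names are illustrative.
from collections import defaultdict
import numpy as np

def ablate_by_route(rows):
    """rows: dicts like {"route": "quick_lite" | "direct_high",
    "baseline": float, "guided": float}, one per prompt."""
    strata = defaultdict(list)
    for r in rows:
        strata[r["route"]].append(r["guided"] - r["baseline"])
    summary = {}
    for route, deltas in strata.items():
        d = np.asarray(deltas)
        summary[route] = {"n": int(d.size),
                          "mean_delta": float(d.mean()),
                          "wins_losses_ties": (int((d > 0).sum()),
                                               int((d < 0).sum()),
                                               int((d == 0).sum()))}
    return summary
```

Because the split conditions on the router's own choice rather than on random assignment, the per-stratum deltas describe where the lift shows up, not the causal effect of forcing a route, which is exactly the caveat stated above.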
Memory governance, not just recall.
In a 32-case live governed-memory benchmark, Lumenais recovered current reviewed project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.
The test is intentionally narrow: it asks whether the system keeps the current reviewed project facts straight when later turns include discarded brainstorming, explicit corrections, or a different project. That makes the result more informative than a simple “remember these facts” check.
Focused internal live benchmark of governed project-memory behavior; it measures update precedence, noise suppression, topic isolation, and user isolation, not general model quality or universal memory performance.
Mean governed recall
98.96%
32 live seeded cases
Memory retrieval
100%
seeded recall turns
Control leakage
0%
32 unseeded controls
Task families
4
continuity, noise suppression, updates, isolation
Procedure
Each case used a fresh synthetic user. The suite seeded reviewed project context, then varied the later turn with either no intervention, rejected/noisy brainstorming, a superseding correction, or an unrelated project. The recall turn was scored only against the visible answer; an unseeded control user received the same prompt to check leakage and baseline guessing.
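As a schematic of that procedure, the sketch below fixes the case structure. The client object and every method on it are hypothetical stand-ins for the real harness, not a Lumenais API, and the proxy scorer is deliberately simple.

```python
# Schematic of the governed-memory procedure described above. The client
# object and its methods (new_synthetic_user, seed_memory, intervene, ask)
# are hypothetical stand-ins for the real harness, shown only to fix the
# case structure.
INTERVENTIONS = ("none", "rejected_noise", "superseding_update", "unrelated_project")

def score_recall(answer: str, expected: list[str]) -> float:
    """Fraction of expected reviewed facts visible in the answer (a simple proxy)."""
    hits = sum(1 for fact in expected if fact.lower() in answer.lower())
    return hits / max(1, len(expected))

def run_case(client, case):
    user = client.new_synthetic_user()              # fresh synthetic user per case
    client.seed_memory(user, case.reviewed_facts)   # seed reviewed project context
    client.intervene(user, case.intervention)       # one of INTERVENTIONS
    answer = client.ask(user, case.recall_prompt)   # scored on the visible answer only
    control = client.ask(client.new_synthetic_user(), case.recall_prompt)
    return {"recall": score_recall(answer, case.expected_facts),
            "leakage": score_recall(control, case.expected_facts)}  # expect ~0
```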
The gates are measured separately from the model.
A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.
This benchmark is intentionally deterministic: it checks whether code-path controls behave as specified before any generated answer is evaluated.
Deterministic mechanism benchmark; it verifies code-path controls and telemetry, not open-ended model answer quality.
Control checks
5/5
deterministic code-path benchmark
Pass rate
100%
all mechanism checks passed
Mechanisms
5
arbitration, gates, scope, compression, telemetry
Procedure
The suite runs direct code-path checks for personal-memory supersession, collective organizational insight and heuristic write gates, audience-scoped retrieval within an organization, hub-compression representative selection, and organization-scoped collective-ingestion telemetry.
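Checks of this kind are ordinary assertion tests against code paths rather than model outputs. A minimal sketch of the shape of one such check follows; the gates object, its methods, and its fields are hypothetical placeholders, not the actual implementation.

```python
# Shape of a deterministic code-path check: no generated answer is scored;
# the test asserts that a control behaves as specified. The gates object and
# its fields are hypothetical placeholders, not the actual implementation.
def check_collective_write_gate(gates, candidate_insight):
    """An out-of-scope or low-confidence insight must be refused at write time."""
    decision = gates.evaluate_write(candidate_insight)
    assert decision.allowed is False, "gate must refuse this candidate"
    assert decision.reason in {"out_of_scope", "below_confidence_floor"}
    assert decision.telemetry_event is not None  # every refusal leaves an audit trail
```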
How we measured it
Live reasoning benchmark
Against the same-provider direct baseline, Lumenais improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.
Reasoning quality
+48.6%
0.3740 to 0.5556
Steering usefulness
0.3857
0.0125 to 0.3857
Grounding fit
100%
0.9464 to 1.0000
Sample: 56 live prompts
Measures whether the companion chooses a more useful line of thought under live conditions while staying grounded.
Average uplift across a 56-prompt suite; not a claim that every prompt improves equally.
Evidence package
Website Benchmark Suite v2
Paired same-provider direct-baseline benchmark covering 56 live reasoning prompts, with methodology and component reports retained for technical review.
Exact correctness floor
Exact correctness held at 100% on 24 callable-backed deterministic tasks, used as a deterministic safety floor alongside the broader reasoning benchmark; a minimal sketch of a callable-backed task follows this card.
Exact correctness
100%
100% to 100%
Sample: 24 deterministic tasks
Checks that the system preserves or improves closed-form correctness while adding reasoning scaffolding.
This is a deterministic callable-backed multiple-choice safety-floor metric, not an open-form reasoning benchmark.
Evidence package
Website Benchmark Suite v2 exactness slice
Callable-backed deterministic task slice used as a correctness safety-floor check alongside the broader reasoning suite.
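"Callable-backed" means the gold answer is produced by executing a function, so grading is exact matching rather than judgment. A minimal illustration with an invented arithmetic task:

```python
# Minimal callable-backed deterministic task: the gold answer is produced by
# executing a function, so grading is exact matching, not judgment. The
# specific task below is invented for illustration.
import random

def make_task(seed: int) -> dict:
    rng = random.Random(seed)
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    gold = str(a * b)                                   # callable-backed ground truth
    choices = [gold] + [str(a * b + d) for d in (-10, 1, 11)]
    rng.shuffle(choices)
    return {"question": f"What is {a} x {b}?", "choices": choices, "gold": gold}

def grade(task: dict, answer: str) -> bool:
    return answer.strip() == task["gold"]               # deterministic, no rubric
```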
Task selection
On the approved gold lens-family set, task-selection accuracy improved from 0.00 to 0.40 across 30 prompts.
Task selection
+0.40
0.00 to 0.40
Sample: 30 approved-gold prompts
Measures whether the system chooses a more useful reasoning family before answering.
Curated approved-gold lens-family benchmark; not an open-world classifier claim.
Evidence package
Website Benchmark Suite v2 task-selection slice
Approved-gold lens-family selection benchmark measuring whether the system chooses a useful reasoning family before answering.
Semantic grounding proxy
On a 16-case ambiguity-control proxy benchmark, semantic grounding landed at 1.00 artifact-class accuracy and 1.00 prompt-family accuracy.
Artifact class accuracy
1.00
Prompt-family accuracy
1.00
Sample: 16 ambiguity-control cases
Checks that ambiguous prompts stay in the right semantic universe instead of collapsing into generic or literalized readings.
Narrow ambiguity-control proxy benchmark; supporting evidence, not the headline reasoning claim.
Evidence package
Website Benchmark Suite v2 semantic-grounding slice
Focused ambiguity-control proxy benchmark measuring artifact-class and prompt-family grounding under ambiguous prompts.
Where it wins
Ambiguous Named Concept
0.3411 → 0.5872
When a prompt uses a poetic or metaphorical name, the system keeps it in the right conceptual frame instead of interpreting it literally.
Mathematical Strategy
0.3793 → 0.5536
On math problems, the system picks stronger proof strategies and gives clearer next-step guidance.
Operational Tradeoff
0.4115 → 0.5754
For real-world trade-off decisions, the system identifies the variable that actually matters instead of listing generic pros and cons.
UI System Design
0.3689 → 0.5782
On design problems, the system finds the real implementation decision point instead of writing a generic architecture overview.
High-Stakes Advisory
0.3370 → 0.4843
Under high-stakes or ambiguous pressure, the system provides grounded, practical strategies instead of vague reassurance.
Scientific Mechanism
0.3799 → 0.5809
On science questions, the system frames mechanisms more precisely and distinguishes between competing experimental approaches.
Ambiguous Abstract
0.4003 → 0.5293
For abstract or philosophical prompts, the system gives substantive framing instead of decorative language.
Example judgments
Clear win
Ambiguous concept framing
On prompts like “Glass Field,” the direct baseline tended to literalize the metaphor into visual-design language. Lumenais reframed the task around the hidden implementation constraint and produced a stronger first prototype direction.
Typical win
Operational tradeoff
Instead of listing generic pros and cons, the Lumenais path more often identified the hinge variable: the condition that would decide which option survives contact with implementation.
Clear miss
Scientific mechanism prompt
One stress-suite miss showed the system drifting into companion-style language when the prompt wanted a tighter retrieval-mechanism analysis. That failure is why task routing and semantic grounding remain explicit reliability targets.
Infrastructure & Governance
Cross-domain transfer
Across 150 governed-versus-baseline runs on five domain pairs, cross-domain transfer measured an accuracy uplift of roughly 13 percentage points over baseline.
Runs
150
Accuracy uplift
~+13 pp
Example delta
~0.79 vs ~0.66
Sample: 5 domain pairs, 150 runs
Shows that learned structure in one domain can improve adjacent domains under governance, rather than requiring per-domain retraining.
Internal UFCT governed-vs-baseline evaluation across curated domain pairs; not a consumer chat benchmark.
Evidence package
UFCT governed transfer evaluation
Internal governed-versus-baseline evaluation across curated domain pairs, summarized for public review without exposing implementation appendices.
Tools manifold routing
Tools manifold routing improved top-1 selection by 3.77 percentage points on real paired events and by 5.34 percentage points on the broader combined benchmark.
Real paired events
+3.77 pp
50.94% to 54.72%
Combined benchmark
+5.34 pp
Broad benchmark-scale evaluation
Sample: 53 real paired events; combined benchmark-scale evaluation
Measures whether learned routing improves tool choice compared with a fixed baseline policy.
Real-event significance remains underpowered at n=53; strongest support comes from the broader combined benchmark. A back-of-envelope illustration of the power limit follows this card.
Evidence package
Tools manifold routing significance package
Paired real-event and broader combined routing benchmark measuring learned tool selection against a fixed baseline policy.
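The underpowering at n=53 is visible from a back-of-envelope standard error. This arithmetic is ours, not the study's actual significance test, and it treats the rates as simple proportions rather than paired outcomes:

```python
# Back-of-envelope power check (ours, not the study's actual test). At n = 53,
# the standard error of a proportion near 50% is ~6.9 percentage points, so a
# +3.77 pp shift cannot separate from sampling noise at this sample size.
import math

n, p = 53, 0.5094                    # real paired events, baseline top-1 rate
se = math.sqrt(p * (1 - p) / n)      # ~0.0687
print(f"SE at n={n}: {se:.4f}; observed shift: 0.0378")
```

A properly paired test would differ in detail, but the order of magnitude is the point: a roughly 3.8 pp shift sits well inside roughly 6.9 pp of sampling noise at this n.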
Manifold stability
Production manifolds validated above 91% accuracy while monitored training drift stayed within a bounded L2 range of 0.014 to 0.121. A note on how we read the drift metric follows this card.
Validation accuracy
>91%
L2 drift band
0.014–0.121
Convergence
1–13 epochs
Sample: Nine trained manifolds
Supports the claim that learning components remain stable enough to deploy under governance.
Training and validation stability evidence for manifolds, not a live companion benchmark.
Evidence package
Manifold validation and drift report
Training and validation stability evidence for production manifolds, including validation accuracy and bounded L2 drift ranges.
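We read the drift band as the L2 norm of the parameter delta between monitored checkpoints; that reading is our assumption, shown here only to make the metric concrete:

```python
# L2 drift between two parameter snapshots, read as the norm of their delta.
# That this is exactly the monitored quantity is our assumption.
import numpy as np

def l2_drift(w_before: np.ndarray, w_after: np.ndarray) -> float:
    return float(np.linalg.norm(w_after - w_before))
```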
Mesh sharding speed
On a sharded synthesis workload, mesh-parallel execution achieved a 2.74x mean speedup over local execution across 10 queries.
Mean speedup
2.74x
CI95
2.66x–2.83x
Queries
10
Sample: 10 benchmark queries
Shows that the mesh can materially reduce wall-clock time for sharded synthesis workloads.
Measures orchestration and distributed execution speed for a specific sharded synthesis workload, not model quality.
Evidence package
Mesh synthesis sharding benchmark
Sharded synthesis workload comparison measuring wall-clock speedup for mesh-parallel execution versus local execution.
Research Lab
PIMA Diabetes
On the PIMA Diabetes benchmark, the research pipeline reached 85.3% AUC on 768 rows while correctly avoiding harmful transfer.
AUC
85.3%
Rows
768
Sample: 768 rows
Shows parity-level performance on a clean medical classification benchmark with governance preventing negative transfer.
Dataset-task benchmark for the research platform, not a live companion benchmark.
Evidence package
QARIN Research Lab tabular benchmark report
Dataset-task benchmark evidence for the research pipeline on PIMA Diabetes, with governance preventing harmful transfer.
Non-linear stress test
On the non-linear stress benchmark, the research pipeline reached 90.8% AUC, outperformed the linear baseline by 10.5%, and filtered 87% of noise columns.
AUC
90.8%
Lift vs linear baseline
+10.5%
Noise filtered
87%
Sample: 1,000 rows, 23 features
Shows autonomous signal detection and noise filtering on a deliberately difficult synthetic benchmark.
Synthetic signal-vs-noise benchmark; illustrates autonomous feature selection, not a production customer metric.
Evidence package
QARIN Research Lab non-linear stress benchmark
Synthetic signal-versus-noise benchmark measuring autonomous signal detection and noise filtering under controlled conditions.
Adult Census
On Adult Census, the research pipeline reached 91.1% AUC on 30,162 rows and 96 features while degrading gracefully when dynamic grouping timed out.
AUC
91.1%
Rows
30,162
Features
96
Sample: 30,162 rows, 96 features
Shows robustness on high-dimensional, messy, real-world tabular data.
Dataset-task benchmark for robustness and fallback behavior, not a live companion benchmark.
Evidence package
QARIN Research Lab Adult Census benchmark
High-dimensional tabular benchmark measuring robustness and graceful degradation under dynamic grouping timeouts.
Symbolic regression
The symbolic-regression stack recovered Kepler's Third Law and the Rydberg Formula with perfect fit on standard benchmark tasks; a quick verification sketch of the Kepler target follows this card.
Kepler fit
R² = 1.0
Kepler complexity
4 nodes
Rydberg fit
R² = 1.0
Sample: Standard physics benchmark tasks
Shows interpretable equation discovery rather than black-box prediction alone.
Physics symbolic-regression benchmark; demonstrates the research pipeline, not the consumer companion.
Evidence package
QARIN symbolic-regression benchmark report
Physics-law recovery benchmark demonstrating interpretable equation discovery on standard symbolic-regression tasks.
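The Kepler target is easy to sanity-check independently: in units of years and AU, Kepler's Third Law is T² = a³, so a log-log fit over planetary data should return a slope of 1.5 with R² of 1. The sketch below uses standard rounded orbital values and is our verification, not the benchmark's own harness.

```python
# Sanity check of the Kepler recovery target: log T vs log a for the planets
# should fit a line of slope 1.5 (T^2 = a^3 in years and AU). Values rounded.
import numpy as np

a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])    # semi-major axis, AU
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])  # orbital period, years

slope, intercept = np.polyfit(np.log(a), np.log(T), 1)
pred = slope * np.log(a) + intercept
ss_res = np.sum((np.log(T) - pred) ** 2)
ss_tot = np.sum((np.log(T) - np.log(T).mean()) ** 2)
print(f"slope ~ {slope:.3f} (expect 1.5), R^2 ~ {1 - ss_res / ss_tot:.5f}")
```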
Alzheimer’s biomarker discovery
On the GSE84422 Alzheimer’s biomarker task, the research pipeline validated at AUC 0.855 on 2,004 samples across 19 brain regions.
Validation AUC
0.855
Samples
2,004
Brain regions
19
Sample: 2,004 samples, 19 regions
Shows structured discovery on a real biological dataset with literature-grounded marker interpretation.
Scientific discovery benchmark on curated transcriptomics data; not a live companion eval.
Evidence package
QARIN biomedical discovery benchmark report
Curated transcriptomics benchmark and literature-grounded marker interpretation for the GSE84422 Alzheimer’s task.
FieldHash & Provenance
These benchmarks cover the provenance layer behind selected audited artifacts and governance decisions. For the narrative overview and evidence-path context, start with the FieldHash pages.
FieldHash hardening closure
On the measured adversarial synthesis benchmark, a uniform-blend attack succeeded in 15 of 800 trials against the standard profile, while the hardened profile closed that gap to 0 of 800.
Standard profile
15/800
1.875%
Hardened profile
0/800
Sample: 800 trials per profile
Shows that hardening materially closed a measured attack family rather than relying on a generic security narrative.
Attack-family measurement on a specific adversarial synthesis benchmark; not a universal security guarantee.
Evidence package
FieldHash adversarial hardening package
Measured adversarial synthesis benchmark comparing standard and hardened profiles against a uniform-blend attack family.
FieldHash production-gated adaptive campaign
In the calibration-conditioned adaptive ML campaign, production-gated verification measured 0 of 5,000 successful forgeries per tested model, with a Wilson 95% upper bound of 0.0768%; a bound-reproduction sketch follows this card.
Production-gated acceptance
0/5000
Wilson 95% upper bound
0.0768%
Sample: 5,000 trials per tested model
Shows that the production-gated path held under stronger adaptive attacks than the policy-only path.
Per-tested-model result under the documented production-gated verifier and no-signing-key assumption; not an absolute impossibility claim.
Evidence package
FieldHash adaptive spoofing campaign
Calibration-conditioned adaptive ML spoofing campaign under the documented production-gated verifier and no-signing-key assumption.
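The quoted bound is reproducible directly: for 0 successes in n trials, the Wilson upper limit reduces to z² / (n + z²). A minimal check with z = 1.96:

```python
# Reproduces the quoted Wilson 95% upper bound for 0 successes in 5,000
# trials: for k = 0 the bound reduces to z^2 / (n + z^2), ~0.0768% here.
import math

def wilson_upper(k: int, n: int, z: float = 1.96) -> float:
    p = k / n
    center = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + half) / (1 + z * z / n)

print(f"{wilson_upper(0, 5000):.4%}")  # -> 0.0768%
```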
Scientific Caveats
Not AGI: These metrics measure reasoning quality and constraint adherence, not consciousness or broad artificial general intelligence.
Sample Size: These benchmarks are broad enough to be meaningful, but they still represent evaluated slices rather than every possible workload.
Mean Lift: +48.6% reasoning lift is a mean improvement across the 56-prompt benchmark set. Individual prompts may show higher or lower improvement.
Broader diagnostic suite (v4): The May 2026 diagnostic suite measured +52.2% reasoning lift across 203 live prompts, with exactness and semantic grounding holding at 100%. It remains diagnostic because task selection is still an active reliability target.
Scope: Exact correctness is a deterministic safety-floor check, and semantic grounding is a focused ambiguity-control proxy rather than the main headline claim.
Review the evidence.
The public evidence page summarizes the current benchmark artifacts. Organizations can request access, the whitepaper, and deeper technical review materials.
Request access
Ready to build?
This page is the evidence. The whitepaper explains the architecture behind it. The case studies show what the reasoning looks like in practice.