We tested the promise directly.
We added Lumenais on top of the same base-model provider, then compared the resulting workflow against the direct baseline. The question is simple: does the added layer produce better decisions, stronger task selection, and more grounded outputs?
What changed
Persistent memory, synthesis scaffolding, routing, and benchmarked research workflows.
What improved
Higher average reasoning quality, better first moves, and more reliable constraint adherence.
How to trust it
Same-provider comparisons, public-safe evidence labels, dates, caveats, and deeper review materials when access is appropriate.
How scoring works
Reasoning quality is internally scored by a fixed programmatic rubric over artifact fit, prompt specificity, steering usefulness, creativity signals, and generic meta-answer penalties. Exactness uses callable-backed deterministic questions. Task selection uses approved-gold lens-family labels. Semantic grounding uses curated ambiguity-control cases. No blinded human-rater claim is made here; this is a reproducible workflow-scaffolding benchmark, not a provider replacement benchmark.
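For intuition, here is a minimal sketch of what a fixed programmatic rubric of this shape can look like. The component names follow the list above; the weights and the flat meta-answer penalty are assumed values for exposition, not Lumenais scoring internals.

```python
# Illustrative sketch of a fixed programmatic rubric of the shape described
# above. The weights and the flat meta-answer penalty are assumed values for
# exposition, not Lumenais scoring internals.
from dataclasses import dataclass

@dataclass
class RubricScores:
    artifact_fit: float         # output matches the requested artifact class (0..1)
    prompt_specificity: float   # engages this prompt, not just its genre (0..1)
    steering_usefulness: float  # picks a useful next move (0..1)
    creativity_signal: float    # non-templated framing (0..1)
    is_generic_meta: bool       # "here are some things to consider..." answers

WEIGHTS = {"artifact_fit": 0.30, "prompt_specificity": 0.30,
           "steering_usefulness": 0.25, "creativity_signal": 0.15}
META_PENALTY = 0.20  # assumed flat deduction for generic meta-answers

def composite(s: RubricScores) -> float:
    base = sum(WEIGHTS[k] * getattr(s, k) for k in WEIGHTS)
    return max(0.0, base - (META_PENALTY if s.is_generic_meta else 0.0))
```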
How the broader diagnostic suite (v4) was run
Prompt set
203 prompts across 7 families, balanced at 29 prompts each.
Arms
Direct same-provider baseline versus the Lumenais-guided workflow.
Held constant
Provider family, prompt set, API path class, and scoring rubric.
What changed
Memory/context gating, quick-lite routing, grounding, selector behavior, and manifold posture.
Statistics
Paired wins/losses/ties, mean delta, median delta, bootstrap 95% CI, and sign test; a reproduction sketch follows this block.
Promotion rule
v2 stays promoted; v4 remains diagnostic because, while it is larger and more realistic, task selection is still active work.
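The paired statistics above are standard and easy to reproduce. A minimal sketch, assuming the per-prompt composite scores for both arms are available as two aligned arrays (all names are illustrative):

```python
# Paired-arm statistics: wins/losses/ties, mean and median delta, bootstrap
# 95% CI on the mean delta, and a sign test on wins versus losses. `baseline`
# and `guided` are assumed to be aligned per-prompt composite score arrays.
import numpy as np
from scipy.stats import binomtest

def paired_stats(baseline, guided, n_boot=10_000, seed=0):
    delta = np.asarray(guided) - np.asarray(baseline)
    wins, losses = int((delta > 0).sum()), int((delta < 0).sum())
    ties = int(delta.size) - wins - losses
    rng = np.random.default_rng(seed)
    boots = [rng.choice(delta, delta.size, replace=True).mean()
             for _ in range(n_boot)]
    ci_lo, ci_hi = np.percentile(boots, [2.5, 97.5])
    sign_p = binomtest(wins, wins + losses, 0.5).pvalue  # ties excluded
    return {"wins_losses_ties": (wins, losses, ties),
            "mean_delta": float(delta.mean()),
            "median_delta": float(np.median(delta)),
            "ci95": (float(ci_lo), float(ci_hi)),
            "sign_test_p": float(sign_p)}
```

On the v4 results (155 wins to 5 losses), a sign test of this form rejects chance decisively; the mean-delta CI95 reported with the v4 results uses the same resampling construction.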
The category claim has to show both adaptation and restraint.
Lumenais uses “governed continual learning” narrowly: validated outcomes can change future retrieval, routing, and memory influence, while scope, confidence, compression, and telemetry gates decide what is allowed to carry forward.
It is not base-model fine-tuning, preference-data training, plain retrieval, or context compaction. The benchmarks below separate the two halves of the claim: what gets reused, and what is blocked from becoming a future prior.
Reviewed context persists
98.96%
In a 32-case live governed-memory benchmark, Lumenais recovered current reviewed project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.
Scoped memory retrieval plus correction precedence.
Bad priors are blocked
5/5
A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.
Write gates, audience scope, hub compression, and telemetry checks.
Structure transfers across domains
~+13 pp
Across 150 governed-versus-baseline runs on five domain pairs, cross-domain transfer measured an accuracy uplift of roughly 13 percentage points over baseline.
Domain manifolds blend through a governed bus.
Workflow quality improves
+48.6%
Against the same-provider direct baseline, Lumenais improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.
The same-provider baseline isolates the Lumenais layer.
Rejected-learning control example
The governed-memory suite includes later turns that introduce discarded brainstorming, superseding corrections, and unrelated project context. Passing behavior means the system can preserve reviewed work while preventing stale, noisy, or out-of-scope material from becoming the answer’s hidden prior.
The promoted public suite.
The current public claim uses the v2 suite because it cleanly supports the headline reasoning, exactness, task-selection, and semantic-grounding claims together.
Reasoning quality
+48.6%
0.3740 to 0.5556
Grounding fit
100%
0.9464 to 1.0000
Task selection
+0.40
0.00 to 0.40
Exact correctness
100%
100% to 100%
The larger suite makes the reasoning claim more believable, not less.
In the broader diagnostic suite (v4), Lumenais measured a 52.2% relative reasoning lift with 155 wins, 5 losses, and 43 ties; exact correctness held at 100% on 24 deterministic tasks, and semantic grounding held at 100% across 32 ambiguity-control cases.
Broader diagnostic suite, not yet the promoted public headline; it preserves the reasoning lift at a larger sample size while showing that task selection remains an active reliability target.
Reasoning lift
+52.2%
0.3509 to 0.5340
Paired wins
155/5/43
wins / losses / ties
Mean-delta CI95
[0.1668, 0.1976]
Exact correctness
100%
24 deterministic tasks
Semantic grounding
100%
32 ambiguity-control cases
Task selection
38.98%
0.0000 to 0.3898
What this tells us
The diagnostic suite is useful precisely because it is broader: the paired reasoning lift persisted at the larger sample size, exactness held, and semantic grounding held after repair. The remaining softness is narrow and visible: task selection stayed slightly below the smaller promoted suite, so the headline remains pinned to v2 while the diagnostic suite gives reviewers the more realistic stress picture. For where the lift concentrates, see Mechanism Ablation below.
The lift comes from the quick-lite reasoning path.
A router-conditioned ablation of the broader diagnostic suite (v4) shows the lift is concentrated in quick-lite-routed rows: 164 quick-lite prompts improved from 0.3507 to 0.5777 with 155 wins, 2 losses, and 7 ties, while 39 direct-high-retained rows were essentially flat. A stratification sketch follows the figures below.
Router-conditioned ablation, not a randomized forced-routing experiment; prompt difficulty may differ between quick-lite-eligible and direct-high-retained rows.
Quick-lite rows
164
0.3507 to 0.5777
Quick-lite wins
155/2/7
wins / losses / ties
Quick-lite mean delta
+0.2270
Direct-high rows
39
0.3516 to 0.3500
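The stratification itself is simple bookkeeping over the paired rows. A minimal sketch, assuming each row carries the route label the router actually chose (field names are illustrative):

```python
# Router-conditioned ablation: split paired deltas by the route the router
# actually chose, then summarize each stratum. Field names are illustrative.
from collections import defaultdict
import numpy as np

def ablate_by_route(rows):
    """rows: dicts like {"route": "quick_lite" | "direct_high",
    "baseline": float, "guided": float}, one per prompt."""
    strata = defaultdict(list)
    for r in rows:
        strata[r["route"]].append(r["guided"] - r["baseline"])
    summary = {}
    for route, deltas in strata.items():
        d = np.asarray(deltas)
        summary[route] = {"n": int(d.size),
                          "mean_delta": float(d.mean()),
                          "wins_losses_ties": (int((d > 0).sum()),
                                               int((d < 0).sum()),
                                               int((d == 0).sum()))}
    return summary
```

Because the split conditions on the router's own choice rather than on random assignment, the per-stratum deltas describe where the lift shows up, not the causal effect of forcing a route, which is exactly the caveat stated above.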
Memory governance, not just recall.
In a 32-case live governed-memory benchmark, Lumenais recovered current reviewed project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.
The test is intentionally narrow: it asks whether the system keeps the current reviewed project facts straight when later turns include discarded brainstorming, explicit corrections, or a different project. That makes the result more informative than a simple “remember these facts” check.
Focused internal live benchmark of governed project-memory behavior; it measures update precedence, noise suppression, topic isolation, and user isolation, not general model quality or universal memory performance.
Mean governed recall
98.96%
32 live seeded cases
Memory retrieval
100%
seeded recall turns
Control leakage
0%
32 unseeded controls
Task families
4
continuity, noise suppression, updates, isolation
Procedure
Each case used a fresh synthetic user. The suite seeded reviewed project context, then varied the later turn with either no intervention, rejected/noisy brainstorming, a superseding correction, or an unrelated project. The recall turn was scored only against the visible answer; an unseeded control user received the same prompt to check leakage and baseline guessing.
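As a schematic of that procedure, the sketch below fixes the case structure. The client object and every method on it are hypothetical stand-ins for the real harness, not a Lumenais API, and the proxy scorer is deliberately simple.

```python
# Schematic of the governed-memory procedure described above. The client
# object and its methods (new_synthetic_user, seed_memory, intervene, ask)
# are hypothetical stand-ins for the real harness, shown only to fix the
# case structure.
INTERVENTIONS = ("none", "rejected_noise", "superseding_update", "unrelated_project")

def score_recall(answer: str, expected: list[str]) -> float:
    """Fraction of expected reviewed facts visible in the answer (a simple proxy)."""
    hits = sum(1 for fact in expected if fact.lower() in answer.lower())
    return hits / max(1, len(expected))

def run_case(client, case):
    user = client.new_synthetic_user()              # fresh synthetic user per case
    client.seed_memory(user, case.reviewed_facts)   # seed reviewed project context
    client.intervene(user, case.intervention)       # one of INTERVENTIONS
    answer = client.ask(user, case.recall_prompt)   # scored on the visible answer only
    control = client.ask(client.new_synthetic_user(), case.recall_prompt)
    return {"recall": score_recall(answer, case.expected_facts),
            "leakage": score_recall(control, case.expected_facts)}  # expect ~0
```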
The gates are measured separately from the model.
A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.
This benchmark is intentionally deterministic: it checks whether code-path controls behave as specified before any generated answer is evaluated.
Deterministic mechanism benchmark; it verifies code-path controls and telemetry, not open-ended model answer quality.
Control checks
5/5
deterministic code-path benchmark
Pass rate
100%
all mechanism checks passed
Mechanisms
5
arbitration, gates, scope, compression, telemetry
Procedure
The suite runs direct code-path checks for personal-memory supersession, collective organizational insight and heuristic write gates, audience-scoped retrieval within an organization, hub-compression representative selection, and organization-scoped collective-ingestion telemetry.
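Checks of this kind are ordinary assertion tests against code paths rather than model outputs. A minimal sketch of the shape of one such check follows; the gates object, its methods, and its fields are hypothetical placeholders, not the actual implementation.

```python
# Shape of a deterministic code-path check: no generated answer is scored;
# the test asserts that a control behaves as specified. The gates object and
# its fields are hypothetical placeholders, not the actual implementation.
def check_collective_write_gate(gates, candidate_insight):
    """An out-of-scope or low-confidence insight must be refused at write time."""
    decision = gates.evaluate_write(candidate_insight)
    assert decision.allowed is False, "gate must refuse this candidate"
    assert decision.reason in {"out_of_scope", "below_confidence_floor"}
    assert decision.telemetry_event is not None  # every refusal leaves an audit trail
```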
How we measured it
Live reasoning benchmark
Against the same-provider direct baseline, Lumenais improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.
Reasoning quality
+48.6%
0.3740 to 0.5556
Steering usefulness
0.3857
0.0125 to 0.3857
Grounding fit
100%
0.9464 to 1.0000
Sample: 56 live prompts
Measures whether the companion chooses a more useful line of thought under live conditions while staying grounded.
Average uplift across a 56-prompt suite; not a claim that every prompt improves equally.
Evidence package
Website Benchmark Suite v2
Paired same-provider direct-baseline benchmark covering 56 live reasoning prompts, with methodology and component reports retained for technical review.
Exact correctness floor
Exact correctness held at 100% on 24 callable-backed deterministic tasks, used as a deterministic safety floor alongside the broader reasoning benchmark; a minimal sketch of a callable-backed task follows this card.
Exact correctness
100%
100% to 100%
Sample: 24 deterministic tasks
Checks that the system preserves or improves closed-form correctness while adding reasoning scaffolding.
This is a deterministic callable-backed multiple-choice safety-floor metric, not an open-form reasoning benchmark.
Evidence package
Website Benchmark Suite v2 exactness slice
Callable-backed deterministic task slice used as a correctness safety-floor check alongside the broader reasoning suite.
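"Callable-backed" means the gold answer is produced by executing a function, so grading is exact matching rather than judgment. A minimal illustration with an invented arithmetic task:

```python
# Minimal callable-backed deterministic task: the gold answer is produced by
# executing a function, so grading is exact matching, not judgment. The
# specific task below is invented for illustration.
import random

def make_task(seed: int) -> dict:
    rng = random.Random(seed)
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    gold = str(a * b)                                   # callable-backed ground truth
    choices = [gold] + [str(a * b + d) for d in (-10, 1, 11)]
    rng.shuffle(choices)
    return {"question": f"What is {a} x {b}?", "choices": choices, "gold": gold}

def grade(task: dict, answer: str) -> bool:
    return answer.strip() == task["gold"]               # deterministic, no rubric
```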
Task selection
On the approved gold lens-family set, task-selection accuracy improved from 0.00 to 0.40 across 30 prompts.
Task selection
+0.40
0.00 to 0.40
Sample: 30 approved-gold prompts
Measures whether the system chooses a more useful reasoning family before answering.
Curated approved-gold lens-family benchmark; not an open-world classifier claim.
Evidence package
Website Benchmark Suite v2 task-selection slice
Approved-gold lens-family selection benchmark measuring whether the system chooses a useful reasoning family before answering.
Semantic grounding proxy
On a 16-case ambiguity-control proxy benchmark, semantic grounding landed at 1.00 artifact-class accuracy and 1.00 prompt-family accuracy.
Artifact class accuracy
1.00
Prompt-family accuracy
1.00
Sample: 16 ambiguity-control cases
Checks that ambiguous prompts stay in the right semantic universe instead of collapsing into generic or literalized readings.
Narrow ambiguity-control proxy benchmark; supporting evidence, not the headline reasoning claim.
Evidence package
Website Benchmark Suite v2 semantic-grounding slice
Focused ambiguity-control proxy benchmark measuring artifact-class and prompt-family grounding under ambiguous prompts.
Where it wins
Ambiguous Named Concept
0.3411 → 0.5872
When a prompt uses a poetic or metaphorical name, the system keeps it in the right conceptual frame instead of interpreting it literally.
Mathematical Strategy
0.3793 → 0.5536
On math problems, the system picks stronger proof strategies and gives clearer next-step guidance.
Operational Tradeoff
0.4115 → 0.5754
For real-world trade-off decisions, the system identifies the variable that actually matters instead of listing generic pros and cons.
UI System Design
0.3689 → 0.5782
On design problems, the system finds the real implementation decision point instead of writing a generic architecture overview.
High-Stakes Advisory
0.3370 → 0.4843
Under high-stakes or ambiguous pressure, the system provides grounded, practical strategies instead of vague reassurance.
Scientific Mechanism
0.3799 → 0.5809
On science questions, the system frames mechanisms more precisely and distinguishes between competing experimental approaches.
Ambiguous Abstract
0.4003 → 0.5293
For abstract or philosophical prompts, the system gives substantive framing instead of decorative language.
Example judgments
Clear win
Ambiguous concept framing
On prompts like “Glass Field,” the direct baseline tended to literalize the metaphor into visual-design language. Lumenais reframed the task around the hidden implementation constraint and produced a stronger first prototype direction.
Typical win
Operational tradeoff
Instead of listing generic pros and cons, the Lumenais path more often identified the hinge variable: the condition that would decide which option survives contact with implementation.
Clear miss
Scientific mechanism prompt
One stress-suite miss showed the system drifting into companion-style language when the prompt wanted a tighter retrieval-mechanism analysis. That failure is why task routing and semantic grounding remain explicit reliability targets.
Infrastructure & Governance
Cross-domain transfer
Across 150 governed-versus-baseline runs on five domain pairs, cross-domain transfer measured an accuracy uplift of roughly 13 percentage points over baseline.
Runs
150
Accuracy uplift
~+13 pp
Example delta
~0.79 vs ~0.66
Sample: 5 domain pairs, 150 runs
Shows that learned structure in one domain can improve adjacent domains under governance, rather than requiring per-domain retraining.
Internal UFCT governed-vs-baseline evaluation across curated domain pairs; not a consumer chat benchmark.
Evidence package
UFCT governed transfer evaluation
Internal governed-versus-baseline evaluation across curated domain pairs, summarized for public review without exposing implementation appendices.
Tools manifold routing
Tools manifold routing improved top-1 selection by 3.77 percentage points on real paired events and by 5.34 percentage points on the broader combined benchmark.
Real paired events
+3.77 pp
50.94% to 54.72%
Combined benchmark
+5.34 pp
Broad benchmark-scale evaluation
Sample: 53 real paired events; combined benchmark-scale evaluation
Measures whether learned routing improves tool choice compared with a fixed baseline policy.
Real-event significance remains underpowered at n=53; strongest support comes from the broader combined benchmark. A back-of-envelope illustration of the power limit follows this card.
Evidence package
Tools manifold routing significance package
Paired real-event and broader combined routing benchmark measuring learned tool selection against a fixed baseline policy.
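The underpowering at n=53 is visible from a back-of-envelope standard error. This arithmetic is ours, not the study's actual significance test, and it treats the rates as simple proportions rather than paired outcomes:

```python
# Back-of-envelope power check (ours, not the study's actual test). At n = 53,
# the standard error of a proportion near 50% is ~6.9 percentage points, so a
# +3.77 pp shift cannot separate from sampling noise at this sample size.
import math

n, p = 53, 0.5094                    # real paired events, baseline top-1 rate
se = math.sqrt(p * (1 - p) / n)      # ~0.0687
print(f"SE at n={n}: {se:.4f}; observed shift: 0.0378")
```

A properly paired test would differ in detail, but the order of magnitude is the point: a roughly 3.8 pp shift sits well inside roughly 6.9 pp of sampling noise at this n.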
Manifold stability
Production manifolds validated above 91% accuracy while monitored training drift stayed within a bounded L2 range of 0.014 to 0.121. A note on how we read the drift metric follows this card.
Validation accuracy
>91%
L2 drift band
0.014–0.121
Convergence
1–13 epochs
Sample: Nine trained manifolds
Supports the claim that learning components remain stable enough to deploy under governance.
Training and validation stability evidence for manifolds, not a live companion benchmark.
Evidence package
Manifold validation and drift report
Training and validation stability evidence for production manifolds, including validation accuracy and bounded L2 drift ranges.
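We read the drift band as the L2 norm of the parameter delta between monitored checkpoints; that reading is our assumption, shown here only to make the metric concrete:

```python
# L2 drift between two parameter snapshots, read as the norm of their delta.
# That this is exactly the monitored quantity is our assumption.
import numpy as np

def l2_drift(w_before: np.ndarray, w_after: np.ndarray) -> float:
    return float(np.linalg.norm(w_after - w_before))
```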
Mesh sharding speed
On a sharded synthesis workload, mesh-parallel execution achieved a 2.74x mean speedup over local execution across 10 queries.
Mean speedup
2.74x
CI95
2.66x–2.83x
Queries
10
Sample: 10 benchmark queries
Shows that the mesh can materially reduce wall-clock time for sharded synthesis workloads.
Measures orchestration and distributed execution speed for a specific sharded synthesis workload, not model quality.
Evidence package
Mesh synthesis sharding benchmark
Sharded synthesis workload comparison measuring wall-clock speedup for mesh-parallel execution versus local execution.
Research Lab
PIMA Diabetes
On the PIMA Diabetes benchmark, the research pipeline reached 85.3% AUC on 768 rows while correctly avoiding harmful transfer.
AUC
85.3%
Rows
768
Sample: 768 rows
Shows parity-level performance on a clean medical classification benchmark with governance preventing negative transfer.
Dataset-task benchmark for the research platform, not a live companion benchmark.
Evidence package
QARIN Research Lab tabular benchmark report
Dataset-task benchmark evidence for the research pipeline on PIMA Diabetes, with governance preventing harmful transfer.
Non-linear stress test
On the non-linear stress benchmark, the research pipeline reached 90.8% AUC, outperformed the linear baseline by 10.5%, and filtered 87% of noise columns.
AUC
90.8%
Lift vs linear baseline
+10.5%
Noise filtered
87%
Sample: 1,000 rows, 23 features
Shows autonomous signal detection and noise filtering on a deliberately difficult synthetic benchmark.
Synthetic signal-vs-noise benchmark; illustrates autonomous feature selection, not a production customer metric.
Evidence package
QARIN Research Lab non-linear stress benchmark
Synthetic signal-versus-noise benchmark measuring autonomous signal detection and noise filtering under controlled conditions.
Adult Census
On Adult Census, the research pipeline reached 91.1% AUC on 30,162 rows and 96 features while degrading gracefully when dynamic grouping timed out.
AUC
91.1%
Rows
30,162
Features
96
Sample: 30,162 rows, 96 features
Shows robustness on high-dimensional, messy, real-world tabular data.
Dataset-task benchmark for robustness and fallback behavior, not a live companion benchmark.
Evidence package
QARIN Research Lab Adult Census benchmark
High-dimensional tabular benchmark measuring robustness and graceful degradation under dynamic grouping timeouts.
Symbolic regression
The symbolic-regression stack recovered Kepler's Third Law and the Rydberg Formula with perfect fit on standard benchmark tasks; a quick verification sketch of the Kepler target follows this card.
Kepler fit
R² = 1.0
Kepler complexity
4 nodes
Rydberg fit
R² = 1.0
Sample: Standard physics benchmark tasks
Shows interpretable equation discovery rather than black-box prediction alone.
Physics symbolic-regression benchmark; demonstrates the research pipeline, not the consumer companion.
Evidence package
QARIN symbolic-regression benchmark report
Physics-law recovery benchmark demonstrating interpretable equation discovery on standard symbolic-regression tasks.
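The Kepler target is easy to sanity-check independently: in units of years and AU, Kepler's Third Law is T² = a³, so a log-log fit over planetary data should return a slope of 1.5 with R² of 1. The sketch below uses standard rounded orbital values and is our verification, not the benchmark's own harness.

```python
# Sanity check of the Kepler recovery target: log T vs log a for the planets
# should fit a line of slope 1.5 (T^2 = a^3 in years and AU). Values rounded.
import numpy as np

a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])    # semi-major axis, AU
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])  # orbital period, years

slope, intercept = np.polyfit(np.log(a), np.log(T), 1)
pred = slope * np.log(a) + intercept
ss_res = np.sum((np.log(T) - pred) ** 2)
ss_tot = np.sum((np.log(T) - np.log(T).mean()) ** 2)
print(f"slope ~ {slope:.3f} (expect 1.5), R^2 ~ {1 - ss_res / ss_tot:.5f}")
```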
Alzheimer’s biomarker discovery
On the GSE84422 Alzheimer’s biomarker task, the research pipeline validated at AUC 0.855 on 2,004 samples across 19 brain regions.
Validation AUC
0.855
Samples
2,004
Brain regions
19
Sample: 2,004 samples, 19 regions
Shows structured discovery on a real biological dataset with literature-grounded marker interpretation.
Scientific discovery benchmark on curated transcriptomics data; not a live companion eval.
Evidence package
QARIN biomedical discovery benchmark report
Curated transcriptomics benchmark and literature-grounded marker interpretation for the GSE84422 Alzheimer’s task.
FieldHash & Provenance
These benchmarks cover the provenance layer behind selected audited artifacts and governance decisions. For the narrative overview and evidence-path context, start with the FieldHash pages.
FieldHash hardening closure
On the measured adversarial synthesis benchmark, a uniform-blend attack succeeded in 15 of 800 trials against the standard profile, while the hardened profile closed that gap to 0 of 800.
Standard profile
15/800
1.875%
Hardened profile
0/800
Sample: 800 trials per profile
Shows that hardening materially closed a measured attack family rather than relying on a generic security narrative.
Attack-family measurement on a specific adversarial synthesis benchmark; not a universal security guarantee.
Evidence package
FieldHash adversarial hardening package
Measured adversarial synthesis benchmark comparing standard and hardened profiles against a uniform-blend attack family.
FieldHash production-gated adaptive campaign
In the calibration-conditioned adaptive ML campaign, production-gated verification measured 0 of 5,000 successful forgeries per tested model, with a Wilson 95% upper bound of 0.0768%; a bound-reproduction sketch follows this card.
Production-gated acceptance
0/5000
Wilson 95% upper bound
0.0768%
Sample: 5,000 trials per tested model
Shows that the production-gated path held under stronger adaptive attacks than the policy-only path.
Per-tested-model result under the documented production-gated verifier and no-signing-key assumption; not an absolute impossibility claim.
Evidence package
FieldHash adaptive spoofing campaign
Calibration-conditioned adaptive ML spoofing campaign under the documented production-gated verifier and no-signing-key assumption.
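The quoted bound is reproducible directly: for 0 successes in n trials, the Wilson upper limit reduces to z² / (n + z²). A minimal check with z = 1.96:

```python
# Reproduces the quoted Wilson 95% upper bound for 0 successes in 5,000
# trials: for k = 0 the bound reduces to z^2 / (n + z^2), ~0.0768% here.
import math

def wilson_upper(k: int, n: int, z: float = 1.96) -> float:
    p = k / n
    center = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + half) / (1 + z * z / n)

print(f"{wilson_upper(0, 5000):.4%}")  # -> 0.0768%
```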
Scientific Caveats
Not AGI: These metrics measure reasoning quality and constraint adherence, not consciousness or broad artificial general intelligence.
Sample Size: These benchmarks are broad enough to be meaningful, but they still represent evaluated slices rather than every possible workload.
Mean Lift: +48.6% reasoning lift is a mean improvement across the 56-prompt benchmark set. Individual prompts may show higher or lower improvement.
Broader diagnostic suite (v4): The May 2026 diagnostic suite measured +52.2% reasoning lift across 203 live prompts, with exactness and semantic grounding holding at 100%. It remains diagnostic because task selection is still an active reliability target.
Scope: Exact correctness is a deterministic safety-floor check, and semantic grounding is a focused ambiguity-control proxy rather than the main headline claim.
Review the evidence.
The public evidence page summarizes the current benchmark artifacts. Organizations can request access, the whitepaper, and deeper technical review materials.
Request access
Ready to build?
This page is the evidence. The whitepaper explains the architecture behind it. The case studies show what the reasoning looks like in practice.