mnemo · live research dashboard
Memento-Skills (Zhou et al. 2026, arXiv:2603.18743) introduced a Read-Write Reflective Learning loop in which executable skills serve as evolving agent memory. The original paper validates the loop on a single LLM provider against benchmark suites. We replicate it under multi-provider, multi-tenant production traffic, and we report the numbers as they happen.
Pre-registered (locked at study start). GDPR-anonymised. Replication package open-sourced. 28-day window. 8 providers. 15 personas. Real workloads from real tenants — never named, never identified.
Study starts in 3 days · captured 28 Apr, 00:55 UTC
The substrate
Every counter on this page is a real number from production. Once the study window opens (2026-05-01) the numbers move every 10 minutes. There are no tenant names anywhere on this dashboard, only their aggregate footprint — GDPR Art. 6(1)(f) plus an opt-out switch in every Settings panel.
Global personas
15
C-O-R-S framework, native disciplines or universal Layer 2.5
Active skills
110
Across the persona library, evolution-loop ready
Active tenants
8
Hashed in every export, never named
Skill executions
2
Real production calls since loop went live (2026-04-28)
Discussions total
165
Lifetime, all tenants combined
Workflows total
249
Lifetime, all tenants combined
Discovery candidates
4
Heuristica triad output awaiting human review
Skill drafts
1
LLM-evolved prompts pending Apply / Discard
The four hypotheses
The pre-registration enumerates four primary hypotheses, the metrics that would confirm or refute each, the stop criteria, and a list of analyses we may NOT claim because we did not pre-commit to them. Reviewers can verify deviations against the locked Git tag.
H1 — Does the loop actually learn?
Skills with active Read-Write Reflective Learning loops should show non-decreasing utility-score trajectories over 28 days. We test it on real production traffic — not a benchmark, not a sandbox.
H2 — Do skills transfer across providers?
When a skill prompt is evolved using Anthropic Opus 4.7 as the judge, does it perform non-inferiorly under OpenAI GPT-5, Google Gemini, Kimi-K2, xAI Grok, DeepSeek, OpenRouter, and Ollama? Memento answered this for one provider. We answer it for eight.
H3 — Does adversarial inversion add signal?
Heuristica triad — Ipcha (adversarial) + Metis (synthesis) + Cael (judge) — should produce skill-discovery candidates with measurably higher human-adoption rate than a Metis-only baseline. Constitutional self-critique in production.
H4 — Does native Constitutional layer beat universal?
Personas with natively integrated Ipcha self-check (Aristaeus, Metis, Nefilibata) should show fewer hallucination flags than personas relying on the universal Layer 2.5 disciplines, on matched-domain queries.
Read → Execute → Judge → Write
The Read step retrieves the best matching skill via hybrid pgvector + tsvector search. Execute calls the user's chosen provider. Judge classifies success or failure (currently keyword classifier; LLM-judge ships pre-lock). Write updates the skill's utility score, queues an evolution draft when the threshold trips, or escalates to Heuristica discovery on a fresh failure pattern.
1 · Read
findBestSkill — hybrid embedding + simple-tokenizer text rank, MIN_SCORE 0.2, utility-weighted
2 · Execute
provider.complete(...) — Anthropic, OpenAI, Google, Kimi, xAI, DeepSeek, OpenRouter, Ollama
3 · Judge
classifyImplicitFeedback (keyword) + LLM-judge (Haiku 4.5, parallel) — agreement reported
4 · Write
recordSkillFeedback — utility = succ/(succ+fail), provider/model/cost attribution
5 · Evolve
createEvolutionDraft (Cael unit-test gate) OR Heuristica triad discovery — Ipcha + Metis + Cael
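The Write and Evolve steps above can be sketched in TypeScript. This is an illustrative sketch, not the production code: only the utility formula utility = succ/(succ+fail) comes from this page, while `SkillRecord`, `EVOLUTION_THRESHOLD`, and `shouldQueueEvolutionDraft` are hypothetical names and values.

```typescript
// Sketch of the Write/Evolve steps; names and threshold are assumptions.
interface SkillRecord {
  name: string;
  succ: number; // successful executions
  fail: number; // failed executions
}

// Hypothetical threshold below which an evolution draft is queued.
const EVOLUTION_THRESHOLD = 0.5;

// Write: update counters and recompute utility = succ / (succ + fail).
function recordSkillFeedback(skill: SkillRecord, success: boolean): number {
  if (success) skill.succ += 1;
  else skill.fail += 1;
  return skill.succ / (skill.succ + skill.fail);
}

// Evolve: trip when the utility score falls under the threshold.
function shouldQueueEvolutionDraft(utility: number): boolean {
  return utility < EVOLUTION_THRESHOLD;
}
```

In the real loop the draft then passes the Cael unit-test gate before a human sees the Apply / Discard choice; this sketch only shows the arithmetic that drives the trigger.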
Tenants bring their own keys (BYOK). Skills evolve under whichever judge the tenant prefers, then run under whichever executor a downstream call selects. H2 is exactly this question.
Other providers will populate as users route real workloads; expect a cold-start period.
Most recent skill matches, evolutions, and discovery candidates. Persona and skill names only — no tenant attribution, no message content.
The Pantheon, live
Every persona ships with five seed skills (Nefilibata is the most recent — added 2026-04-27, still hydrating). Once a persona is invoked, its skills accumulate utility data. Personas with native Constitutional self-check (Aristaeus, Metis, Nefilibata) carry their own Ipcha-inversion routine; the rest inherit it from the universal Layer 2.5 — H4 measures the difference.
Metis
System Intelligence
5 skills
Aristaeus
System Intelligence
5 skills
Aletheia
Compliance
5 skills
Nemesis
Engineering
5 skills
Harmonia
Design
5 skills
Cael
System Intelligence
5 skills
Ipcha Mistabra
System Intelligence
5 skills
Hermes
Marketing
5 skills
Athena
Engineering
5 skills
NyxCore
System Intelligence
5 skills
Themis
Compliance
5 skills
Prometheus
Compliance
5 skills
Tyche
Marketing
5 skills
Clotho
Lifestyle
5 skills
Nefilibata
System Intelligence
0 skills
Full persona spread: the pantheon →
Who runs this
This is not an anonymous lab study. The pre-registration, the GDPR balancing test, and the locking statement carry three signatures. Lisa wrote the chassis the loop rides on and the morphone theory that powers the §5.2 exploratory analysis. Martyna designed the anonymisation pipeline and authored the GDPR assessment. Oliver runs the platform and the study.
Primary investigator
nyxCore Systems · Founder & Chief Architect
Platform architecture, study design, hypothesis specification, primary author of the pre-registration. Voice of Aristaeus.
Co-investigator — skill memory & morphone theory
nyxCore Systems · CKB · Core Systems Engineer
Built the skill-memory chassis; the §5.2 morphone-detection exploratory analysis derives from her work on Morphogenic Intelligence (Welsch 2024). Voice of Metis.
Co-investigator — GDPR & anonymisation
nyxCore Systems · claritas-ai-consulting · Data Sovereignty Counsel
Designed the anonymisation pipeline (§4.3, k-anonymity ≥ 5, no free text), authored the GDPR Art. 6(1)(f) balancing test, owns Art. 32 tenant isolation. Voice of Themis.
No black box
Three documents are published before any data flows. The pre-registration locks the four hypotheses, the metrics, the stop criteria, and a list of analyses we may NOT claim. The GDPR balancing test documents the legal basis under Art. 6(1)(f). The replication repository (open after study end) carries the anonymised dataset and analysis scripts so anyone can reproduce every figure.
Pre-Registration v1.0
4 hypotheses, study design, anonymisation pipeline, stop criteria, replication promise. Locked at the Git tag below at the start of data collection (2026-05-01).
GDPR balancing test
Art. 6(1)(f) legitimate-interest assessment. Documents the alternatives considered, the mitigations applied, and the explicit data-subject-rights preserved.
Replication package
MIT-licensed. Anonymised dataset, analysis scripts (Python + R), schema-and-pipeline freeze, deviation log. Released at study end.
The open question Memento leaves
Memento-Skills validates the loop with Gemini-3.1-Flash on benchmark Q&A. Three things stay open: cross-provider transfer (do skills evolved by Anthropic perform under OpenAI?), production workloads (no ground truth, only implicit user feedback), and Constitutional self-critique (does adversarial inversion add real signal?). nyxCore answers all three, in public.
What we may claim
What we explicitly may not claim
GDPR
This page renders aggregated research metrics from the nyxCore production system. No tenant names, no user identities, no content from discussions, workflows, or skill prompts is displayed or exported. Every metric is gated by k-anonymity (k ≥ 5) and a minimum sample size of 10 data points per cell.
What we collect · what we process
What we do not collect or display
userId columns are dropped before export. Per-user counts only appear when n ≥ 5.
Legal basis
Collection: GDPR Art. 6(1)(b) (contract performance — the platform writes these counters as part of normal operation).
Secondary use for research: GDPR Art. 6(1)(f) (legitimate interest), weighed in the publicly inspectable balancing test.
Retention & deletion
Daily snapshots are retained for 90 days and then dropped automatically. The HMAC salt used to pseudonymise tenant IDs is deleted from production at the close of the study window (2026-05-29). After that point the aggregate data is anonymous under GDPR Art. 4(1) and no longer personal data.
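The pseudonymisation scheme can be illustrated with Node's built-in crypto module. A minimal sketch under stated assumptions: `pseudonymiseTenant` and the 16-character truncation are hypothetical; the production salt handling is not shown, and deleting the salt is what makes the mapping irreversible.

```typescript
import { createHmac } from "crypto";

// Pseudonymise a tenant ID with the study-scoped HMAC salt.
// Once the salt is deleted at study close, the mapping cannot be
// recomputed, so the aggregates become anonymous data.
function pseudonymiseTenant(tenantId: string, salt: string): string {
  return createHmac("sha256", salt)
    .update(tenantId)
    .digest("hex")
    .slice(0, 16); // truncated for readability in exports (assumption)
}
```

The same tenant always maps to the same token while the salt exists, which is what lets per-tenant aggregation work without ever exporting the tenant name.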
Your rights as a tenant
Opt-out: Settings → Privacy → Research participation. Effective immediately; data is excluded from future snapshots and any already-captured snapshot is purged of the opted-out tenant within 24 hours.
Right of access (Art. 15): a list of the fields collected for your tenant while the salt mapping still exists. Send a request to research@nyxcore.cloud.
Right to lodge a complaint: with the competent supervisory authority under GDPR Art. 77.
Re-identification check
Before every export, a separate verification routine (scripts/study/k-anonymity-check.sql) checks that every reported aggregation cell has k ≥ 5 underlying tenants. Cells below that threshold are folded into a residual “small-tenants” bucket. If the check fails, no data is exported. The check output is part of the published replication package.
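The folding rule can be sketched in TypeScript. The real check is the SQL routine named above (scripts/study/k-anonymity-check.sql); the `Cell` shape, `applyKAnonymity`, and its field names here are hypothetical.

```typescript
// Sketch of the k-anonymity fold: cells backed by fewer than K tenants
// are merged into a residual "small-tenants" bucket before export.
const K = 5;

type Cell = { label: string; tenantCount: number; value: number };

function applyKAnonymity(cells: Cell[]): Cell[] {
  const kept = cells.filter((c) => c.tenantCount >= K);
  const small = cells.filter((c) => c.tenantCount < K);
  if (small.length === 0) return kept;
  // Fold all sub-threshold cells into one residual bucket.
  const residual: Cell = {
    label: "small-tenants",
    tenantCount: small.reduce((n, c) => n + c.tenantCount, 0),
    value: small.reduce((n, c) => n + c.value, 0),
  };
  return [...kept, residual];
}
```

In production the separate verification routine then re-checks every exported cell; if any cell still falls below k, the export is blocked entirely rather than published.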
Concretely: before any statistic reaches this page, the data has been anonymised, k-anonymity-verified, and reduced to the aggregation level. If you suspect that any displayed number violates this guarantee, write to us — we take the affected cell offline until the issue is resolved.
Full nyxCore landing-page privacy notice: /privacy · Imprint: /imprint
It's running. Watch.
If you're a researcher, an operator, or a curious skeptic — pin the URL. The dataset goes public on 2026-06-29 with the arXiv preprint. Until then, watch the numbers move. If you're working on adjacent questions and want to talk methodology before we lock more of it, write us.
Suggested LinkedIn post
We're running a 28-day pre-registered replication of the Memento-Skills Read-Write Reflective Learning loop — under live multi-provider, multi-tenant production traffic.

8 LLM providers. 15 named personas. Constitutional self-critique via the Heuristica triad (Ipcha + Metis + Cael).

Public dashboard: https://mnemo.nyxcore.cloud
Pre-registration: https://github.com/nyxCore-Systems/nyxcore-systems/blob/main/docs/studies/2026-q2-memento-replication.md
Memento paper: https://arxiv.org/abs/2603.18743

Anonymised, GDPR-compliant, replication package open-sourced. If you research persona conditioning, skill memory, or Constitutional AI — would love methodology feedback before the lock.