$ cd ../projects  # back to list
drwxr-xr-x | [ongoing] | published

agzamov-test/

Published benchmark with DOI

Do AI augmentations actually work? Chess960 + poker measure the real delta.

$ cat ./README.md

A dual-axis adversarial benchmark that tests whether AI augmentations (memory, tools, RAG, orchestration) produce measurably better behavior under pressure. Two agents play repeated games of Chess960 (complete information) and poker (incomplete information) across four augmentation levels. The result is a single number: the Agzamov Score (0-100). Phase 0 was validated with Claude Sonnet 4: 30 games, a 96.7% win rate, 0 errors. Published on Zenodo with a DOI.
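The core measurement can be sketched in a few lines. This is a minimal illustration, assuming a simple win-rate definition of the delta and one plausible 0-100 normalization; the benchmark's published formulas may differ, and all names here are illustrative.

```python
def agzamov_delta(augmented_wins: int, naked_wins: int, games: int) -> float:
    """Win-rate gap between the augmented and naked agent, in [-1, 1].

    Illustrative definition; the published Agzamov Delta may differ.
    """
    return (augmented_wins - naked_wins) / games

def agzamov_score(delta: float) -> float:
    """Map a delta from [-1, 1] onto a 0-100 scale (one plausible mapping)."""
    return 50.0 * (delta + 1.0)

# Example: augmented agent wins 24/30 games, naked agent wins 15/30.
delta = agzamov_delta(24, 15, 30)   # 0.3
score = agzamov_score(delta)
```

A delta of 0 (augmentation changes nothing) lands at the midpoint score of 50 under this mapping.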

$ cat ./features.txt

>Dual-axis: Chess960 (complete info) + No-Limit Hold'em (incomplete info)

>Agzamov Delta (Δₐ): the performance gap attributable to augmentation

>4 augmentation levels: Naked, Memory, Tools, Full Stack

>6 phases: sanity → baseline → augmented vs naked → arms race → full stack → stress test

>Post-game Stockfish analysis with CPL, GQI, blunder tracking

>3-model comparison: Sonnet 4.6, Opus 4.5, Opus 4.6 (thinking); EC Gap confirmed as architectural

>Domain extrapolation framework: chess-like vs poker-like domains

>Published paper on Zenodo (DOI: 10.5281/zenodo.18771523)
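
The post-game analysis step above can be sketched as follows, assuming Stockfish evaluations (in centipawns, from the mover's perspective) have already been collected before and after each move. The 300 cp blunder threshold is a common convention, not necessarily the benchmark's definition, and `game_quality` is a hypothetical helper.

```python
BLUNDER_CP = 300  # assumed threshold: a drop of 300+ centipawns counts as a blunder

def game_quality(evals_before: list[int], evals_after: list[int]):
    """Return (average centipawn loss, blunder count) for one side's moves.

    Simplified sketch: both eval lists are taken from the mover's
    perspective, and losses are clamped at zero (a move cannot "gain"
    against the engine's best line).
    """
    losses = [max(0, before - after)
              for before, after in zip(evals_before, evals_after)]
    acpl = sum(losses) / len(losses)
    blunders = sum(1 for loss in losses if loss >= BLUNDER_CP)
    return acpl, blunders

# Example: four moves; the third drops 350 cp, so it is flagged as a blunder.
acpl, blunders = game_quality([20, 35, 40, -10], [35, 30, -310, 0])
```

In the real pipeline the evals would come from an engine (e.g. Stockfish driven via a UCI wrapper); here they are hard-coded to keep the sketch self-contained.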

$ cat ./stack.txt

Python / Stockfish / Claude API / Zenodo

Interested in this project?

>_contact