998 Scoring Dimensions. 254 Tests. Your Hardware.

Everything AiBenchLab does, organized by category. No marketing — just what's built, what tier it requires, and what it does.

998

Scoring Dimensions

998 ways to prove an AI model can — or can't — do the job

254

Tests

Across 11 domains from reasoning to adversarial safety

Providers

Built-in llama.cpp, Ollama, LM Studio, LocalAI, OpenAI, Anthropic, Gemini, Grok, Groq + custom

Suites

Pre-built evaluation suites across 4 categories

Full Feature List

All tiers Pro+ Consultant+ Enterprise

Testing Engine

998 scoring dimensions across 254 tests and 11 domains — Easy to Edge Case difficulty, covering reasoning, coding, chat, multimodal, tool calling, agentic, safety, and more.

All

22 pre-built evaluation suites — Production, role-specific, comparison, and app-specific suites. Trial includes 2.

Pro+

8-step guided wizard — AI-powered intent classification matches your task to the right evaluation disciplines automatically.

All

Composite scoring — Multi-dimensional score across correctness, efficiency, safety, and clarity.

All

Position-weighted scoring — Catches models that game test order by weighting position in the response.

All

Hallucination penalty — Built into every text-based evaluation. Factual accuracy is not optional.

All

Deterministic evaluator — Same input, same score, every time. No variance in evaluation.

All

Safety-first scoring — Deployment risk and adversarial domains scored separately so critical failures surface immediately.

All

Add/remove tests from suite — Customize which tests run before execution.

Pro+

Create custom suites — Build suites from scratch with domain weights and pass thresholds.

Consultant+

Save As New Suite — Save customized configurations as reusable suite templates.

Consultant+

Performance Metrics

TTFT (Time to First Token) — How fast the model starts responding.

Pro+

TPOT (Time Per Output Token) — Per-token generation speed.

Pro+

TPS (Tokens Per Second) — Overall throughput.

Pro+

E2E Latency — Total time from prompt to complete response.

Pro+

Per-domain breakdown — Scores per domain in results view.

All

Individual test results — Expandable per-test detail with scoring criteria.

All

Model comparison — Side-by-side comparison of up to 15 models.

Pro+

Speed vs accuracy scatter plot — Visual tradeoff analysis across models.

Pro+

AI-generated recommendations — Pre-benchmark model recommendations based on questionnaire.

Pro+

Benchmark history — Full session history with search, filter, and pagination.

Pro+

GPU/VRAM monitoring — Live temperature, load, memory with sparkline charts.

All

Thermal protection — Auto-pause between models at configurable temp threshold. Emergency abort at 95°C.

All

Providers

Ollama — Auto-detected at localhost:11434.

All

LM Studio — OpenAI-compatible at localhost:1234.

All

LocalAI — OpenAI-compatible at localhost:8080.

All

Built-in llama.cpp — Managed server with bundled quick-start model.

All

OpenAI — Cloud API via generic OpenAI handler.

All

Anthropic — Dedicated handler for Claude models.

All

Google Gemini — Dedicated handler.

All

xAI (Grok) — OpenAI-compatible.

All

Groq — OpenAI-compatible.

All

Custom OpenAI-compatible — Any endpoint that speaks the OpenAI API format.

All

Reports & Exports

PDF single-session report — Executive summary, per-domain breakdown, individual test details.

All*

PDF comparison report — 10+ configurable sections, scatter plots, cost analysis.

Pro+

JSON export — Full results with timestamps and hardware metadata.

Pro+

CSV export — 23 columns including forensic metadata.

Pro+

MBX signed export — Tamper-evident with SHA-256 content hash and hardware fingerprint.

All

Batch export (ZIP) — Multiple sessions in one archive. CLI/API only.

Pro+

Report section customization — 10 toggleable sections in comparison reports.

Pro+

Branding & White-Label

Company name & logo on reports — PNG/JPG/SVG, max 2MB.

Pro+

Tagline and contact info — Appears on report cover page.

Pro+

Custom report footer — Your text on every page.

Pro+

Remove AiBenchLab branding — Full white-label reports for client delivery.

Consultant+

Security & Integrity

Model fingerprinting (SHA-256) — Hardware + file fingerprinting, no PII.

Pro+

GGUF metadata extraction — Native binary parser, 20+ metadata fields.

Pro+

Context window testing (MRCR) — Distributed Anchor Recovery for measuring effective context length.

Pro+

Plugin management — Enable/disable third-party test plugins.

Consultant+

Custom test creation — JSON/YAML definitions with judge model support and prompt templating.

Enterprise

Interfaces

GUI Desktop (Tauri) — Svelte frontend + Rust backend. Full UI with wizard, results, catalog.

All

CLI interface — Headless: run, export, list-models, estimate-cost.

Pro+

REST API — Axum-based with token auth. /health, /models, /benchmark, /export.

Consultant+

MCP Server — 11 tools via stdio: list_models, run_benchmark, export_report, etc.

Consultant+

Licensing & Support

License duration — 14-day trial, then lifetime for paid tiers.

All

Seats — 1 (Trial/Pro), 3 (Consultant), Site license per location (Enterprise).

All

12 months of updates included — 12 months of updates included with every license purchase.

Pro+

Annual update subscription — Optional renewal for continued updates — version upgrades, new suites, new providers, security patches.

Pro+

Priority email support — Priority email support.

Pro

Priority + direct support — Priority queue with direct access.

Consultant

Dedicated support — Named support contact.

Enterprise

Founding 100 direct channel — Direct access to the developer for first 100 customers.

Pro+

* Trial PDF reports are watermarked with redacted test details.

Screenshots

Don't pay for another AI model until you know it can do the job.

Join the waitlist to be first in line when the free trial launches.

Join the Waitlist

View Pricing