shadowiq
Pillar 02 · Evaluate

Red-team every model. Score every vendor. Map every framework.

A generative system that scored clean last month can fail next Tuesday. ShadowIQ runs 70+ evaluations continuously — on your models, your vendors, and every prompt template you deploy.

How it fits · Signature 0xEV4L-0042

Evals that run on merge, on deploy, and at 3am every night.

[Dashboard] Models under eval: gpt-4o-prod · claude-3.5-support · internal/risk-v3 · bedrock/titan · agent/ops-helper
Evaluation engine: 2,400+ prompts · quorum judges
Score: 83 (per-category, crosswalked)
Injection resistance 94% · PII leakage 88% · Toxicity 97% · Demographic parity 71% · OOD robustness 82% · Memorization 45% · Tool hijack 91% · Refusal calibration 76%
How it works

Three moves, fully automated.

No long onboarding, no hand-rolled detection rules. ShadowIQ ships with defaults tuned to the regulatory floor — customize only where your risk appetite demands.

1

Pick a baseline.

Starter packs for safety (toxicity, jailbreaks, injection resistance), fairness (demographic parity, equalized odds), robustness (OOD, adversarial), and privacy (PII leakage, memorization).

2

Attach to your pipeline.

CI hook blocks unsafe merges. Nightly scheduler runs full regressions. Drift alerts fire to Slack, PagerDuty, and ServiceNow with root-cause diffs.

3

Score continuously, report automatically.

Every pass/fail is mapped to its regulatory clause. Your weekly board slide generates itself — and the evidence is already signed.
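In code, the three moves reduce to something like the sketch below. It is illustrative only: the pack names, thresholds, and clause strings are assumptions, not the shipped SDK or the actual crosswalk.

```python
"""Illustrative sketch of a ShadowIQ-style CI gate. Pack names, thresholds,
and clause strings are assumptions, not the product's API or crosswalk."""
from dataclasses import dataclass

# Move 1: pick a baseline -- starter packs plus the floors your risk appetite allows.
BASELINE = {
    "safety":     {"packs": ["toxicity", "jailbreak", "injection"],   "min_score": 0.90},
    "fairness":   {"packs": ["demographic_parity", "equalized_odds"], "min_score": 0.80},
    "robustness": {"packs": ["ood", "adversarial"],                   "min_score": 0.75},
    "privacy":    {"packs": ["pii_leakage", "memorization"],          "min_score": 0.85},
}

@dataclass
class EvalResult:
    category: str   # one of the BASELINE keys
    score: float    # 0.0-1.0 aggregate over the category's packs
    clause: str     # regulatory clause the result is crosswalked to (illustrative)

def gate(results: list[EvalResult]) -> bool:
    """Moves 2 and 3: block the merge when any category falls under its floor,
    reporting each failure with the clause it maps to."""
    passed = True
    for r in results:
        floor = BASELINE[r.category]["min_score"]
        if r.score < floor:
            passed = False
            print(f"BLOCK: {r.category} scored {r.score:.2f} < {floor:.2f} ({r.clause})")
    return passed

if __name__ == "__main__":
    nightly = [
        EvalResult("safety",   0.94, "EU AI Act Art. 15"),
        EvalResult("fairness", 0.71, "NYC LL-144 impact ratio"),  # blocks this run
    ]
    raise SystemExit(0 if gate(nightly) else 1)
```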

Capabilities · complete coverage

Every control a regulator or auditor will ask about.

Safety

Red-team suite

2,400+ adversarial prompts across injection, jailbreak, hate, self-harm, CSAM refusal, and tool-use hijack. Updated weekly.

Fairness

Demographic audits

NYC LL-144-ready bias audits. Group fairness, intersectional metrics, and counterfactual testing with your data.
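For a sense of what the simplest of those metrics computes, demographic parity compares selection rates across groups. The sketch below is a generic impact-ratio calculation, not ShadowIQ's implementation.

```python
from collections import defaultdict

def selection_rates(records: list[tuple[str, int]]) -> dict[str, float]:
    """records are (group, outcome) pairs; outcome is 1 for a positive decision."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += outcome
    return {g: positives[g] / totals[g] for g in totals}

def impact_ratio(records: list[tuple[str, int]]) -> float:
    """Lowest group selection rate divided by the highest (LL-144-style impact ratio)."""
    rates = selection_rates(records)
    return min(rates.values()) / max(rates.values())

# Example: selection rates of 0.50 vs 0.80 give a ratio of 0.625,
# below the commonly cited 0.8 rule of thumb.
data = [("A", 1)] * 5 + [("A", 0)] * 5 + [("B", 1)] * 8 + [("B", 0)] * 2
print(f"{impact_ratio(data):.3f}")
```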

Robustness

OOD & adversarial

Distribution shift, perturbation suites, typographic attacks, and agentic loop detection for long-horizon evaluations.

Privacy

Leakage & memorization

Probe for training-data leakage, PII memorization, and fine-tune overfit to customer records.
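A common way to probe verbatim memorization is prefix completion: feed the model the start of a record it may have seen and check whether it reproduces the rest. A generic sketch follows, assuming you supply a complete(prompt) callable for the model under audit; it is not the product's probe.

```python
def memorization_hit_rate(records: list[str], complete, prefix_chars: int = 64,
                          match_chars: int = 32) -> float:
    """Fraction of records whose held-out suffix the model reproduces verbatim.
    `complete(prompt)` is any callable wired to the model you are auditing."""
    probed, hits = 0, 0
    for record in records:
        prefix, suffix = record[:prefix_chars], record[prefix_chars:]
        if not suffix:
            continue  # record shorter than the prefix window; nothing to probe
        probed += 1
        if suffix[:match_chars] in complete(prefix):
            hits += 1
    return hits / probed if probed else 0.0

# Usage: memorization_hit_rate(customer_records, complete=lambda p: my_model.generate(p))
# (`customer_records` and `my_model` are placeholders for your own data and client.)
```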

Vendor risk

Third-party AI scoring

Supplier questionnaire + automated signals produce a quantitative vendor risk score. Reviewed by Legal in ServiceNow.
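As a toy illustration, a score like that can be a weighted blend of attested controls and measured signals. The fields and weights below are assumptions for illustration, not ShadowIQ's rubric.

```python
def vendor_risk_score(questionnaire: dict[str, bool], signals: dict[str, float]) -> float:
    """Blend self-attested controls with measured signals into a 0-100 score.
    Weights are illustrative only."""
    attested = sum(questionnaire.values()) / len(questionnaire)  # fraction of controls attested
    measured = sum(signals.values()) / len(signals)              # signals already normalised to 0-1
    return round(100 * (0.4 * attested + 0.6 * measured), 1)

score = vendor_risk_score(
    questionnaire={"has_dpia": True, "model_card_published": False, "incident_process": True},
    signals={"injection_resistance": 0.94, "pii_leakage": 0.88, "uptime_slo": 0.99},
)
print(score)  # 82.9 with these inputs
```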

Custom

Your evals, your data

Upload a dataset, write a rubric, generate a score. LLM-as-judge with quorum + human spot-check.
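In spirit, a custom eval is a dataset plus a rubric plus a judge. A minimal sketch under those assumptions; answer and judge are callables you supply, not a ShadowIQ API.

```python
from statistics import mean

RUBRIC = """Score the answer 0-2:
2 = fully correct and policy-compliant, 1 = partially correct, 0 = wrong or unsafe."""

def run_custom_eval(dataset: list[dict[str, str]], answer, judge) -> float:
    """dataset rows look like {"prompt": ..., "reference": ...}.
    `answer(prompt)` is the system under test; `judge(rubric, prompt, reference, candidate)`
    returns an integer score per the rubric. Both callables are supplied by you."""
    scores = [judge(RUBRIC, row["prompt"], row["reference"], answer(row["prompt"]))
              for row in dataset]
    return mean(scores) / 2  # normalise to 0-1

# run_custom_eval(my_rows, answer=my_model, judge=my_llm_judge)  # placeholders, not an API
```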

Continuous

Scheduler + drift

Nightly runs, per-PR gating, and drift alarms tuned to your baseline. No manual re-runs.
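Drift detection can be as simple as comparing tonight's score to a rolling baseline. The sketch below is generic, and the three-sigma tolerance is an assumption, not the product default.

```python
from statistics import mean, stdev

def drift_alarm(history: list[float], tonight: float, sigmas: float = 3.0) -> bool:
    """True when tonight's score sits more than `sigmas` standard deviations
    below the rolling baseline built from prior nightly runs."""
    baseline, spread = mean(history), stdev(history)
    return tonight < baseline - sigmas * spread

history = [0.93, 0.94, 0.92, 0.95, 0.94]  # last five nightly injection-resistance scores
print(drift_alarm(history, tonight=0.83))  # True -> fire the alert
```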

Frameworks

Crosswalked to regulation

Every eval is pre-mapped to EU AI Act, NIST AI RMF, ISO 42001, and SOC 2 Trust Services Criteria.
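The crosswalk itself is a many-to-many mapping from evals to controls. The control references below are illustrative placeholders only; the shipped mapping is the authoritative one.

```python
# Illustrative placeholders only; the shipped crosswalk is the authoritative mapping.
CROSSWALK: dict[str, list[str]] = {
    "injection_resistance": ["EU AI Act Art. 15", "NIST AI RMF (MEASURE)", "SOC 2 CC7"],
    "demographic_parity":   ["EU AI Act Art. 10", "NIST AI RMF (MEASURE)", "NYC LL-144"],
    "pii_leakage":          ["EU AI Act Art. 10", "ISO/IEC 42001 Annex A", "SOC 2 CC6"],
}

def clauses_for(failed_evals: list[str]) -> dict[str, list[str]]:
    """Turn a list of failed evals into the clauses a report has to cite."""
    return {e: CROSSWALK.get(e, []) for e in failed_evals}

print(clauses_for(["demographic_parity"]))
```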

Registry

Model cards auto-generated

Pass a model through the registry; get a signed model card, DPIA draft, and OSCAL control statement.

Frequently asked

Answered by the architecture, not the sales deck.

Are eval runs reproducible?
Yes. Every eval run records seed, prompt set version, model version, and environment. Two identical runs produce byte-identical reports, and the report hash is part of the signed evidence.
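The mechanics are easy to picture: pin every input in a manifest and hash it alongside the report. A sketch of the idea, not the exact evidence format.

```python
import hashlib
import json

def evidence_hash(manifest: dict, report_bytes: bytes) -> str:
    """Hash the pinned inputs (seed, versions, environment) together with the report;
    identical inputs and identical report bytes always give the same digest."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical + report_bytes).hexdigest()

manifest = {
    "seed": 1337,
    "prompt_set": "redteam-2025.06",      # illustrative version labels
    "model": "gpt-4o-prod@2025-05-13",
    "environment": "eval-runner:1.42.0",
}
print(evidence_hash(manifest, report_bytes=b"...full report..."))
```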

Can we run our own prompt sets?
Absolutely. Upload your internal prompt set once; it becomes a versioned eval you can schedule, share, and export. Your prompts stay in your tenant and never train our models.

How do you keep LLM-as-judge scoring honest?
Quorum (≥3 judges from different families) + rubric-pinned scoring + a human-review sample each week. Every decision records which judge, which rubric, and the final seal.
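Stripped down, the quorum rule is a median across at least three judges, with wide disagreement escalated to the weekly human sample. A generic sketch, not the production scorer.

```python
from statistics import median

def quorum_score(judge_scores: dict[str, float], spread_limit: float = 0.15):
    """judge_scores holds one rubric-pinned score per judge, keyed by model family.
    Returns (score, needs_human_review); wide disagreement escalates to a human."""
    if len(judge_scores) < 3:
        raise ValueError("quorum requires at least three judges")
    scores = list(judge_scores.values())
    return median(scores), (max(scores) - min(scores)) > spread_limit

print(quorum_score({"family_a": 0.90, "family_b": 0.85, "family_c": 0.60}))
# -> (0.85, True): the outlier pushes this item into the human-review sample
```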

Do you evaluate agents, not just single models?
Yes: long-horizon tool-use agents, memory-augmented agents, and multi-agent pipelines. We trace through the agent graph and score per-hop plus end-to-end.
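Per-hop plus end-to-end scoring over an agent trace can be pictured like this; the trace shape and the scoring callables are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hop:
    tool: str     # which tool the agent called
    args: dict    # the arguments it passed
    output: str   # what came back

def score_trace(hops: list[Hop], hop_eval, end_to_end_eval) -> dict:
    """`hop_eval(hop)` scores a single tool call (e.g. for hijack or policy breaks);
    `end_to_end_eval(hops)` scores whether the overall task finished safely.
    Both callables are supplied by the evaluation harness."""
    per_hop = [hop_eval(h) for h in hops]
    return {"per_hop": per_hop, "worst_hop": min(per_hop), "end_to_end": end_to_end_eval(hops)}
```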