LLM evaluation for semantic compliance scoring

LLM-based semantic compliance scoring against operator-defined, per-vertical rule libraries. Calibrated confidence scores feed borderline-routing, audit history, and operator override learning.

The problem

Your AI page-generator agent writes supplement product descriptions. Your regex compliance pre-filter catches obvious violations like "cures cancer." But what about "supports immune function in clinically meaningful ways"? Regex cannot catch that. Your outside counsel says it is borderline structure-function vs disease-claim language.

Braintrust ($0-$5,000+/month), LangSmith ($0-$2,000+/month), Helicone, Patronus AI, Galileo, Arize AI ($1,000-$25,000+/year), Fiddler AI ($30,000-$200,000+/year), and WhyLabs evaluate generic LLM output quality (factuality, coherence, task completion). Guardrails AI ($0-$3,000+/month), NVIDIA NeMo Guardrails, Lakera ($30-$300/month + enterprise), and Robust Intelligence prevent unsafe / off-topic / harmful output. IBM watsonx.governance ($50,000-$300,000+/year), Microsoft Responsible AI Dashboard, Credo AI ($30,000-$200,000+/year), and Holistic AI provide enterprise AI-governance audit reporting. OpenAI Moderation API (free), Google Perspective API (free), Hive ($1-$50,000+/month), and Microsoft Communication Safety filter user-generated content. Anthropic Constitutional AI, Inspect (UK AISI), and HELM (Stanford CRFM) are research-focused. DIY costs $80,000-$150,000/year per ML-engineer FTE.

The gap is LLM evaluation that loads operator-defined per-vertical compliance rule libraries (FDA structure-function, FTC claim substantiation, state-AG cannabis, FINRA suitability, OSHA chemical, Prop 65) and scores every AI content output semantically with calibrated confidence scores feeding borderline-routing.

What success looks like

Per-vertical-compliance-overlay rule libraries load into the LLM-evaluator and score every AI content output semantically. Each output receives a per-rule semantic-violation confidence score (0.0 to 1.0). Operator-defined thresholds then decide the route: outputs with low violation scores auto-publish, high-confidence violations auto-block, and borderline scores (typically 0.3-0.7) go to legal / compliance review via borderline-routing.
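The threshold routing described above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the names `Thresholds` and `route` are hypothetical, and taking the highest per-rule score as the deciding value is an assumption about how multiple rules combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical operator-defined cutoffs for one rule library."""
    auto_pass_below: float = 0.3   # violation scores under this auto-publish
    auto_block_above: float = 0.7  # violation scores over this auto-block


def route(rule_scores: dict[str, float], t: Thresholds = Thresholds()) -> str:
    """Return 'publish', 'block', or 'review' for one AI content output.

    rule_scores maps a rule id to its semantic-violation confidence (0.0-1.0).
    Assumption: the worst (highest) per-rule score decides the route.
    """
    for rule, score in rule_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {rule!r} is outside 0.0-1.0: {score}")
    worst = max(rule_scores.values(), default=0.0)
    if worst > t.auto_block_above:
        return "block"
    if worst >= t.auto_pass_below:
        return "review"  # borderline band, inclusive at both ends
    return "publish"
```

With the default band, an output scoring 0.1 on every rule publishes, one scoring 0.5 on any rule goes to review, and one scoring 0.9 on any rule is blocked.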

Pre-filter-deterministic-gates handles regex / blocklist violations first. This skill catches semantic violations regex cannot detect — structure-function vs disease-claim drift, comparative claim ambiguity, state-AG-specific dosage / efficacy language, FTC-substantiation-required claim language without matching evidence.
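The ordering of the two stages can be illustrated as follows. This is a sketch under stated assumptions: `BLOCKLIST`, `prefilter`, `evaluate`, and `stub_scorer` are hypothetical names, the single regex stands in for a real blocklist, and the stub returns a fixed score where a real deployment would call an LLM evaluator.

```python
import re
from typing import Callable

# Hypothetical deterministic gate: one pattern standing in for a full blocklist.
BLOCKLIST = [re.compile(r"\bcures?\s+cancer\b", re.IGNORECASE)]


def prefilter(text: str) -> bool:
    """Return True when any deterministic pattern matches the text."""
    return any(pattern.search(text) for pattern in BLOCKLIST)


def evaluate(text: str, semantic_scorer: Callable[[str], dict[str, float]]) -> dict:
    """Run the regex gate first; call the semantic scorer only if it passes."""
    if prefilter(text):
        return {"stage": "prefilter", "blocked": True, "scores": {}}
    return {"stage": "semantic", "blocked": False, "scores": semantic_scorer(text)}


def stub_scorer(text: str) -> dict[str, float]:
    """Placeholder for the LLM evaluator; the score here is made up."""
    return {"fda_structure_function": 0.55}
```

Under these assumptions, `evaluate("This cures cancer.", stub_scorer)` stops at the pre-filter without calling the scorer, while "supports immune function in clinically meaningful ways" passes the regex and reaches the semantic stage.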

Multi-vertical operators get per-vertical eval profiles; multi-jurisdiction operators (cannabis, financial-services) get per-state / per-province eval profiles. Every score ties to versioned-history-regulatory-defense, giving an audit-defensible record for responding to regulator inquiries. Calibration follows operator override-history via fbc-override-learning. Braintrust and LangSmith remain useful for non-compliance LLM quality eval; this skill handles per-vertical regulatory compliance scoring.
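One way to picture per-vertical and per-jurisdiction eval profiles is a lookup that prefers the most specific match. The sketch below is illustrative only: the `PROFILES` table, its keys, the library names, and the review bands are all invented for the example, and falling back from a jurisdiction profile to the vertical default is an assumption.

```python
from typing import Optional

# Hypothetical profile table keyed by (vertical, jurisdiction).
# A jurisdiction of None marks the vertical-wide default.
PROFILES = {
    ("supplements", None): {"library": "fda-structure-function", "review_band": (0.3, 0.7)},
    ("cannabis", None): {"library": "state-ag-cannabis-base", "review_band": (0.3, 0.7)},
    ("cannabis", "US-CA"): {"library": "state-ag-cannabis-ca", "review_band": (0.2, 0.7)},
}


def resolve_profile(vertical: str, jurisdiction: Optional[str] = None) -> dict:
    """Return the jurisdiction-specific profile, else the vertical default."""
    for key in ((vertical, jurisdiction), (vertical, None)):
        if key in PROFILES:
            return PROFILES[key]
    raise KeyError(f"no eval profile configured for vertical {vertical!r}")
```

Here `resolve_profile("cannabis", "US-CA")` returns the California profile, while a state with no dedicated entry falls back to the cannabis default.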

How most operators solve this today

Six tiers of incumbent tools — none load operator-defined per-vertical regulatory rule libraries and score every AI content output semantically.

  • LLM evaluation platforms (Braintrust, LangSmith, Helicone, Patronus AI, Galileo, Arize AI, Fiddler AI, WhyLabs)

    $0-$200,000+/year

    Evaluate generic LLM output quality (factuality, coherence, task completion). Per-vertical compliance rule libraries require custom configuration.

  • LLM guardrails (Guardrails AI, NVIDIA NeMo Guardrails, Lakera, Robust Intelligence)

    $0-$3,000+/month + enterprise

    Prevent unsafe / off-topic / harmful output. Built for general AI safety; not per-vertical regulatory.

  • AI-governance suites (IBM watsonx.governance, Microsoft Responsible AI Dashboard, Credo AI, Holistic AI)

    $30,000-$300,000+/year

    Enterprise AI governance plus audit reporting; not output-time per-content compliance scoring.

  • Content moderation APIs (OpenAI Moderation, Google Perspective, Hive, Microsoft Communication Safety)

    Free-$50,000+/month

    Filter user-generated content (hate speech, harassment, sexual content). Not operator-specific regulatory compliance for brand-produced content.

  • AI-safety research tools (Anthropic Constitutional AI, Inspect / UK AISI, HELM / Stanford CRFM)

    Research / academic

    Research-focused; not production-deployable per-vertical compliance.

  • DIY (custom prompt eval scripts + manual brand-manager review + Python pytest suites)

    $80,000-$150,000/year per ML-engineer FTE

    Per-vertical rule libraries are built from scratch. Maintenance for API drift consumes roughly one-third of an FTE's time.

What changes when this is an agent skill

The Completions llm-semantic-compliance-scoring skill loads operator-defined per-vertical compliance rule libraries (FDA structure-function, FTC claim substantiation, state-AG cannabis, FINRA suitability, OSHA chemical, Prop 65, EU Cosmetic Regulation) and scores every AI content output semantically.

Each output receives per-rule semantic-violation confidence scores (0.0 to 1.0). Operator-defined thresholds determine pass / fail / borderline routing. Pre-filter-deterministic-gates handles regex / blocklist first; this skill catches semantic violations regex cannot detect.

Borderline scores route to borderline-routing (loop 004) for legal / compliance review. High-confidence violations auto-block; high-confidence passes auto-publish. Multi-vertical operators get per-vertical eval profiles; multi-jurisdiction operators get per-state / per-province profiles.

Every score ties to versioned-history-regulatory-defense (loop 009) for audit-defensible regulator-inquiry response. Calibration follows operator override-history via fbc-override-learning. Braintrust and LangSmith remain useful for non-compliance LLM quality eval; this skill handles per-vertical regulatory compliance scoring.
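Tying a score to versioned history could look like the record below. This is a hedged sketch, not the format versioned-history-regulatory-defense actually stores: the function name `audit_record` and every field name are assumptions, chosen to show that a record can pin the exact content, the rule-library version, and the decision together.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(text: str, scores: dict[str, float], decision: str,
                 library_version: str,
                 scored_at: Optional[datetime] = None) -> dict:
    """Build a tamper-evident record for one scored output (hypothetical schema)."""
    record = {
        # Hash of the exact content that was scored.
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Version of the rule library in force when the score was produced.
        "library_version": library_version,
        "scores": scores,
        "decision": decision,
        "scored_at": (scored_at or datetime.now(timezone.utc)).isoformat(),
    }
    # Hash the canonical JSON form so any later edit to the record is detectable.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_sha256"] = hashlib.sha256(payload).hexdigest()
    return record
```

Because the rule-library version is part of the hashed payload, the same content scored under two library versions yields two distinct records.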

Agents that include this skill

Skills live inside agent rentals. To get this skill in production, hire any of the agents below — context-tuning at onboarding is included in the first month.

FAQ

What is LLM evaluation for compliance scoring?
Using a language model to semantically score AI content output against operator-defined per-vertical compliance rule libraries (FDA, FTC, state-AG, FINRA, OSHA, Prop 65). Returns calibrated confidence scores per rule; borderline scores route to legal / compliance review.
How is this different from Braintrust or LangSmith (LLM evaluation)?
Generic LLM-eval frameworks evaluate LLM output quality (factuality, coherence, task completion). This skill evaluates AI content against operator-specific per-vertical regulatory rule libraries.
How is this different from Guardrails AI or NVIDIA NeMo Guardrails?
Guardrails tools prevent unsafe / off-topic / harmful output (general AI safety). This skill scores AI content against per-vertical regulatory compliance specifically.
How is this different from IBM watsonx.governance or Microsoft Responsible AI Dashboard?
Enterprise AI-governance suites provide audit reporting and policy documentation. This skill provides output-time per-content compliance scoring.
How is this different from OpenAI Moderation API or Google Perspective?
Content moderation APIs filter user-generated content for hate speech, harassment, and sexual content. This skill scores brand-produced AI content against operator-specific regulatory compliance rules.
What rule libraries does the skill score against?
FDA structure-function (supplements + cosmetics), FTC claim substantiation (all verticals), state-AG cannabis restrictions, FINRA suitability (financial), OSHA chemical labeling, FCC RF compliance (electronics), California Prop 65, EU Cosmetic Regulation, plus operator-defined custom libraries.
How does this compose with pre-filter-deterministic-gates and borderline-routing?
Pre-filter-deterministic-gates catches regex / blocklist violations first. This skill catches semantic violations regex cannot detect. Borderline scores route to borderline-routing for human review.
What confidence threshold should I set for borderline routing?
Thresholds are operator-tunable per rule and per vertical. Typical defaults: scores below 0.3 auto-pass, scores above 0.7 auto-block, and scores from 0.3 to 0.7 go to borderline review. Calibration follows operator override-history (composes with fbc-override-learning).
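As an illustration of how override history might move these thresholds, the sketch below widens the review band whenever an operator reversed an automated decision. This is one simple approach chosen for the example and is not a description of how fbc-override-learning works; the function name `recalibrate` and the override tuple layout are assumptions.

```python
def recalibrate(band: tuple[float, float],
                overrides: list[tuple[float, str, str]]) -> tuple[float, float]:
    """Widen the borderline band to cover scores where operators disagreed.

    band is (auto_pass_below, auto_block_above).
    Each override is (score, automated_decision, operator_decision).
    """
    low, high = band
    for score, automated, operator in overrides:
        if automated == operator:
            continue  # operator agreed, nothing to learn from this item
        if automated == "publish":
            low = min(low, score)    # an auto-pass was reversed: review lower scores
        elif automated == "block":
            high = max(high, score)  # an auto-block was reversed: review higher scores
    return (low, high)
```

For example, a reversed auto-pass at 0.25 and a reversed auto-block at 0.8 would move the default band from (0.3, 0.7) to (0.25, 0.8), sending similar outputs to human review in future.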
