For brand directors + AI platform leadership + regulatory counsel
Profanity filters block profanity. The phrases your brand actually needs blocked are competitor names, off-brand language, and regulated claims.
Spectrum Labs, Hive Moderation, Two Hat, Sift, OpenAI Moderation API, Google Perspective API ship the universal-category classifier — profanity, toxicity, hate speech, slurs. None of them know that the spa banner does not want three specific competitor names in any output. None of them know that the cannabis dispensary cannot mention specific medical-treatment claims. The per-brand maintained phrase library is operator-side wiring on top of the moderation primitive.
What this gets you
- Six-category phrase taxonomy maintained per brand: competitor names + off-brand language + regulated claims + profanity + slurs + trademark-protected phrases. Each phrase carries category + severity + per-language variant + per-brand applicability scope.
- Contextual LLM classification on top of the universal-category moderation API — phrases evaluated for violating-context vs legitimate-context use; tuned classifier-confidence thresholds per category reduce false positives without missing true violations.
- Multi-language coverage with per-language NLP — English + Spanish + French + German + Portuguese baseline; per-brand maintenance for any additional language the operator footprint requires.
- Integration with the 6-axis brand-consistency control plane — the Block axis sits alongside Version + Author + Substantiate + Extract + Gate. The runtime Gate consumes the forbidden-phrase library on every AI output across all 13 consumer agents.
- Editorial review queue + per-franchisee override controls — flagged outputs route to brand-team reviewers; decisions feed classifier-confidence tuning per phrase category; per-franchisee overrides surface to brand for audit without requiring approval.
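As a rough illustration of what a single library entry could look like, here is a minimal Python sketch of the fields named above (category, severity, per-language variants, applicability scope). The class and field names (`PhraseEntry`, `banners`, `forms_for`) and the three severity levels are assumptions made for this example, not a published schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    # The six categories listed in the taxonomy above.
    COMPETITOR_NAME = "competitor_name"
    OFF_BRAND = "off_brand"
    REGULATED_CLAIM = "regulated_claim"
    PROFANITY = "profanity"
    SLUR = "slur"
    TRADEMARK = "trademark"


class Severity(Enum):
    # Hypothetical three-level scale; the source does not define the levels.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class PhraseEntry:
    phrase: str                      # canonical form of the phrase
    category: Category
    severity: Severity
    # language code -> alternate spellings or translations of the phrase
    variants: dict[str, tuple[str, ...]] = field(default_factory=dict)
    # banners the entry applies to; empty means the whole parent brand
    banners: frozenset[str] = frozenset()

    def applies_to(self, banner: str) -> bool:
        """True when the entry is brand-wide or scoped to this banner."""
        return not self.banners or banner in self.banners

    def forms_for(self, language: str) -> tuple[str, ...]:
        """Canonical phrase plus any variants maintained for the language."""
        return (self.phrase, *self.variants.get(language, ()))
```

An entry scoped to the spa banner with a Spanish variant would then answer `applies_to("spa")` with `True` and `applies_to("gym")` with `False`.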
The competitor name slips through three times a week
A multi-banner operator runs a spa banner, a gym banner, and a med-spa banner. Each banner has three or four specific local competitors the brand team has flagged as off-limits for any AI-generated customer-facing content. The reasoning varies — ongoing legal dispute with one competitor, recent acquisition negotiations that fell apart with another, a reputational concern with a third where any mention boosts their visibility. The brand team published a memo. The marketing team filed the memo. The AI agents that generate review responses, social posts, ad copy, and product descriptions know nothing about it.
The review-response agent drafts a reply to a customer review that mentioned the customer’s prior visit to one of the flagged competitors. The reply acknowledges the prior visit by name. The reply publishes. Three days later the brand director sees the post during a routine Google Business Profile audit. The reply is now indexed by Google. Removing it does not undo the search-result visibility. The brand director sends another memo.
The social-publishing agent drafts a post that compares the operator service to a generic competitor offering. The comparison is favorable to the operator but the post names a specific competitor the brand team has flagged. The post publishes during a high-engagement window. The competitor screenshots the post and uses it in their own marketing as “even [operator] thinks we are worth comparing to”. The brand director sends another memo.
A maintained forbidden-phrase library closes the gap. The brand team adds the three competitor names to the spa-banner library with the competitor category tag and the appropriate severity. The library propagates to the runtime brand-voice gate. The next AI output that would have named one of those competitors gets evaluated against the library at sub-100ms latency. The contextual classifier scores the use as violating rather than legitimate (the customer-visit reference gets scored differently from a comparison post). Hard violations block; borderline scores route to the editorial review queue.
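The block / review / allow routing described here can be sketched in a few lines of Python. The threshold values, category keys, and the `decide` function name below are placeholders chosen for illustration; real thresholds would come out of per-category tuning.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "editorial-queue"
    BLOCK = "block"


# Hypothetical per-category thresholds as (review_at, block_at).
# These numbers are illustrative only.
THRESHOLDS = {
    "competitor_name": (0.40, 0.80),
    "regulated_claim": (0.30, 0.70),
}
DEFAULT_THRESHOLDS = (0.50, 0.90)


def decide(category: str, violation_score: float) -> Decision:
    """Map a classifier score in [0, 1] to a gate decision.

    Scores at or above block_at are hard violations; scores between
    review_at and block_at are borderline and go to the editorial queue.
    """
    review_at, block_at = THRESHOLDS.get(category, DEFAULT_THRESHOLDS)
    if violation_score >= block_at:
        return Decision.BLOCK
    if violation_score >= review_at:
        return Decision.REVIEW
    return Decision.ALLOW
```

Under these placeholder values, a competitor-name match scored 0.95 blocks, one scored 0.55 routes to review, and one scored 0.10 publishes.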
What is in market — and what each category leaves to you
The content-moderation classifier primitive is mature. The per-brand phrase library plus contextual classification layer is operator-side architecture.
Enterprise content moderation — Spectrum Labs, Hive Moderation, Two Hat (Microsoft), Sift, Bodyguard.ai, Sightengine, WebPurify, Modulate
Excellent at universal-category content moderation with multi-modal coverage (text, image, audio, video). The per-brand phrase library (the actual competitor names, off-brand phrases, regulated claims the operator wants blocked), the per-vertical regulated-claim categories tied to HIPAA + FDA + FINRA + cannabis-state libraries, and the integration with the 6-axis brand-consistency control plane are operator-side wiring on top of the moderation primitive.
AI / LLM content moderation — OpenAI Moderation API, Anthropic Claude Moderation, Google Perspective API, Azure Content Safety, AWS Comprehend Toxicity
Strong at LLM-output category scoring (hate speech, self-harm, sexual content, violence, toxicity). The per-brand custom-category extension (competitor names as a brand-specific category not in the default taxonomy), the contextual classification tuned per category, and the per-language phrase-variant maintenance are operator-side architecture.
Brand safety adjacency — DoubleVerify, Integral Ad Science, MOAT, Pixalate, Channel Factory, Zefr
Strong at ad-placement brand-safety verification — ensuring paid ads do not appear adjacent to unsafe content. Adjacent to AI-output content moderation but a different surface; the per-brand forbidden-phrase library applies to AI-generated output before it ships, not to ad-placement verification after.
Open-source profanity filters — profanity-check (Python), Better Profanity (Python), bad-words (JS), Detoxify (multi-language)
Capable of basic profanity blocking on the operator side at zero licensing cost. Contextual classification, per-brand maintained competitor libraries, per-vertical regulated-claim libraries, and integration with the brand-voice runtime gate are the operator build above the library primitive.
The Slack message that says please do not mention Competitor X
The status quo at most multi-brand operators. The brand director sends a Slack message or an email memo to the marketing team. The team files the message. The AI agents do not subscribe to Slack or email. The AI output that mentions Competitor X publishes. The next memo gets sent after the next incident.
The pipeline, end to end
- Position in the 6-axis brand-consistency control plane. Version + Author + Block + Substantiate + Extract + Gate. The Block axis (this skill) maintains the negative-list rule library. The Gate axis consumes Block at runtime across all 13 consumer agents.
- Six-category phrase taxonomy. Competitor names (per-banner competitive set), off-brand language (phrases that drift from voice spec), regulated claims (per-vertical claims that require substantiation under HIPAA + FDA + FINRA + cannabis-state law), profanity (universal-category baseline), slurs (universal-category baseline), trademark-protected phrases (operator + competitor trademarks).
- Per-brand × per-banner × per-language schema. The library is multi-dimensional. Each phrase has a brand applicability (parent brand vs banner-specific), a per-banner override set, a per-language variant set, a severity classification, and a category tag. Multi-banner operators see per-banner phrase libraries that share the parent-brand baseline plus per-banner additions.
- Phrase-extraction maintenance pipeline. Competitor names ingest from the operator competitive-intelligence system. Regulated-claim phrases ingest from the source regulatory documents using the same rule-extraction-from-source-docs pipeline that feeds the master-record validation libraries and the marketing-compliance overlay. Off-brand phrases ingest from manual brand-team additions plus extracted voice-drift signals from the Extract axis.
- Universal-category moderation primitive. Profanity, slurs, hate speech, toxicity, sexual content evaluate via the operator-licensed content-moderation API (Spectrum Labs, Hive Moderation, OpenAI Moderation, Google Perspective). The operator wires the platform of choice; the orchestration treats it as a pluggable primitive.
- Contextual classification on top of the universal primitive. Phrases in the per-brand library evaluate via LLM classification that scores whether the phrase appears in a violating context or a legitimate context. The classifier output combined with the phrase-library lookup produces the block decision. Classifier confidence thresholds are tuned per phrase category to trade off false positives against false negatives.
- Brand-voice runtime gate integration. The Gate axis (covered by the brand-voice-management pillar) consumes the forbidden-phrase library on every AI output across all 13 consumer agents at sub-100ms latency. The Block evaluation runs as one of the rule libraries in the Gate evaluation chain alongside Substantiate + Voice-attribute + per-banner override + per-vertical override + per-location override.
- Editorial review queue for flagged outputs. Borderline classifier scores route to a brand-team reviewer queue with the candidate output, the matched phrase, the category tag, and the classifier score surfaced. Reviewers approve or reject; decisions feed classifier-confidence-threshold tuning per phrase category per cycle.
- Per-franchisee override controls. Franchisees who believe a block is incorrect for their territory (the spa banner Phoenix franchisee says the flagged competitor closed last quarter) submit an override request. The override surfaces to brand for audit but does not require approval: the franchisee owns the territory operating signal within the corporate envelope. Override usage is tracked per phrase per franchisee.
- Multi-language variant maintenance. Each forbidden phrase carries per-language variants. English + Spanish + French + German + Portuguese baseline maintenance; per-operator additional language support based on footprint. Per-language NLP tooling handles morphology + acronym expansion + slang + paraphrase detection.
- Audit log + override workflow. Every block event stores the candidate output, the matched phrase, the category, the severity, the classifier score, the decision (block / allow / editorial-queue), the reviewer (if escalated), and the final outcome. Audit trail is queryable per franchise + per phrase category + per time period for brand-protection and regulator response.
- Quality measurement (precision + recall + per-brand false-positive rate). Per-brand per-category precision, recall, and false-positive rate measured against periodic editorial spot-checks. The signal feeds classifier tuning and phrase-library maintenance prioritization per cycle. Performance latency budget tracked separately to ensure the Gate stays under the sub-100ms target.
- Integration with the compliance-mechanic cluster. The regulated-claim phrase category shares substrate with the per-vertical compliance libraries (per-vertical-schema-validation on master-record + per-jurisdiction-compliance on citation + catalog-per-vertical-schema-validation on catalog + marketing-compliance overlay across the AI-agent fleet). The Block axis is one of the consumers of the regulatory rule libraries the compliance-overlay-manager agent maintains.
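One way to picture the per-brand × per-banner layering and the per-franchisee override step from the list above is a small resolution function. This is a sketch under assumed data shapes: the phrase strings, banner keys, franchisee ID, and the `effective_library` name are all hypothetical.

```python
# Hypothetical in-memory stores; a real system would load these from
# the maintained library rather than hard-code them.
PARENT_BASELINE = {"miracle cure"}
BANNER_ADDITIONS = {
    "spa": {"rival spa", "other spa"},
    "gym": {"rival gym"},
}
# (franchisee, phrase) pairs a franchisee has overridden for its territory.
FRANCHISEE_OVERRIDES = {("spa-phoenix", "other spa")}


def effective_library(banner, franchisee=None):
    """Resolve the phrases enforced for one banner and franchisee.

    Returns (active, overridden): `active` is the parent baseline plus
    banner additions minus franchisee overrides; `overridden` lists the
    skipped phrases so they can be surfaced to brand for audit.
    """
    library = PARENT_BASELINE | BANNER_ADDITIONS.get(banner, set())
    overridden = {
        phrase for phrase in library
        if (franchisee, phrase) in FRANCHISEE_OVERRIDES
    }
    return library - overridden, overridden
```

Returning the overridden set alongside the active set keeps the override visible to the audit log without making it a blocking approval step.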
Frequently asked
What is a forbidden-phrase library?
A forbidden-phrase library is a maintained list of phrases the operator wants blocked from appearing in any customer-facing output. The list combines six categories — profanity, slurs, competitor names the brand does not want to mention, off-brand language that drifts from the voice spec, regulated claims that require substantiation the operator does not have, and trademark-protected phrases. Each phrase carries a category tag, a severity classification, a per-language variant set, and a per-brand applicability scope. The library is consumed by the runtime brand-voice gate that intercepts every AI-generated output before publish.
How is this different from Spectrum Labs, Hive Moderation, Two Hat, Sift, OpenAI Moderation API, or Google Perspective API?
Those platforms ship the content-moderation primitive — the classifier that evaluates a piece of text against profanity, toxicity, hate-speech, and slur categories. They are excellent at the underlying classifier layer. The per-brand competitor-name library (the actual competitor brands the operator does not want named), the per-brand off-brand-language library (the phrases that drift from the operator voice spec), the per-vertical regulated-claim library (the claims that require substantiation under HIPAA, FDA, FINRA, or cannabis-state law), the per-language variant maintenance, and the integration with the broader 6-axis brand-consistency control plane are operator-side architecture on top of the moderation classifier.
Why does a generic content-moderation API fail multi-brand operators?
Generic content-moderation APIs classify against universal categories: is this profanity, is this hate speech, is this toxicity. They do not know that the spa banner does not want the name of three specific local-competitor spas to appear in any output. They do not know that the cannabis-dispensary banner cannot mention specific medical-treatment claims. They do not know that the operator brand voice forbids superlatives. The per-brand phrase library and the contextual classification on top of the universal moderation layer are the brand-specific extension that operators build for themselves or buy through orchestration.
What is contextual classification and why does it matter for forbidden-phrase blocking?
Naive string-matching blocks any output that contains the literal phrase. That approach produces unacceptable false-positive rates. A competitor name might appear in a legitimate review-response context ("we appreciate that you previously visited [competitor name]"). A regulated-claim phrase might appear in a substantiated context ("our peer-reviewed study published in [journal] demonstrated [claim]"). Contextual classification uses LLM-based evaluation to score whether the phrase appears in a violating context or a legitimate context. The classifier output combined with the phrase-library lookup produces the block decision. Operators tune the classifier confidence threshold per phrase category.
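A minimal Python sketch of the lookup-plus-classifier combination follows. The `score_context` callback stands in for the LLM call; `stub_scorer` is a keyword heuristic used only so the example runs offline, and every function name here is an assumption for illustration.

```python
import re
from typing import Callable


def find_matches(text: str, phrases: list[str]) -> list[str]:
    """Phrase-library lookup: case-insensitive, whole-phrase matches only."""
    hits = []
    for phrase in phrases:
        # Lookarounds stop the phrase matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


def should_block(
    text: str,
    phrases: list[str],
    score_context: Callable[[str, str], float],
    threshold: float,
) -> bool:
    """Block only if a matched phrase is also scored as violating context."""
    return any(
        score_context(text, phrase) >= threshold
        for phrase in find_matches(text, phrases)
    )


def stub_scorer(text: str, phrase: str) -> float:
    # Placeholder for an LLM classifier; NOT a real contextual model.
    return 0.9 if "better than" in text.lower() else 0.1
```

With `["Rival Spa"]` as the library, both "We appreciate that you previously visited Rival Spa." and "Our facials are better than Rival Spa." match the lookup, but only the second crosses a 0.5 threshold under the stub scorer. Naive string matching would have blocked both.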
How does the Block axis fit into the 6-axis brand-consistency control plane?
The brand-spec-authoring agent owns six axes — Version (PR-style spec versioning, covered by /franchise-brand-portal), Author (structured-spec authoring), Block (this skill — forbidden-phrase library), Substantiate (claims-allowlist with evidence mapping), Extract (LLM-derive voice from corpus, covered by /voice-attribute-extraction), and Gate (runtime cross-agent enforcement, covered by /brand-voice-management). The forbidden-phrase library is one of the rule libraries the runtime Gate consumes. Block contributes the negative-list capability; Substantiate contributes the positive-claim capability; the two together define what AI output can and cannot say at the brand level.
How do you handle false positives in the forbidden-phrase library?
False positives are real. Three mechanisms defend against them. First, contextual classification reduces false positives at the inference layer: the LLM evaluates whether the phrase appears in a violating context before blocking. Second, an editorial review queue collects flagged outputs and routes them to brand-team reviewers who approve or reject; the reviewer decisions feed classifier-confidence-threshold tuning per phrase category. Third, per-franchisee override controls let franchisees flag specific blocks as incorrect for their territory; per-franchisee overrides surface to brand for audit but do not require approval (the franchisee owns their territory operating signal within the corporate envelope).
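To make the second mechanism concrete, here is one simple way reviewer decisions could feed threshold tuning: pick the lowest candidate threshold whose precision on reviewed outputs meets a target. This selection rule, the function names, and the data shape (a list of `(score, is_violation)` pairs for one phrase category) are assumptions for illustration, not the tuning method the platform necessarily uses.

```python
def precision_at(threshold, reviewed):
    """Share of outputs flagged at this threshold that reviewers confirmed.

    `reviewed` is a list of (classifier_score, is_violation) pairs.
    Returns 1.0 when nothing is flagged (no false positives possible).
    """
    flagged = [is_violation for score, is_violation in reviewed
               if score >= threshold]
    return sum(flagged) / len(flagged) if flagged else 1.0


def tune_threshold(reviewed, target_precision, candidates):
    """Lowest candidate threshold that meets the precision target.

    Choosing the lowest qualifying value keeps as many true violations
    flagged as the target allows. Falls back to the strictest candidate
    if none qualifies. `candidates` must be non-empty.
    """
    for threshold in sorted(candidates):
        if precision_at(threshold, reviewed) >= target_precision:
            return threshold
    return max(candidates)
```

For example, given six reviewed outputs scored 0.95, 0.85, 0.65 (confirmed violations) and 0.75, 0.55, 0.45 (rejected flags), a 0.75 precision target selects a threshold of 0.6 from the candidates 0.4 through 0.9, while a 0.9 target pushes it up to 0.8.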
Hire the agent that owns the Block axis
The brand-spec-authoring agent owns the 6-axis brand-consistency control plane (Version + Author + Block + Substantiate + Extract + Gate), sitting on top of whichever content-moderation primitive (Spectrum Labs, Hive Moderation, Two Hat, Sift, OpenAI Moderation API, Google Perspective API, Azure Content Safety) you license downstream. Per-brand phrase library maintained across six categories, contextual LLM classification on top of the universal moderation, multi-language coverage, editorial review queue, per-franchisee override controls, regulator-grade audit trail.
We scope on the call and send a private checkout link after.
Related reading: Brand-voice runtime gate · Brand voice extraction · Franchise brand portal