"AI moderation" is the buzzword on every chat platform's pricing page right now. Most teams enable it on day one because it sounds responsible. Most of them shouldn't.
Here's the honest version of what AI moderation does, where it shines, and where it's overkill.
The classifier model under the hood
Practically every "AI moderation" feature in chat tools today (ours included) runs each message through a classifier: usually OpenAI's moderation endpoint with an omni-moderation model, sometimes the Perspective API from Jigsaw. The classifier returns a score between 0 and 1 for each category. OpenAI's categories include:
- Harassment — targeted insults, name-calling, threats
- Hate — slurs, dehumanization based on identity
- Sexual — explicit sexual content
- Sexual/minors — sexual content involving children (legal nightmare territory)
- Self-harm — content encouraging or describing self-harm
- Violence — content depicting or threatening violence
- Violence/graphic — gore, explicit injury
The chat platform sets a threshold per category. Score above threshold → message gets redacted, hidden, or flagged for review.
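The per-category threshold check can be sketched in a few lines. This is a minimal illustration, not any platform's actual code: the threshold values, the `THRESHOLDS` dict, and the function names are assumptions made up for the example, and the category names simply mirror the list above.

```python
from typing import Dict, List

# Hypothetical thresholds; real values depend on your community.
THRESHOLDS: Dict[str, float] = {
    "harassment": 0.85,
    "hate": 0.85,
    "sexual/minors": 0.50,  # assumed stricter here, purely for illustration
    "violence": 0.85,
}

def flagged_categories(scores: Dict[str, float],
                       thresholds: Dict[str, float]) -> List[str]:
    """Return the categories whose score is strictly above its threshold."""
    return sorted(
        category
        for category, limit in thresholds.items()
        if scores.get(category, 0.0) > limit
    )

def action_for(scores: Dict[str, float],
               thresholds: Dict[str, float]) -> str:
    """Redact when any category is flagged, otherwise allow the message."""
    return "redact" if flagged_categories(scores, thresholds) else "allow"
```

A platform might map a flagged result to hide or queue-for-review instead of redact; the comparison itself stays the same.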
What it actually catches well
Three categories that AI moderation crushes vs. banned-words:
- Creative slur spelling. N1gg3r, fa66ot, etc. The classifier reads context, not strings.
- Hate without slurs. "[Identity group] should all be deported" doesn't trigger banned-words but lights up the hate category.
- Threats. "I know where you live and I'll come" doesn't have a banned word in it; classifier flags it as harassment.
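To see why a banned-words list misses these cases, here is a rough sketch of a typical exact-match filter. The function name and the placeholder term `badword` are invented for the example; a real list would hold actual slurs.

```python
import re
from typing import Iterable

def hits_banned_word(message: str, banned: Iterable[str]) -> bool:
    """True if any banned term appears as a whole word (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return any(term.lower() in words for term in banned)

BANNED = ["badword"]  # placeholder standing in for a real slur list

print(hits_banned_word("you are a badword", BANNED))  # True: exact match
print(hits_banned_word("you are a b4dw0rd", BANNED))  # False: creative spelling slips through
print(hits_banned_word("I know where you live and I'll come", BANNED))  # False: no listed word at all
```

Adding every spelling variant to the list does not scale, which is the gap a classifier is meant to cover.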
What it misses
- Subtle harassment. "Lol look who showed up again" said for the tenth time to the same user — classifier sees benign string, room knows it's bullying.
- In-jokes and irony. A radio chat where regulars roast each other ("you absolute melt") looks aggressive to a classifier trained on Reddit. False positives.
- Cultural / linguistic edge cases. Same word with very different valence across communities ("c**t" is endearment in some UK + AU contexts, slur in US). Classifiers trained on global English split the difference and get it wrong both ways.
- Image content. In most chat integrations the classifier only sees message text. A profile photo of something nasty is not caught unless images are sent for moderation too.
- Coordinated brigading. 50 visitors arriving at once saying "lol" isn't caught by any per-message classifier. You need rate-limit / slow-mode + IP signals.
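Brigading is a volume signal rather than a content signal, so it needs a counter instead of a classifier. Below is a minimal sliding-window sketch, assuming you can timestamp each new arrival or message; the class name and limits are hypothetical.

```python
from collections import deque
from typing import Deque

class BurstDetector:
    """Flags a burst when more than max_events occur within window_seconds."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one event; return True if the window is now over the limit."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.max_events

detector = BurstDetector(max_events=3, window_seconds=10)
print([detector.record(t) for t in (0, 1, 2, 3)])  # [False, False, False, True]
```

When the detector fires, a room could tighten slow mode or hold new visitors' messages for a while; each individual "lol" still looks harmless on its own.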
The right way to layer it
For most community chats, the optimal stack is:
- Slow mode (always on) — cooldown between posts per user. Crushes spam and most brigades.
- Banned-words list (short, 10–30 entries) — the obvious slurs + a few project-specific patterns.
- Visitor mute (let regulars hide each other) — handles in-room interpersonal stuff without admin intervention.
- AI moderation (Pro feature, threshold tuned for your community) — catches the creative spellings + threats banned-words misses.
Skip the AI moderation layer if you have fewer than ~50 active visitors per peak hour. In a room that small, the false-positive cost (regulars getting their banter redacted) outweighs the catches.
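The layering can be sketched as a single check that runs the cheap filters first and only calls the classifier for busy rooms. Everything here is an assumption for illustration: the `ChatModerator` class, the `classify` callable (standing in for whatever moderation API you use, returning the highest category score), and the return strings. Visitor mute is left out because it is a per-viewer preference rather than a server-side filter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

MIN_PEAK_VISITORS = 50  # rule of thumb from the text, not a hard limit

@dataclass
class ChatModerator:
    cooldown_seconds: float
    banned_words: Set[str]
    peak_visitors_per_hour: int
    classify: Callable[[str], float]  # hypothetical: highest category score, 0..1
    threshold: float = 0.85
    last_post: Dict[str, float] = field(default_factory=dict)

    def check(self, user: str, text: str, now: float) -> str:
        # Layer 1: slow mode, a per-user cooldown between posts.
        previous = self.last_post.get(user)
        if previous is not None and now - previous < self.cooldown_seconds:
            return "rejected: slow mode"
        self.last_post[user] = now

        # Layer 2: short banned-words list, exact word match.
        if set(text.lower().split()) & self.banned_words:
            return "redacted: banned word"

        # Layer 3: classifier, only when the room is busy enough to justify it.
        if (self.peak_visitors_per_hour >= MIN_PEAK_VISITORS
                and self.classify(text) > self.threshold):
            return "redacted: classifier"
        return "allowed"
```

Ordering the layers this way means most messages never reach the classifier, which keeps both cost and false positives down.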
Tuning thresholds: start strict, loosen
Most platforms default the AI moderation threshold to around 0.85 (a category score above 0.85 is required to redact). For a small, well-known community where banter is the point, raise it to 0.93 so fewer messages get redacted. For an open Shopify storefront with anonymous visitors, lower it to 0.75: false positives matter less than letting a slur through on your product page.
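One way to choose between those values is to score a handful of real messages from your room, label them by hand, and count the mistakes at each threshold. The sample scores and labels below are invented purely to show the trade-off; they are not measurements.

```python
from typing import List, Tuple

def count_errors(samples: List[Tuple[float, bool]],
                 threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) at the given threshold.

    Each sample is (classifier_score, is_actually_abusive).
    """
    false_pos = sum(1 for score, abusive in samples
                    if score > threshold and not abusive)
    false_neg = sum(1 for score, abusive in samples
                    if score <= threshold and abusive)
    return false_pos, false_neg

# Made-up labelled samples: (score, is_actually_abusive).
SAMPLES = [(0.97, True), (0.90, False), (0.80, True), (0.78, False), (0.40, False)]

for threshold in (0.75, 0.85, 0.93):
    print(threshold, count_errors(SAMPLES, threshold))
# 0.75 (2, 0)
# 0.85 (1, 1)
# 0.93 (0, 1)
```

Lowering the threshold trades missed abuse for redacted banter, and raising it does the reverse; which error is cheaper depends on the room.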
What this looks like in our panel
Embedded Chat's AI moderation (Pro tier) wires OpenAI's omni-moderation in three clicks: enable + pick the categories you care about + set the threshold. Flagged messages auto-redact to ****; the original is kept for admin review. The Moderation panel has a live tester so you can paste a sample message and see exactly what visitors would see after moderation runs.
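The redact-but-keep-the-original behaviour can be modelled roughly as below. This is a sketch of the idea only, not Embedded Chat's actual implementation; `StoredMessage` and `apply_moderation` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

REDACTED = "****"

@dataclass(frozen=True)
class StoredMessage:
    visible_text: str             # what visitors see
    original_text: Optional[str]  # kept for admin review, only when redacted

def apply_moderation(text: str, flagged: bool) -> StoredMessage:
    """Hide flagged text from visitors while preserving it for admins."""
    if flagged:
        return StoredMessage(visible_text=REDACTED, original_text=text)
    return StoredMessage(visible_text=text, original_text=None)

print(apply_moderation("some flagged text", flagged=True).visible_text)  # ****
```

Keeping the original alongside the redaction is what lets an admin reverse a false positive later.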
The one thing AI moderation will never replace
Human judgment for context. The classifier doesn't know that your "regulars" call each other names affectionately, that your DJ has a stage name shaped like a slur, or that the question about a celebrity's death is about a real news event, not threat glorification. Those calls have to come from somebody who knows your room.
AI moderation is a force multiplier for that person, not a replacement.