Moderation Models

Protect your users and platform with AI-powered content moderation. These models detect harmful, inappropriate, or policy-violating content.

Available Models

Llama Guard 3 8B

Meta’s efficient safety model for content moderation. FREE via Groq.
Specification    Value
Provider         Meta (via Groq)
Base Model       Llama 3 8B
Categories       11 safety categories
Price            FREE
Latency          ~100ms
Best for:
  • All content moderation use cases
  • Production safety systems
  • Real-time filtering
  • Cost-free deployment
response = client.moderations.create(
    model="llama-guard-3-8b",
    input="Content to moderate"
)

if response.results[0].flagged:
    print("Content violates policy")
Free Model: This is our recommended model for content moderation. It offers excellent accuracy at zero cost via Groq’s free inference tier.

Llama Guard 3

Meta’s latest safety model built on Llama 3, offering the best accuracy for content moderation.
Specification    Value
Provider         Meta
Base Model       Llama 3
Categories       11 safety categories
Price            $0.20 / million tokens
Latency          ~150ms
Best for:
  • High-accuracy requirements
  • Comprehensive category detection
  • Production safety systems
  • Regulatory compliance
response = client.moderations.create(
    model="llama-guard-3",
    input="Content to moderate"
)

if response.results[0].flagged:
    print("Content violates policy")
Detected Categories:
  • Hate speech and discrimination
  • Harassment and bullying
  • Violence and threats
  • Self-harm content
  • Sexual content
  • Illegal activities
  • Personal information exposure
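To act on individual categories rather than the overall flag, you can inspect the per-category results. A minimal sketch, assuming categories is returned as a plain dict of booleans (as in the Response Format section below):
response = client.moderations.create(
    model="llama-guard-3",
    input="Content to moderate"
)
result = response.results[0]

if result.flagged:
    # Collect the names of every category the model marked as violated
    violated = [name for name, hit in result.categories.items() if hit]
    print("Policy violations:", ", ".join(violated))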

ShieldGemma

Google’s efficient safety model optimized for speed and cost.
Specification    Value
Provider         Google
Base Model       Gemma
Categories       8 safety categories
Price            $0.15 / million tokens
Latency          ~100ms
Best for:
  • Cost-sensitive applications
  • High-volume moderation
  • Real-time filtering
  • Basic safety requirements
response = client.moderations.create(
    model="shieldgemma",
    input="Content to moderate"
)

# Check category scores
scores = response.results[0].category_scores
if scores["violence"] > 0.5:
    flag_for_review()  # your application's escalation hook

Model Comparison

Feature      Llama Guard 3 8B (FREE)   Llama Guard 3   ShieldGemma
Accuracy     ★★★★★                     ★★★★★           ★★★★☆
Speed        ★★★★★                     ★★★★☆           ★★★★★
Price        FREE                      $0.20/M         $0.15/M
Categories   11                        11              8
Best For     All use cases             High-stakes     High-volume

Safety Categories

The Llama Guard models detect all eleven core categories below; ShieldGemma covers a reduced set of eight:
Category                 Description
hate                     Content expressing hatred toward protected groups
hate/threatening         Hateful content with threats of violence
harassment               Content meant to harass, bully, or intimidate
harassment/threatening   Harassment with explicit threats
self-harm                Content promoting or glorifying self-harm
self-harm/intent         Expression of intent to self-harm
self-harm/instructions   Instructions for self-harm
sexual                   Sexually explicit content
sexual/minors            Sexual content involving minors
violence                 Content depicting violence
violence/graphic         Graphic depictions of violence

Response Format

{
  "id": "modr-abc123",
  "model": "llama-guard-3",
  "results": [
    {
      "flagged": false,
      "categories": {
        "hate": false,
        "harassment": false,
        "self-harm": false,
        "sexual": false,
        "violence": false
      },
      "category_scores": {
        "hate": 0.0001,
        "harassment": 0.0023,
        "self-harm": 0.0001,
        "sexual": 0.0012,
        "violence": 0.0008
      }
    }
  ]
}
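Each entry in results carries the flagged verdict, per-category booleans, and per-category scores. A minimal sketch of reading them, assuming category_scores is a plain dict keyed by category name as shown above:
result = client.moderations.create(
    model="llama-guard-3",
    input="Content to moderate"
).results[0]

print("Flagged:", result.flagged)

# Report the strongest signal, useful for logging and for tuning thresholds
top_category = max(result.category_scores, key=result.category_scores.get)
print(f"Highest score: {top_category} ({result.category_scores[top_category]:.4f})")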

Use Cases

Check user messages before processing:
def validate_input(message):
    result = client.moderations.create(
        model="llama-guard-3",
        input=message
    ).results[0]

    if result.flagged:
        # ContentPolicyError is an application-defined exception
        raise ContentPolicyError(
            "Your message violates our content policy"
        )

    return message
Verify AI responses before showing to users:
def safe_generate(prompt):
    # Generate response
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content

    # Moderate output
    moderation = client.moderations.create(
        model="llama-guard-3",
        input=content
    ).results[0]

    if moderation.flagged:
        return "I cannot provide that response."

    return content
Use category scores for fine-grained control:
def custom_moderation(text, thresholds):
    result = client.moderations.create(
        model="llama-guard-3",
        input=text
    ).results[0]

    violations = []
    for category, threshold in thresholds.items():
        score = result.category_scores.get(category, 0)
        if score > threshold:
            violations.append(category)

    return violations

# Strict for violence, lenient for mild language
thresholds = {
    "violence": 0.3,
    "harassment": 0.7,
    "hate": 0.5
}
Moderate multiple items efficiently:
comments = ["comment 1", "comment 2", "comment 3"]

result = client.moderations.create(
    model="shieldgemma",  # Faster for batches
    input=comments
)

for i, r in enumerate(result.results):
    if r.flagged:
        print(f"Comment {i} flagged: {comments[i]}")

Best Practices

Moderate Both Directions

Check both user inputs AND AI outputs for comprehensive safety

Use Custom Thresholds

Adjust category_scores based on your platform’s needs

Log for Review

Keep logs of flagged content for human review and model improvement
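A minimal sketch of such a log, assuming a local JSON-lines file and the response shape shown above; in production you would likely route these records to your existing logging or review tooling instead:
import json
import time

def log_flagged(content, result, path="moderation_flags.jsonl"):
    # Append flagged content and its scores to a JSON-lines file
    # so humans can audit decisions and thresholds later.
    record = {
        "timestamp": time.time(),
        "content": content,
        "flagged_categories": [k for k, v in result.categories.items() if v],
        "category_scores": result.category_scores,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

content = "Content to moderate"
result = client.moderations.create(model="llama-guard-3", input=content).results[0]
if result.flagged:
    log_flagged(content, result)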

Graceful Degradation

Have fallback behavior when moderation service is unavailable
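A minimal sketch of one fallback policy, assuming you prefer to fail closed and hold content for review when the moderation call errors out; failing open may be acceptable on lower-risk platforms:
def moderate_with_fallback(text):
    # Returns True when the content should be blocked or held for review.
    try:
        result = client.moderations.create(
            model="llama-guard-3",
            input=text
        ).results[0]
        return result.flagged
    except Exception:
        # Moderation service unavailable: fail closed by holding the
        # content for human review instead of publishing it unchecked.
        return True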

Performance Considerations

Scenario                 Recommended Model
All use cases (FREE)     Llama Guard 3 8B
Real-time chat           Llama Guard 3 8B (FREE)
User-generated content   Llama Guard 3 8B (FREE)
High-volume batches      Llama Guard 3 8B (FREE)
Regulatory compliance    Llama Guard 3

Choosing a Model

Recommendation: Use llama-guard-3-8b for all content moderation. It’s free via Groq, fast, and offers the same 11 safety categories as the paid version.