
Ember Quest

On-Device Small Language Model Use Cases for Games

Large language models are too expensive, too slow, and too hard to fine-tune for real-time game AI. A small, domain-specific model trained on Amazon SageMaker runs on-device with low latency and zero per-inference cost.

Godot 4.6 · Qwen 2B · llama.cpp · SageMaker · Bedrock

Demo

Full Level 1 playthrough showing proactive SLM hints, dynamic difficulty adjustment, interactive Q&A, and the debug overlay.

The Problem

Most players quit a game within the first month. Frustration, missed mechanics, and difficulty walls are the top causes.

😤

Traditional tutorials are skippable and static. Players who miss mechanics struggle silently and churn.

🎚️

Difficulty settings are one-size-fits-all. There is no way to adapt to individual player behavior in real time.

📏

Rule-based systems handle only the common scenarios; the long tail of hundreds of game-state combinations goes unaddressed.

Why an On-Device SLM?

One base model, multiple use cases via hot-swappable LoRA adapters. SLMs are the right choice when the output needs to be language: contextual hints, dialogue, Q&A, or natural language tool arguments.

What the SLM does

  • Hints at the right moment when a player misses a mechanic
  • Secretly adjusts difficulty when a player is struggling
  • Tracks unused abilities and nudges at the right moment
  • Stays silent 90%+ of the time

Why not a large model?

  • Too slow for real-time game loops
  • Per-inference cost scales with player count
  • Fine-tuning for a narrow domain is expensive and fragile
  • A small fine-tuned model can outperform a general-purpose large model on narrow, structured tasks
  • Works offline, no network dependency

See also: NVIDIA Research, "Small Language Models are the Future of Agentic AI"

Use Cases

All use cases share one base model in memory. Each has a separate LoRA adapter that swaps in milliseconds.
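The swap pattern can be sketched as a small registry keyed by use case. This is a minimal illustration, not the project's actual API: the adapter file names, the `SlmRuntime` class, and the point where the real llama.cpp GDExtension call would go are all assumptions.

```python
# Minimal sketch of per-use-case LoRA adapter selection.
# The base model stays resident; only the small adapter changes.
# Adapter paths and class names here are illustrative assumptions.

ADAPTERS = {
    "assistant": "adapters/assistant.lora",   # proactive hints
    "qa":        "adapters/qa.lora",          # Ask Pip (RAG Q&A)
    "companion": "adapters/companion.lora",   # free-text commentary
}

class SlmRuntime:
    def __init__(self, base_model_path: str):
        self.base = base_model_path   # loaded once, ~1.2 GB resident
        self.active_adapter = None

    def use(self, use_case: str) -> str:
        """Hot-swap the LoRA adapter for the requested use case."""
        path = ADAPTERS[use_case]
        if path != self.active_adapter:
            # In the real engine this would call into the llama.cpp
            # GDExtension to apply the adapter; here we just record it.
            self.active_adapter = path
        return self.active_adapter

rt = SlmRuntime("models/qwen-2b-q4.gguf")
rt.use("assistant")
rt.use("qa")
```

Because only the 40-80 MB adapter changes, switching use cases avoids reloading the 1.2 GB base model, which is what makes millisecond swaps plausible.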

Ready

Proactive Game Assistant

Invisible AI that watches gameplay and decides when and how to help. Structured tool-call output with game-side guardrails.
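One way to picture the game-side guardrails: the model emits a structured tool call, and the game validates it before executing anything. The tool names, schema, and cooldown value below are hypothetical, chosen only to show the shape of the check.

```python
import json

# Hypothetical tool-call schema and guardrails; the real tool names
# and limits in Ember Quest may differ.
ALLOWED_TOOLS = {"show_hint", "adjust_difficulty", "stay_silent"}
HINT_COOLDOWN_S = 30.0

def validate_tool_call(raw: str, seconds_since_last_hint: float) -> dict:
    """Parse model output and apply game-side guardrails.
    Falls back to stay_silent on anything malformed or disallowed."""
    silent = {"tool": "stay_silent", "args": {}}
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return silent
    if call.get("tool") not in ALLOWED_TOOLS:
        return silent
    if call["tool"] == "show_hint" and seconds_since_last_hint < HINT_COOLDOWN_S:
        return silent  # never spam hints, whatever the model says
    if call["tool"] == "adjust_difficulty":
        # Clamp to a small step so one bad inference can't swing the game.
        delta = float(call.get("args", {}).get("delta", 0.0))
        call["args"] = {"delta": max(-0.1, min(0.1, delta))}
    return call
```

The key design point is that the model only proposes; the game decides. A malformed or out-of-policy call degrades to silence rather than to a visible glitch.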

Ready

Interactive Q&A (Ask Pip)

Player asks anything about the game world. RAG-powered answers using embedded lore + session context.
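The retrieval half of that pipeline can be shown with a toy example. The real system uses a 20 MB embedding model and an in-engine vector store; this sketch substitutes bag-of-words vectors and made-up lore lines purely to show the retrieve-then-prompt flow.

```python
import math
import re
from collections import Counter

# Made-up lore snippets standing in for the embedded game lore.
LORE = [
    "The Ember Shrine restores health when lit with a fire orb.",
    "Pip the companion fears water and will not cross rivers.",
    "The dash ability passes through thorn walls unharmed.",
]

def embed(text: str) -> Counter:
    # Toy stand-in for the real embedding model: word counts.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(LORE, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

# Retrieved passages are prepended to the SLM prompt so answers stay
# grounded in game lore rather than the model's imagination.
top = retrieve("how do I get past the thorn walls?")
```

Grounding matters doubly at this model size: as noted later, RAG-grounded text is within a 2B model's reach while ungrounded free text is not.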

In Progress

AI Companion (Pip)

Companion that comments on gameplay in real time. Free-text generation with personality constraints.

Planned

Level Director

SLM configures procedural level generation based on player preferences and session history.

Architecture

Game State JSON (every 2-7 s)
    → On-Device SLM (base model + LoRA adapter)
    → Tool call / text / Q&A response
    → Game executes

The game code is identical across all tiers; only the transport layer changes.
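That separation can be sketched as a transport interface with interchangeable backends. The class and endpoint names are illustrative assumptions; the real calls (local llama.cpp inference, an AWS endpoint request) are reduced to placeholders.

```python
from typing import Protocol

class SlmTransport(Protocol):
    def infer(self, prompt: str) -> str: ...

class OnDeviceTransport:
    """Calls the local llama.cpp runtime; zero network, zero per-call cost."""
    def infer(self, prompt: str) -> str:
        return f"[local] {prompt}"  # placeholder for the real local call

class CloudTransport:
    """Same fine-tuned model behind an AWS endpoint, used as a fallback
    for devices that can't run the model locally."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint
    def infer(self, prompt: str) -> str:
        return f"[{self.endpoint}] {prompt}"  # placeholder for an HTTP call

def assist(transport: SlmTransport, game_state_json: str) -> str:
    # Game-side logic never changes; only `transport` differs per device.
    return transport.infer(game_state_json)
```

Because both backends serve the same fine-tuned weights behind the same interface, falling back to the cloud changes latency and cost, not behavior.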

Current

On-Device

Model runs locally on CPU, GPU, or NPU. Zero latency, zero cost, works offline. This is what the demo shows.

Planned

Cloud SLM (Hybrid)

Same fine-tuned model deployed to AWS as fallback for devices that can't run locally. Same API, same accuracy.

Model Stack

Component                             Size            Purpose
Base model (Qwen 3.5 2B, quantized)   1.2 GB          Text generation, always loaded
LoRA adapters                         40-80 MB each   Per use case, hot-swapped at runtime
Embedding model                       20 MB           Vector search for game lore (RAG)
Vector database                       5 MB            In-engine vector store
Total on-device                       1.3 GB          Fits on mobile (4+ GB RAM)

Training

Amazon SageMaker for fine-tuning, Amazon Bedrock for data enrichment. Low cost per training run. No real gameplay data needed to start.
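"No real gameplay data needed to start" implies synthetic scenarios shaped into training records. A plausible sketch of that shaping step is below; the field names, prompt wording, and prompt/completion JSONL layout are all assumptions, and the teacher-labeling step (Amazon Bedrock) is represented by a hand-written label.

```python
import json

def to_training_record(state: dict, label: dict) -> str:
    """Turn one synthetic game state plus its teacher-generated tool-call
    label into a JSONL fine-tuning record (hypothetical schema)."""
    prompt = (
        "You are the Ember Quest assistant. Given the game state, "
        "respond with one tool call.\nState: " + json.dumps(state, sort_keys=True)
    )
    return json.dumps({"prompt": prompt, "completion": json.dumps(label)})

# A synthetic scenario and a label a teacher model might produce for it.
scenario = {"deaths_in_row": 3, "zone": "thorn_maze", "dash_used": False}
label = {"tool": "show_hint", "args": {"id": "dash"}}
record = to_training_record(scenario, label)
```

Generating states programmatically and labeling them with a large teacher model is what keeps the per-run training cost low: the expensive model is used once per example, offline, instead of once per player, online.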

At 2B parameters, structured output and RAG-grounded text work reliably; ungrounded free-text generation does not.

Tech Stack

Game

  • Godot 4.6 (GDScript)
  • Custom llama.cpp GDExtension
  • In-engine RAG (embeddings + vector store)
  • Ninja Adventure asset pack (CC0)

Training & Infrastructure

  • Amazon SageMaker (GPU training)
  • Amazon Bedrock (teacher enrichment)
  • MLflow (experiment tracking)
  • Kiro IDE (development)

References