Skip to main content

LLM Client Architecture

The AI service uses an abstract LLMClient interface with concrete implementations for Anthropic (primary) and OpenAI (fallback), plus a CachingLLMClient decorator for semantic response caching.

Architecture

CachingLLMClient (decorator)
└── AnthropicClient (primary)
└── fallback → OpenAIClient

Abstract Interface

# services/ai/src/gospelib_ai/llm/client.py
from abc import ABC, abstractmethod

class LLMClient(ABC):
@abstractmethod
async def complete(
self,
system: str,
messages: list[dict],
max_tokens: int = 1024,
) -> str:
"""Send a completion request to the LLM provider."""
...

All LLM interactions go through this interface, making it straightforward to swap providers or add new ones.

Anthropic Client (Primary)

class AnthropicClient(LLMClient):
def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
self._client = AsyncAnthropic(api_key=api_key)
self._model = model

async def complete(
self,
system: str,
messages: list[dict],
max_tokens: int = 1024,
) -> str:
response = await self._client.messages.create(
model=self._model,
max_tokens=max_tokens,
system=system,
messages=messages,
)
return response.content[0].text
  • Uses AsyncAnthropic for non-blocking I/O
  • Default model: claude-sonnet-4-20250514 (configurable via env var)
  • The system prompt contains scholarly context and style guidelines

OpenAI Client (Fallback)

The OpenAI client follows the same interface, activated when the Anthropic client is unavailable or returns errors.

Caching Decorator

class CachingLLMClient(LLMClient):
"""Semantic caching — avoids redundant LLM calls for similar prompts."""

def __init__(self, inner: LLMClient, redis_client, ttl: int = 3600):
self._inner = inner
self._redis = redis_client
self._ttl = ttl

async def complete(
self,
system: str,
messages: list[dict],
max_tokens: int = 1024,
) -> str:
cache_key = self._hash(system, messages)
if cached := await self._redis.get(f"gl:ai:cache:{cache_key}"):
log.info("llm_cache_hit", cache_key=cache_key)
return cached.decode()

result = await self._inner.complete(system, messages, max_tokens)
await self._redis.setex(
f"gl:ai:cache:{cache_key}", self._ttl, result
)
return result

How Caching Works

  1. Hash the prompt — The system prompt and messages are hashed to create a deterministic cache key
  2. Check Redis — Look up gl:ai:cache:<hash> in Redis
  3. On hit — Return cached response immediately (no LLM call)
  4. On miss — Call the inner LLM client, cache the result, then return it

Cache Key Format

gl:ai:cache:<sha256(system + messages)> → TEXT (TTL: 3600s)

Why Cache?

  • LLM API calls cost money and take 1–5 seconds
  • Many users ask the same questions about popular passages
  • Scripture content is immutable — explanations for the same passage + context don't change
  • The 1-hour TTL balances freshness with cost savings

Composing the Client

# Startup configuration
anthropic = AnthropicClient(api_key=settings.anthropic_api_key)
cached_client = CachingLLMClient(anthropic, redis_client, ttl=3600)

# Use cached_client for all LLM interactions
response = await cached_client.complete(system_prompt, messages)

The decorator pattern allows adding caching without modifying any provider implementation.