Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
# LLM Testing Patterns
Test AI applications with deterministic patterns using DeepEval and RAGAS.
## Quick Reference
### Mock LLM Responses
```python
import pytest
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    """Deterministic stand-in for the model client."""
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # Patch the factory where it is looked up, not where it is defined
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```
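Streaming responses can be mocked the same way. A minimal sketch, assuming a client that exposes an async-iterator `stream` method — the attribute name and chunk shape here are illustrative, not a specific provider's API:

```python
import pytest
from unittest.mock import AsyncMock

@pytest.fixture
def mock_streaming_llm():
    async def fake_stream(*args, **kwargs):
        # Yield chunks the way a streaming client would
        for chunk in ["Mocked ", "streaming ", "response"]:
            yield {"delta": chunk}

    mock = AsyncMock()
    mock.stream = fake_stream  # calling mock.stream(...) returns an async iterator
    return mock

@pytest.mark.asyncio
async def test_streams_chunks(mock_streaming_llm):
    chunks = [c["delta"] async for c in mock_streaming_llm.stream("prompt")]
    assert "".join(chunks) == "Mocked streaming response"
```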
### DeepEval Quality Testing
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)
```
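### RAGAS Evaluation
RAGAS scores the same qualities from the retrieval side. A minimal sketch using the classic `ragas.evaluate` entry point over a Hugging Face `Dataset` — the column names follow RAGAS conventions, but newer RAGAS releases use an `EvaluationDataset` API instead, so verify against your installed version:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One row per question/answer pair; "contexts" holds the retrieved chunks
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris is the capital of France."],
})

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores
```

Note that RAGAS metrics call an LLM judge under the hood, so these runs belong in a scheduled evaluation job, not in unit-test CI.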
### Timeout Testing
```python
import asyncio
import pytest

async def slow_llm_call():
    # Stand-in for a hanging provider call; sleep keeps the test hermetic
    await asyncio.sleep(10)

@pytest.mark.asyncio
async def test_respects_timeout():
    # asyncio.timeout() requires Python 3.11+
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()
```
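### Structured Output Validation
The skill description also covers validating structured LLM outputs. A minimal sketch using Pydantic v2 — the `Finding` schema here is hypothetical, standing in for whatever shape your application expects back from the model:

```python
import pytest
from pydantic import BaseModel, ValidationError, field_validator

class Finding(BaseModel):
    summary: str
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        return v

def test_llm_output_parses_into_schema():
    raw = '{"summary": "Paris is the capital.", "confidence": 0.92}'
    finding = Finding.model_validate_json(raw)
    assert finding.confidence >= 0.5

def test_malformed_llm_output_is_rejected():
    # Out-of-range confidence must fail validation, not pass silently
    with pytest.raises(ValidationError):
        Finding.model_validate_json('{"summary": "x", "confidence": 2.0}')
```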
## Quality Metrics (2026)
| Metric | Threshold | Purpose |
|--------|-----------|---------|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |
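The last two rows map onto DeepEval's `HallucinationMetric` and `ContextualPrecisionMetric`. A hedged sketch: `HallucinationMetric` scores against `context` and treats its threshold as a maximum (lower is better), while `ContextualPrecisionMetric` additionally needs `expected_output` — verify the required test-case fields against your DeepEval version:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, ContextualPrecisionMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris is the capital of France.",
    context=["Paris is the capital of France."],            # ground truth for hallucination
    retrieval_context=["Paris is the capital of France."],  # retrieved chunks for precision
)

assert_test(test_case, [
    HallucinationMetric(threshold=0.3),         # ≤ 0.3: fraction of contradicted context
    ContextualPrecisionMetric(threshold=0.7),   # ≥ 0.7: relevant chunks ranked first
])
```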
## Anti-Patterns (FORBIDDEN)
```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock the LLM in unit tests
with patch("app.llm", mock_llm):
    result = await synthesize_findings(sample_findings)
```