llm-testing

verified

Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.

Marketplace: orchestkit (yonatangross/orchestkit)
Plugin: ork (development)
Repository: yonatangross/orchestkit (33 stars)

skills/llm-testing/SKILL.md

Last Verified: January 25, 2026

Install Skill

npx add-skill https://github.com/yonatangross/orchestkit/blob/main/skills/llm-testing/SKILL.md -a claude-code --skill llm-testing

Installation paths:

Claude: .claude/skills/llm-testing/

Instructions

# LLM Testing Patterns

Test AI applications with deterministic patterns using DeepEval and RAGAS.

## Quick Reference

### Mock LLM Responses

```python
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None
```

### DeepEval Quality Testing

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)
```
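
The same quality checks can be scored with RAGAS, which the intro names but the original examples do not show. A minimal sketch, assuming the classic `ragas.evaluate` API that takes a Hugging Face `Dataset` with `question`/`answer`/`contexts`/`ground_truth` columns; metric imports and column names vary between RAGAS versions, so treat this as illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One-row evaluation dataset mirroring the DeepEval test case above
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris is the capital of France."],
})

scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])

# The result is dict-like in the classic API; compare against the thresholds below
assert scores["faithfulness"] >= 0.8
```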

### Timeout Testing

```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()
```
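
### Structured Output Validation

The skill description also covers validating structured outputs from LLMs. A minimal sketch using Pydantic v2 to check that a (mocked) LLM response parses into the expected schema; the `Finding` model and the JSON payloads are hypothetical:

```python
import pytest
from pydantic import BaseModel, ValidationError

# Hypothetical schema for a structured LLM output
class Finding(BaseModel):
    summary: str
    confidence: float

def test_structured_output_parses():
    raw = '{"summary": "Mocked response", "confidence": 0.85}'  # mocked LLM output
    finding = Finding.model_validate_json(raw)
    assert 0.0 <= finding.confidence <= 1.0

def test_malformed_output_is_rejected():
    # Wrong type for summary and missing confidence should fail validation
    with pytest.raises(ValidationError):
        Finding.model_validate_json('{"summary": 42}')
```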

## Quality Metrics (2026)

| Metric | Threshold | Purpose |
|--------|-----------|---------|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |
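
The hallucination and context-precision rows map onto DeepEval metrics as well. A sketch, assuming DeepEval's `HallucinationMetric` (lower scores are better, and it reads `context` rather than `retrieval_context`) and `ContextualPrecisionMetric` (which also needs `expected_output`); required fields may differ between DeepEval releases:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric, HallucinationMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris is the capital of France.",     # used by ContextualPrecisionMetric
    context=["Paris is the capital of France."],           # used by HallucinationMetric
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    HallucinationMetric(threshold=0.3),        # passes when hallucination score <= 0.3
    ContextualPrecisionMetric(threshold=0.7),  # passes when precision >= 0.7
])
```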

## Anti-Patterns (FORBIDDEN)

```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mo

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length: 4180 chars