Testing an AI Agent Without Mocking the LLM

The backend test suite runs entirely in-process: no Docker, no API keys, and no LLM calls. Here is the abstract interface pattern that makes this possible, and why it is the right approach for testing agent-adjacent logic.

A common concern when building AI-powered applications is testing. The LLM is non-deterministic. API calls cost money. Tests that depend on a live OpenAI API are slow, flaky, and expensive to run in CI.

The solution most developers reach for is mocking: replace the LLM with a function that returns a fixed response. This works for unit tests of individual functions, but it does not test the protocol layer — the MCP tools, the A2A agent contracts, the UCP checkout flow, the AP2 mandate logic.

ShopAgent's test suite takes a different approach. It does not test the LLM at all. It tests the protocol layer exclusively: the MCP catalog, the UCP checkout, the AP2 mandate chain, and the payment failure paths. All of this runs in-process against a temporary SQLite database and mock service implementations. No Docker. No API keys. No live calls.

WHAT GETS TESTED AND WHAT DOESN'T

ShopAgent's backend has two distinct layers:

The protocol layer — MCP catalog, A2A agents, UCP merchant, AP2 mandate chain. These are deterministic: given a set of inputs, they produce predictable outputs. They can be tested without an LLM.
The LLM layer — Intent classification, response generation. These are non-deterministic and depend on the OpenAI API. They are not tested in the automated suite.

This separation is intentional. The protocol layer is where bugs have real consequences: incorrect prices, wrong stock counts, double-counted cart items, invalid mandate signatures. The LLM layer produces text. Text quality is assessed through integration testing and manual review, not automated assertions.

THE ABSTRACT INTERFACE PATTERN

Every external dependency in ShopAgent is behind an abstract base class:

from abc import ABC, abstractmethod

class PaymentProcessor(ABC):
    @abstractmethod
    async def charge(
        self, amount_cents: int, currency: str,
        payment_method: str, metadata: dict
    ) -> PaymentResult: ...

class ShippingProvider(ABC):
    @abstractmethod
    async def estimate(
        self, product_id: str, stock: int,
        destination_region: str = ""
    ) -> ShippingEstimate: ...

class TaxCalculator(ABC):
    @abstractmethod
    async def calculate(
        self, subtotal: float, currency: str = "USD",
        region: str = ""
    ) -> TaxResult: ...

class MandateSigner(ABC):
    @abstractmethod
    def sign(self, payload: dict) -> SignatureResult: ...

    @abstractmethod
    def verify(self, payload: dict, result: SignatureResult) -> bool: ...

The application code never imports a concrete implementation. It calls get_payment_processor(), get_shipping_provider(), etc. from the service factory. The factory reads the environment and returns the appropriate implementation.

This means the same code path runs in tests (with mock implementations) and in production (with real implementations). There is no test-specific branching in the application code.
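
As a minimal sketch of what this looks like from the caller's side, the hypothetical pay_for_order function below depends only on the PaymentProcessor interface and the factory. The PaymentResult fields, the trimmed-down factory, and pay_for_order itself are assumptions made for illustration, not ShopAgent's actual code.

```python
import asyncio
import os
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class PaymentResult:
    # Assumed shape; the real PaymentResult may carry more fields.
    status: str
    transaction_id: str
    provider: str


class PaymentProcessor(ABC):
    @abstractmethod
    async def charge(self, amount_cents: int, currency: str,
                     payment_method: str, metadata: dict) -> PaymentResult: ...


class MockPaymentProcessor(PaymentProcessor):
    async def charge(self, amount_cents, currency, payment_method, metadata):
        return PaymentResult("succeeded", f"mock-txn-{uuid.uuid4().hex[:8]}", "mock")


def get_payment_processor() -> PaymentProcessor:
    # Trimmed-down factory: only the mock branch is sketched here.
    provider = os.getenv("PAYMENT_PROVIDER", "mock")
    if provider != "mock":
        raise NotImplementedError(f"{provider} is not part of this sketch")
    return MockPaymentProcessor()


async def pay_for_order(total_cents: int) -> bool:
    # Hypothetical caller: it knows the interface, never the concrete class.
    processor = get_payment_processor()
    result = await processor.charge(total_cents, "USD", "mock_card", {})
    return result.status == "succeeded"
```

Nothing in pay_for_order changes when the factory starts returning a different processor, which is what keeps test-specific branching out of the application code.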

THE SERVICE FACTORY

# backend/services/factory.py
import os

def get_payment_processor() -> PaymentProcessor:
    provider = os.getenv("PAYMENT_PROVIDER", "mock")
    if provider == "stripe":
        return StripePaymentProcessor()
    if provider == "paypal":
        return PayPalPaymentProcessor()
    return MockPaymentProcessor()

def get_shipping_provider() -> ShippingProvider:
    provider = os.getenv("SHIPPING_PROVIDER", "mock")
    if provider == "fedex":
        return FedExShippingProvider()
    return MockShippingProvider()

def get_tax_calculator() -> TaxCalculator:
    provider = os.getenv("TAX_PROVIDER", "configurable")
    if provider == "taxjar":
        return TaxJarCalculator()
    return ConfigurableTaxCalculator(
        rate=float(os.getenv("TAX_RATE", "0.0")),
        jurisdiction=os.getenv("TAX_JURISDICTION", "DEFAULT"),
    )

def get_mandate_signer() -> MandateSigner:
    signer = os.getenv("MANDATE_SIGNER", "ed25519")
    if signer == "vault":
        return VaultMandateSigner()
    return Ed25519MandateSigner()

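In tests, none of these environment variables are set, so the factory falls back to its defaults: the mock payment processor, the mock shipping provider, a configurable tax calculator with a rate of 0.0, and the Ed25519 signer. None of these defaults needs an external service. In production, environment variables select the real implementations.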

THE MOCK IMPLEMENTATIONS

The mock implementations in services/mocks.py produce realistic responses that are deterministic in everything except the generated transaction ID:

import os
import uuid

class MockPaymentProcessor(PaymentProcessor):
    async def charge(self, amount_cents, currency, payment_method, metadata):
        status = os.getenv("MOCK_PAYMENT_STATUS", "succeeded")
        return PaymentResult(
            status=status,
            transaction_id=f"mock-txn-{uuid.uuid4().hex[:8]}",
            provider="mock",
        )

class MockShippingProvider(ShippingProvider):
    async def estimate(self, product_id, stock, destination_region=""):
        if stock == 0:
            return ShippingEstimate(
                estimated_delivery="Currently unavailable — out of stock",
                carrier="mock",
            )
        return ShippingEstimate(
            estimated_delivery="3–5 business days",
            carrier="mock",
            business_days_min=3,
            business_days_max=5,
            service_level="standard",
            cost_cents=0,
        )

The MOCK_PAYMENT_STATUS environment variable controls whether payment succeeds or fails. Tests set it to "failed" to exercise the payment failure path and leave it unset (the default is "succeeded") for the happy path.
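
Because the mock reads the variable on every call, a test only needs to set it for the duration of one checkout. One way to do that safely is a small context manager that restores the previous value afterwards. Both mock_payment_status and current_status below are hypothetical helpers written for this sketch, not part of ShopAgent; inside a pytest test, the built-in monkeypatch.setenv gives the same guarantee.

```python
import os
from contextlib import contextmanager


@contextmanager
def mock_payment_status(status: str):
    """Temporarily set MOCK_PAYMENT_STATUS, then restore the old value."""
    previous = os.environ.get("MOCK_PAYMENT_STATUS")
    os.environ["MOCK_PAYMENT_STATUS"] = status
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("MOCK_PAYMENT_STATUS", None)
        else:
            os.environ["MOCK_PAYMENT_STATUS"] = previous


def current_status() -> str:
    # Mirrors the lookup the mock processor performs on every charge.
    return os.getenv("MOCK_PAYMENT_STATUS", "succeeded")
```

Restoring the value in a finally block matters because a failed assertion would otherwise leak "failed" into every test that runs afterwards.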

TESTING THE MCP CATALOG

The MCP catalog tests create a temporary SQLite database file, seed it with test products, and test search and filtering:

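import aiosqlite
import pytest

# Assumes pytest-asyncio with asyncio_mode = "auto". In strict mode, use
# @pytest_asyncio.fixture here and mark each test with @pytest.mark.asyncio.
@pytest.fixture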
async def db(tmp_path):
    db_path = tmp_path / "test_products.db"
    async with aiosqlite.connect(db_path) as conn:
        await conn.execute("""
            CREATE TABLE products (
                id TEXT PRIMARY KEY, name TEXT, price REAL,
                category TEXT, description TEXT, rating REAL,
                stock INTEGER, image_url TEXT
            )
        """)
        await conn.executemany(
            "INSERT INTO products VALUES (?,?,?,?,?,?,?,?)",
            TEST_PRODUCTS,
        )
        await conn.commit()
    return db_path

async def test_search_by_category(db):
    results = await search_products(db, category="running", max_price=None)
    assert all(r["category"] == "running" for r in results)

async def test_max_price_filter(db):
    results = await search_products(db, query="shoes", max_price=50.0)
    assert all(r["price"] <= 50.0 for r in results)

async def test_boundary_product_included(db):
    # A product at exactly $99.99 should be included for max_price=99.99
    results = await search_products(db, query="", max_price=99.99)
    ids = [r["id"] for r in results]
    assert "boundary-product-9999" in ids

No Docker. No running MCP server. The test creates a temporary SQLite database, calls the search function directly, and asserts on the result. The test runs in milliseconds.
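
The article does not show search_products itself. The sketch below is one plausible shape for it, written against the standard-library sqlite3 module instead of aiosqlite so that it runs without extra dependencies. The signature (it takes a connection, not a path), the LIKE matching, and the sample rows are all assumptions. It illustrates why a boundary test like the one above can pass: the price filter uses <=, and every filter is bound as a parameter instead of being interpolated into the SQL.

```python
import sqlite3


def search_products(conn, query="", category=None, max_price=None):
    """Hypothetical synchronous stand-in for the async search function."""
    clauses, params = [], []
    if query:
        clauses.append("(name LIKE ? OR description LIKE ?)")
        params += [f"%{query}%", f"%{query}%"]
    if category is not None:
        clauses.append("category = ?")
        params.append(category)
    if max_price is not None:
        clauses.append("price <= ?")  # inclusive upper bound
        params.append(max_price)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM products{where} ORDER BY id", params)
    return [dict(row) for row in rows]


# Illustrative data only; these are not ShopAgent's TEST_PRODUCTS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT, price REAL,"
    " category TEXT, description TEXT, rating REAL, stock INTEGER, image_url TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?,?,?,?,?,?,?,?)",
    [
        ("shoe-001", "Trail shoes", 49.5, "running", "Light shoes", 4.5, 3, ""),
        ("boundary-product-9999", "Road shoes", 99.99, "running", "", 4.0, 1, ""),
        ("bag-001", "Gym bag", 120.0, "bags", "Large bag", 4.2, 0, ""),
    ],
)
```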

TESTING UCP CHECKOUT

The UCP checkout tests verify the session lifecycle, tax calculation, and money arithmetic:

async def test_checkout_creates_session_with_correct_totals():
    cart = [
        CartItem(product_id="shoe-001", name="Nike Pegasus", price=89.99, quantity=1),
        CartItem(product_id="shoe-002", name="NB 880", price=74.99, quantity=2),
    ]
    session = await create_checkout_session(cart=cart)

    assert session["status"] == "created"
    # Subtotal must be exact — test with integer cents, not floats
    assert session["subtotal_cents"] == 8999 + 7499 * 2  # 23997

async def test_all_amounts_are_integer_cents():
    session = await create_checkout_session(cart=[...])
    # No floating-point values in the session
    assert isinstance(session["subtotal_cents"], int)
    assert isinstance(session["tax_cents"], int)
    assert isinstance(session["total_cents"], int)

The "all amounts are integer cents" test is a contract test — it verifies that the UCP implementation honours the integer-only money representation that AP2 depends on. A float here would create a rounding discrepancy between the session total and the mandate total.
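
To make the integer-cents contract concrete, here is a minimal sketch of how prices might be converted and totalled without ever holding a float amount. The helper names to_cents and session_totals, the half-up rounding rule, and the 8.25% tax rate in the test values are assumptions; the article does not state how ShopAgent rounds tax.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(price) -> int:
    """Convert a decimal price such as 89.99 to integer cents."""
    # str() first, so Decimal sees "89.99" and not the binary float value.
    cents = (Decimal(str(price)) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


def session_totals(items, tax_rate="0.0"):
    """items: iterable of (price, quantity); tax_rate: decimal string."""
    subtotal = sum(to_cents(price) * qty for price, qty in items)
    tax = int((Decimal(subtotal) * Decimal(tax_rate)).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP))
    # Rounding happens once, on the tax, so the total is exact by construction.
    return {"subtotal_cents": subtotal, "tax_cents": tax,
            "total_cents": subtotal + tax}
```

With the cart from the test above and an assumed rate of "0.0825", the subtotal is 23997 cents, the tax rounds from 1979.7525 to 1980, and the total is 25977.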

TESTING THE AP2 MANDATE CHAIN

The AP2 tests verify that the mandate signature links the correct intent, cart, and total:

async def test_mandate_signs_correct_payload():
    signer = Ed25519MandateSigner()
    cart = [CartItem(product_id="shoe-001", price=89.99, quantity=1, ...)]
    payload = {
        "intent": "purchase",
        "session_id": "sess-abc",
        "cart_items": [c.model_dump() for c in cart],
        "total_cents": 8999,
        "currency": "USD",
    }
    result = signer.sign(payload)

    assert signer.verify(payload, result) is True

async def test_tampered_payload_fails_verification():
    signer = Ed25519MandateSigner()
    payload = {"intent": "purchase", "total_cents": 8999, ...}
    result = signer.sign(payload)

    tampered = {**payload, "total_cents": 89990}  # 10x the agreed amount
    assert signer.verify(tampered, result) is False

The tampered payload test is the critical one. It verifies that changing the total after signing invalidates the mandate — which is the entire point of AP2.
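
Tamper detection only works if the payload is serialized the same way at signing time and at verification time. The sketch below shows that canonicalization step. To stay within the standard library it uses HMAC-SHA256 as a stand-in for Ed25519, so it is not how Ed25519MandateSigner works internally; the key, the function names, and the canonical JSON format are all assumptions.

```python
import hashlib
import hmac
import json

# Demo key only. A real Ed25519 signer would hold a private key instead.
DEMO_KEY = b"not-a-real-secret"


def canonical_bytes(payload: dict) -> bytes:
    # Sorted keys and fixed separators give the same bytes for equal dicts.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload: dict) -> str:
    return hmac.new(DEMO_KEY, canonical_bytes(payload), hashlib.sha256).hexdigest()


def verify(payload: dict, signature: str) -> bool:
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign(payload), signature)
```

Changing total_cents changes the canonical bytes and therefore the signature, while reordering the keys of an otherwise identical payload does not.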

TESTING PAYMENT FAILURE PATHS

The MOCK_PAYMENT_STATUS environment variable enables payment failure tests:

async def test_payment_failure_returns_402():
    os.environ["MOCK_PAYMENT_STATUS"] = "failed"
    try:
        response = await complete_checkout(
            session_id="sess-abc",
            shipping_address={...},
            payment_method="mock_card",
        )
        assert response["status"] == 402
        # Audit trail must still be created even on failure
        audit = await get_audit_record(session_id="sess-abc")
        assert audit is not None
        assert audit["payment_status"] == "failed"
    finally:
        os.environ.pop("MOCK_PAYMENT_STATUS", None)

async def test_double_complete_returns_409():
    # Completing the same session twice must be rejected with 409 Conflict,
    # not charged a second time
    await complete_checkout(session_id="sess-xyz", ...)
    response = await complete_checkout(session_id="sess-xyz", ...)
    assert response["status"] == 409

These tests cover failure paths that are hard to test with a live payment processor. The mock makes them deterministic.
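
The behaviour these two tests pin down can be sketched as a small state check around the charge. Everything here is hypothetical: the in-memory SESSIONS and AUDIT dicts, the 200 and 404 status codes, the payment_status parameter, and the choice to leave a session open after a failed payment are assumptions, not ShopAgent's implementation.

```python
import asyncio

SESSIONS = {"sess-abc": "created", "sess-xyz": "created"}  # session_id -> state
AUDIT = {}  # session_id -> audit record


async def charge(payment_status: str) -> str:
    # Stand-in for the mock processor; the outcome is passed in directly.
    return payment_status


async def complete_checkout(session_id: str, payment_status: str = "succeeded") -> dict:
    state = SESSIONS.get(session_id)
    if state is None:
        return {"status": 404}
    if state != "created":
        return {"status": 409}  # already completed: never charge again
    outcome = await charge(payment_status)
    # The audit record is written before branching, so failures are recorded too.
    AUDIT[session_id] = {"payment_status": outcome}
    if outcome != "succeeded":
        return {"status": 402}
    SESSIONS[session_id] = "completed"
    return {"status": 200}
```

Checking the session state before charging is what turns a duplicate request into a 409 instead of a second charge.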

WHAT WE'D DO DIFFERENTLY

Integration tests against the running containers — The current tests cover the protocol layer in isolation. What they do not cover is the full stack: does the orchestrator correctly call the MCP server, does the MCP server correctly query the database, does the response round-trip correctly back through the A2A contract. A set of integration tests that spin up Docker containers using testcontainers and make real HTTP calls would catch contract mismatches.

Property-based testing for money arithmetic — The total_cents = subtotal_cents + tax_cents invariant should hold for all combinations of cart items and tax rates. hypothesis (a Python property-based testing library) could generate thousands of random carts and verify the invariant holds across all of them.
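
As a rough illustration of the idea, the sketch below checks the invariant over a couple of thousand randomly generated carts. It uses a seeded random.Random loop from the standard library as a stand-in for Hypothesis, which would replace the hand-rolled generator and add shrinking of failing cases. The totals function, the basis-point tax rates, and the value ranges are assumptions; in the real suite the function under test would be the checkout code itself.

```python
import random


def totals(items, rate_bps: int) -> dict:
    """items: list of (unit_price_cents, quantity); rate_bps: tax in basis points."""
    subtotal = sum(price * qty for price, qty in items)
    # Integer-only half-up rounding: add half the divisor before floor division.
    tax = (subtotal * rate_bps + 5_000) // 10_000
    return {"subtotal_cents": subtotal, "tax_cents": tax,
            "total_cents": subtotal + tax}


def check_invariant(runs: int = 2_000, seed: int = 0) -> int:
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(runs):
        items = [(rng.randint(1, 500_000), rng.randint(1, 20))
                 for _ in range(rng.randint(1, 10))]
        t = totals(items, rate_bps=rng.randint(0, 2_500))
        assert all(isinstance(v, int) for v in t.values())
        assert t["total_cents"] == t["subtotal_cents"] + t["tax_cents"]
        assert 0 <= t["tax_cents"] <= t["subtotal_cents"]
    return runs
```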

THE TAKEAWAY

The abstract interface pattern is the foundational decision that makes the test suite possible. By hiding every external dependency behind an ABC, the application code is testable with any implementation — mock or real — without modification.

The tests cover exactly the code that matters most: the protocol contracts where a bug has real consequences (wrong prices, invalid mandates, missing orders). The LLM is not tested because it is not deterministic and its outputs are not the protocol layer's responsibility.

This approach scales to production: when you swap a mock for a real Stripe payment processor, you add a StripePaymentProcessor class and register it in the factory. No test infrastructure changes. The same test fixtures that tested the mock payment flow will test the real one — you just change the environment variable.

The ShopAgent demo is live at https://shop-agent.agilecreativeminds.nl. Built by Agile Creative Minds.