
Platform Setup

This guide walks you through setting up Convoy from sign-up to your first live test.

Step 1: Sign up

Go to app.convoylabs.com and create your account. This creates your organization — all agents, tests, and team members live under it.

Step 2: Create an agent

An agent represents one testable endpoint — a model, prompt, workflow, or any unit you want to roll out.
  1. Click Create Agent
  2. Configure your two environments:
    • Stable URL — your current production backend (e.g. https://api.acme.com/agent)
    • Testing URL — the environment running your new version (e.g. https://api-test.acme.com/agent)
  3. Convoy generates two values you need for integration:
    • Proxy URL — where your client sends requests (e.g. acme--chatbot.proxy.convoylabs.com)
    • Shared secret — used by the client as a bearer token and by the agent for signature verification
[Screenshot: Create New Agent dialog with fields for Agent Name, Description, First Version Name, Stable URL (production endpoint), and Test URL (testing endpoint)]
Save your shared secret immediately — it won’t be shown again. You’ll need it for both the client (as a bearer token) and the agent backend (for signature verification).
[Screenshot: Agent Created Successfully dialog showing the proxy hostname, an example curl command for calling the proxy, and the shared secret with a warning to save credentials immediately]
Both your stable and testing environments must be running and reachable. Convoy routes traffic to both — stable serves your current production version, and the testing URL serves the new version you want to evaluate.
Now integrate Convoy into your code:
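The two sides of the integration are the shared secret used as a bearer token by the client and signature verification on the agent backend. The sketch below illustrates both, assuming an HMAC-SHA256 signature over the raw request body sent in a header such as X-Convoy-Signature; the header name and algorithm are assumptions for illustration, so confirm the actual scheme in Convoy's integration docs.

```python
import hashlib
import hmac

SHARED_SECRET = "example-secret"  # load from a secret manager in real code


def client_headers(secret: str) -> dict:
    """Headers the client sends to the Convoy proxy (bearer token auth)."""
    return {
        "Authorization": f"Bearer {secret}",
        "Content-Type": "application/json",
    }


def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Agent-side check that a request really came through the Convoy proxy.

    Assumes HMAC-SHA256 over the raw request body, hex-encoded; the exact
    scheme is an assumption, not a documented Convoy value.
    """
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature_hex)
```

Point your client at the Proxy URL from the previous step with these headers, and run the verification check in your agent backend before processing each request.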

Step 3: Deploy a test

Once integrated and both environments are live, deploy a test to start routing traffic to the new version.
  1. Open your agent and click Deploy Test
  2. Configure the judge:
    • Judge model — the LLM that evaluates each session
    • Judge prompt — describes what to evaluate for your specific change. The judge receives each session’s input and output (reported via the session ingest endpoint) and scores it. The judge doesn’t see your agent’s system prompt, tools, or any other context — include whatever it needs to evaluate in the judge prompt itself.
  3. Set thresholds that control automatic decisions:
    • Promote — when the test version meets this bar, Convoy increases its traffic share
    • Rollback — when the test version falls below this bar, Convoy cancels the test and sends all traffic back to stable
[Screenshot: Deploy New Test Version dialog with fields for version name and test URL, plus an LLM Judge section with a judge model selector and an evaluation criteria prompt]
[Screenshot: Thresholds configuration with two columns: Ready for Promotion (latency p95 below, error rate below, judge score above, token cost below) and Rollback with the inverse thresholds]
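Because the judge sees only each session's input and output, an effective judge prompt states the evaluation criteria explicitly rather than assuming context. An illustrative example (not an official template):

```
You are evaluating one session of a customer-support chatbot.
Score it from 0 to 10 based on:
- Does the answer resolve the user's question?
- Is the tone polite and on-brand?
- Does it avoid fabricating policy details?
Return only the numeric score.
```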
Thresholds are evaluated against the metrics your agent reports via the session ingest endpoint — latency, error rate, cost, and judge scores. If your agent doesn’t report a metric, its threshold can’t be evaluated. For multi-step sessions, latency and cost are summed across all steps to produce session totals.
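The roll-up for multi-step sessions can be sketched as follows; the field names and step structure here are hypothetical, not Convoy's documented ingest schema:

```python
# Per-step metrics reported for one multi-step session (hypothetical shape).
steps = [
    {"latency_ms": 420, "cost_usd": 0.0031, "error": False},
    {"latency_ms": 780, "cost_usd": 0.0054, "error": False},
]

# Latency and cost are summed across steps to produce session totals,
# which is what the thresholds are evaluated against.
session = {
    "latency_ms": sum(s["latency_ms"] for s in steps),
    "cost_usd": round(sum(s["cost_usd"] for s in steps), 6),
    "error": any(s["error"] for s in steps),
}
# The session totals would then be reported via the session ingest endpoint.
```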
You can also adjust advanced rollout-plan and evaluation settings to suit your traffic volume.

Step 4: Monitor and act

After deploying, the agent page is your control center:
  • Rollout status — current traffic split, session counts, and judge scores
  • Pause — freeze the traffic split to investigate
  • Modify traffic — manually adjust the percentage going to the test version
  • Roll back — cancel the test and send all new sessions to stable
  • Promote — mark the test as the new stable. Merge your changes in your codebase, deploy them to your stable environment, then promote on Convoy to route all traffic to stable
[Screenshot: Agent dashboard showing the agent name, current traffic percentage, action buttons (Rollback Test, Promote to Stable, Pause Test, Set Traffic %), and the ongoing window with live latency p95, error rate, judge average, and token cost metrics against promotion and rollback threshold indicators]
[Screenshot: Completed Windows Over Time heatmap showing 25 evaluation windows across latency, error rate, judge, and cost rows, where each cell is green for pass, red for fail, or gray for insufficient data, with a traffic row below showing automatic ramp-up from 5% to 100%]
[Screenshot: Agent version list table with columns for version name, status badge (Test, Stable, or Sunset), pinned sessions, latency p95, error rate, judge average, token cost, and last activity timestamp]