Fix cohere.TooManyRequestsError 429: Cohere Trial Key Rate Limit Exceeded

The Error

cohere.errors.too_many_requests_error.TooManyRequestsError: status_code: 429, body: You are using a Trial key, which is limited to 10 API calls / minute. Upgrade to a Production key to increase your rate limit.

Ten calls per minute. That's the hard ceiling on a Cohere Trial key, and the SDK throws TooManyRequestsError the instant you cross it — no automatic retry, no grace period, just a 429.

Root Cause

Cohere Trial keys cap you at exactly 10 API calls per minute. Call number eleven comes back as HTTP 429, immediately. Common culprits:

Looping over a list of inputs and calling co.embed() or co.chat() without any delay
Parallel async calls with asyncio that all fire at once
Running a script multiple times in quick succession during development
Batch processing that sends all items in a tight loop

Step-by-Step Fix

Option 1 — Add a simple sleep between calls (quickest fix)

Small list, no rush? Drop a 6-second delay between calls. That math works out to exactly 10 calls per minute — right at the Trial key ceiling.

import cohere
import time

co = cohere.ClientV2("YOUR_API_KEY")

texts = ["text one", "text two", "text three"]  # your inputs

for text in texts:
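    response = co.embed(
        texts=[text],
        model="embed-english-v3.0",
        input_type="search_document",
        embedding_types=["float"],
    )
    # ClientV2 groups embeddings by type; the float vectors are under .float_
    print(response.embeddings.float_)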
    time.sleep(6)  # 10 calls/min = 1 call per 6 seconds
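
A fixed 6-second sleep ignores the time each request itself takes, so the loop runs a little slower than it needs to. One way to tighten that up is a small pacing generator that sleeps only for whatever is left of each interval. This is a sketch that assumes the limit behaves like a steady per-minute rate. The name paced and its parameters are made up for this example, and the clock and sleep arguments exist only so the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    items: Iterable[T],
    calls_per_minute: float = 9,  # assumption: one call of headroom under 10
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items no faster than calls_per_minute (hypothetical helper)."""
    interval = 60.0 / calls_per_minute  # 10/min -> 6.0 s, 9/min -> about 6.67 s
    next_allowed = clock()
    for item in items:
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)  # wait only for the rest of the interval
        # Schedule the next slot relative to when this one actually started.
        next_allowed = max(clock(), next_allowed) + interval
        yield item
```

To use it, wrap your input list, for example `for text in paced(texts): ...`, and drop the time.sleep(6) from the loop body.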

Option 2 — Retry with exponential backoff (recommended for production-like code)

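Rather than letting a 429 crash your script, catch the error and retry with increasing wait times. Start at 10 seconds and double after each failure: 10s → 20s → 40s → 80s. With the default of five attempts there are at most four waits; if the fifth attempt also fails, the error is re-raised.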

import cohere
import time
from cohere.errors import TooManyRequestsError

co = cohere.ClientV2("YOUR_API_KEY")

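def embed_with_retry(text: str, max_retries: int = 5) -> list:
    """Embed one text, retrying on a 429 with exponential backoff."""
    wait = 10  # seconds before the first retry
    for attempt in range(max_retries):
        try:
            response = co.embed(
                texts=[text],
                model="embed-english-v3.0",
                input_type="search_document",
                embedding_types=["float"],
            )
            # ClientV2 groups embeddings by type; the float vectors are under .float_
            return response.embeddings.float_[0]
        except TooManyRequestsError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the 429 to the caller
            print(f"Rate limited. Waiting {wait}s before retry {attempt + 1}/{max_retries - 1}...")
            time.sleep(wait)
            wait *= 2  # exponential backoff: 10s → 20s → 40s → 80s
    return []  # only reached if max_retries <= 0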

# Usage
result = embed_with_retry("Hello, world")
print(result)
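
If several workers hit the limit at once, they will also all wake up at once and collide again. A common refinement is to cap the delay and add a little random jitter. The sketch below is a generic wrapper, not part of the Cohere SDK: retry_with_backoff, its parameters, and the 10% jitter figure are all assumptions chosen for illustration, and the sleep argument is injectable so the logic can be tested without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_wait: float = 10.0,
    max_wait: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on a retryable error, wait and try again (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Cap the exponential delay, then add up to 10% random jitter
            # so concurrent workers do not all retry at the same moment.
            wait = min(base_wait * (2 ** attempt), max_wait)
            sleep(wait + random.uniform(0, wait * 0.1))
    raise AssertionError("unreachable")
```

With the client from Option 2 you would pass your own call as a lambda, for example `retry_with_backoff(lambda: co.embed(...), (TooManyRequestsError,))`.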

Option 3 — Batch inputs into a single call

Here's the biggest lever available to you: the embed endpoint accepts a full list of texts in one shot. Passing 50 texts in a single co.embed() call uses the same 1 quota unit as passing just 1. Send everything at once instead of looping item by item.

import cohere

co = cohere.ClientV2("YOUR_API_KEY")

texts = [
    "First document to embed",
    "Second document to embed",
    "Third document to embed",
    # ... up to 96 texts per call (the Embed API's documented maximum)
]

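response = co.embed(
    texts=texts,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"],
)

# ClientV2 groups embeddings by type; the float vectors are under .float_
for i, embedding in enumerate(response.embeddings.float_):
    print(f"Text {i}: {len(embedding)} dimensions")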

Each co.embed() call counts as one API call regardless of how many texts you pass in, up to the endpoint's limits (96 texts per call, each within the model's token limit). That means 96 texts cost 1 quota unit instead of 96. Use this before reaching for any sleep or retry logic.
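
For lists longer than one call can hold, you can split the inputs into batches first. The helper below is a minimal sketch that assumes the Embed endpoint accepts at most 96 texts per call; chunked and MAX_TEXTS_PER_CALL are names invented for this example, so check the cap against the current API reference before relying on it.

```python
from typing import Iterator, List, Sequence

# Assumption: per-call cap on texts for the Embed endpoint.
MAX_TEXTS_PER_CALL = 96


def chunked(texts: Sequence[str], size: int = MAX_TEXTS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` texts (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])
```

Each batch then goes into a single co.embed(texts=batch, ...) call, so 200 texts cost 3 calls rather than 200.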

Option 4 — Use a token bucket / rate limiter

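Running a longer synchronous pipeline? The ratelimit library lets you declare the cap once as a decorator and forget about it. One caveat: sleep_and_retry pauses with time.sleep(), which blocks the whole thread, so it is not a good fit inside asyncio code.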

pip install ratelimit

import cohere
from ratelimit import limits, sleep_and_retry

co = cohere.ClientV2("YOUR_API_KEY")

CALLS_PER_MINUTE = 9  # stay under 10 to give headroom
ONE_MINUTE = 60

@sleep_and_retry
@limits(calls=CALLS_PER_MINUTE, period=ONE_MINUTE)
def rate_limited_chat(message: str) -> str:
    response = co.chat(
        model="command-r-plus",
        messages=[{"role": "user", "content": message}],
    )
    return response.message.content[0].text

# Now you can call this in a loop without worrying about 429s
for msg in ["What is Python?", "Explain APIs", "What are embeddings?"]:
    print(rate_limited_chat(msg))
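
The decorator above waits with a blocking sleep, which would stall an asyncio event loop. For the parallel-asyncio case listed under Root Cause, here is a minimal sketch of a rolling-window limiter built only on the standard library. AsyncRateLimiter is a hypothetical helper written for this article, not part of the Cohere SDK or of ratelimit, and it assumes the limit is enforced over a rolling 60-second window.

```python
import asyncio
import time
from collections import deque
from typing import Deque


class AsyncRateLimiter:
    """Allow at most max_calls acquisitions in any rolling `period` seconds."""

    def __init__(self, max_calls: int = 9, period: float = 60.0) -> None:
        # Create the limiter inside a running event loop so the lock
        # belongs to that loop on older Python versions.
        self.max_calls = max_calls
        self.period = period
        self._stamps: Deque[float] = deque()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        # Holding the lock while waiting keeps callers in arrival order.
        async with self._lock:
            while True:
                now = time.monotonic()
                # Forget calls that have aged out of the window.
                while self._stamps and now - self._stamps[0] >= self.period:
                    self._stamps.popleft()
                if len(self._stamps) < self.max_calls:
                    self._stamps.append(now)
                    return
                # Window is full: wait until the oldest call expires.
                await asyncio.sleep(self.period - (now - self._stamps[0]))
```

Share one limiter across all tasks and put `await limiter.acquire()` directly before each awaited API call; tasks launched together then queue up instead of firing at once.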

Option 5 — Upgrade to a Production key

Need real throughput? The Trial key won't get you there. Go to dashboard.cohere.com → API Keys → create a Production key. Production keys support thousands of calls per minute depending on your plan — required for anything beyond local testing.

Verify the Fix

Paste this smoke test to confirm your fix actually holds. It sends 12 calls (two more than the per-minute limit) with a 6-second gap between them, so expect it to run for a little over a minute:

# Smoke test — 12 calls at 10/min to verify the rate limiter holds
import cohere
import time
from cohere.errors import TooManyRequestsError

co = cohere.ClientV2("YOUR_API_KEY")

success = 0
for i in range(12):
    try:
        co.embed(
            texts=[f"test text {i}"],
            model="embed-english-v3.0",
            input_type="search_document",
            embedding_types=["float"],
        )
        success += 1
        time.sleep(6)  # 6s gap = 10 calls/min
    except TooManyRequestsError as e:
        print(f"Still rate limited at call {i}: {e}")
        break

print(f"{success}/12 calls succeeded")

Seeing 12/12 calls succeeded means you're clear.
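
If you would rather check a run without spending quota, you can record a timestamp before each call and inspect the list afterwards. The function below is a sketch for that purpose; max_calls_in_window is a made-up name, and it assumes the limit applies to a rolling 60-second window, which the error message does not actually spell out.

```python
from collections import deque
from typing import Deque, Iterable


def max_calls_in_window(timestamps: Iterable[float], window: float = 60.0) -> int:
    """Return the largest number of calls inside any rolling `window` seconds."""
    recent: Deque[float] = deque()
    peak = 0
    for t in sorted(timestamps):
        recent.append(t)
        # Drop calls that are a full window (or more) older than this one.
        while t - recent[0] >= window:
            recent.popleft()
        peak = max(peak, len(recent))
    return peak
```

Feed it the values you logged with time.monotonic(); a result of 10 or less means the run stayed within the Trial limit under that assumption.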

Tips

Check your key type first: Head to dashboard.cohere.com → API Keys. Trial keys are labeled "Trial". If yours already shows "Production" and you're still hitting 429s, something else is causing the limit.
Batch before you sleep: Always try batching multiple texts into one co.embed() call before reaching for time.sleep(). One call with 96 texts is far more efficient than 96 calls with one text each.
Target 9 calls/minute, not 10: Give yourself a one-call buffer to account for clock drift and any background requests your code might trigger invisibly.
Log your waits during development: Print something before each time.sleep(). Accidental loops become obvious — instead of the script silently hanging, you'll see exactly when it's throttling itself.
Check whether you actually need chat: A chat call handles one prompt per quota unit, while a single embed call covers a whole batch of texts. If you keep hitting the limit, see whether batched embeddings can replace what you're doing.