The Error
cohere.errors.too_many_requests_error.TooManyRequestsError: status_code: 429, body: You are using a Trial key, which is limited to 10 API calls / minute. Upgrade to a Production key to increase your rate limit.
Ten calls per minute. That's the hard ceiling on a Cohere Trial key, and the SDK throws TooManyRequestsError the instant you cross it: no automatic retry, no grace period, just a 429.
Root Cause
Cohere Trial keys cap you at exactly 10 API calls per minute. Call number eleven comes back as HTTP 429, immediately. Common culprits:
- Looping over a list of inputs and calling co.embed() or co.chat() without any delay
- Parallel async calls with asyncio that all fire at once
- Running a script multiple times in quick succession during development
- Batch processing that sends all items in a tight loop
Step-by-Step Fix
Option 1: Add a simple sleep between calls (quickest fix)
Small list, no rush? Drop a 6-second delay between calls. That works out to at most 10 calls per minute, right at the Trial key ceiling.
import cohere
import time

co = cohere.ClientV2("YOUR_API_KEY")
texts = ["text one", "text two", "text three"]  # your inputs

for text in texts:
    response = co.embed(
        texts=[text],
        model="embed-english-v3.0",
        input_type="search_document",
        embedding_types=["float"],  # the v2 embed endpoint expects this
    )
    print(response.embeddings)
    time.sleep(6)  # 10 calls/min = 1 call per 6 seconds
Option 2: Retry with exponential backoff (recommended for production-like code)
Rather than letting a 429 crash your script, catch the error and retry with increasing wait times. Start at 10 seconds and double after each failed attempt: 10s → 20s → 40s → 80s. With the default of five attempts there are at most four waits; if the fifth attempt also fails, the error is re-raised.
import cohere
import time
from cohere.errors import TooManyRequestsError

co = cohere.ClientV2("YOUR_API_KEY")

def embed_with_retry(text: str, max_retries: int = 5) -> list:
    wait = 10  # seconds before the first retry
    for attempt in range(max_retries):
        try:
            response = co.embed(
                texts=[text],
                model="embed-english-v3.0",
                input_type="search_document",
                embedding_types=["float"],  # the v2 embed endpoint expects this
            )
            # v2 responses group embeddings by type; float vectors are under .float_
            return response.embeddings.float_[0]
        except TooManyRequestsError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the 429 to the caller
            print(f"Rate limited. Waiting {wait}s before retry {attempt + 1}/{max_retries - 1}...")
            time.sleep(wait)
            wait *= 2  # exponential backoff: 10s → 20s → 40s → 80s
    return []  # only reached when max_retries <= 0

# Usage
result = embed_with_retry("Hello, world")
print(result)
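To see how long the retry loop above can block in the worst case, the wait schedule can be computed directly. This is a small sketch: backoff_schedule is a hypothetical helper, not part of the Cohere SDK, and its defaults simply mirror the 10-second start, doubling factor, and five attempts used in embed_with_retry.

```python
def backoff_schedule(initial_wait: int = 10, max_retries: int = 5, factor: int = 2) -> list:
    """Return the waits (in seconds) slept between consecutive attempts."""
    # One wait sits between each pair of attempts, so max_retries attempts
    # produce max_retries - 1 waits; the final failure is raised, not slept on.
    return [initial_wait * factor**i for i in range(max(max_retries - 1, 0))]

waits = backoff_schedule()
print(waits, sum(waits))  # [10, 20, 40, 80] 150
```

So a call that is rate limited on every attempt blocks for about two and a half minutes before the error surfaces. If that is too long for your script, lower max_retries or the initial wait.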
Option 3: Batch inputs into a single call
Here's the biggest lever available to you: the embed endpoint accepts a full list of texts in one shot. Passing 50 texts in a single co.embed() call uses the same 1 quota unit as passing just 1. Send everything at once instead of looping item by item.
import cohere

co = cohere.ClientV2("YOUR_API_KEY")
texts = [
    "First document to embed",
    "Second document to embed",
    "Third document to embed",
    # ... up to 96 texts in one call
]

response = co.embed(
    texts=texts,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"],  # the v2 embed endpoint expects this
)

# v2 responses group embeddings by type; float vectors are under .float_
for i, embedding in enumerate(response.embeddings.float_):
    print(f"Text {i}: {len(embedding)} dimensions")
Each co.embed() call counts as one API call regardless of how many texts you pass in, up to the endpoint's per-call cap (96 texts at the time of writing, and subject to the token limit). That means 96 texts cost 1 quota unit instead of 96. Use this before reaching for any sleep or retry logic.
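When the input list is larger than a single call allows, it can be split into batches first. The sketch below assumes a per-call cap of 96 texts (check the current embed documentation for the exact figure); chunked and MAX_TEXTS_PER_CALL are illustrative names, not part of the Cohere SDK, and the API call itself is left as a comment so the snippet runs without a key.

```python
from typing import Iterator, List

MAX_TEXTS_PER_CALL = 96  # assumption: per-call cap for embed; verify against current docs

def chunked(items: List[str], size: int = MAX_TEXTS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = [f"document {i}" for i in range(250)]
batches = list(chunked(texts))
print(len(batches), [len(b) for b in batches])  # 3 [96, 96, 58]

# Each batch would then go out as one co.embed(texts=batch, ...) call,
# so 250 texts use 3 calls of the per-minute quota instead of 250.
```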
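Option 4: Use a token bucket / rate limiter
Running a more complex pipeline? The ratelimit library lets you declare the cap once as a decorator and forget about it. Note that sleep_and_retry waits with a blocking time.sleep(), so in asyncio code it stalls the event loop unless the decorated function runs in a worker thread.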
pip install ratelimit
import cohere
from ratelimit import limits, sleep_and_retry

co = cohere.ClientV2("YOUR_API_KEY")

CALLS_PER_MINUTE = 9  # stay under 10 to give headroom
ONE_MINUTE = 60

@sleep_and_retry
@limits(calls=CALLS_PER_MINUTE, period=ONE_MINUTE)
def rate_limited_chat(message: str) -> str:
    response = co.chat(
        model="command-r-plus",
        messages=[{"role": "user", "content": message}],
    )
    return response.message.content[0].text

# Now you can call this in a loop without worrying about 429s
for msg in ["What is Python?", "Explain APIs", "What are embeddings?"]:
    print(rate_limited_chat(msg))
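For asyncio pipelines, where many coroutines would otherwise fire at once, the same cap can be enforced without blocking the event loop. What follows is a minimal sliding-window limiter written with the standard library only; it is not the ratelimit package, and AsyncRateLimiter is a hypothetical name. The demo uses tiny numbers (3 calls per 0.2 seconds) so it finishes quickly; for a Trial key you would construct it with max_calls=9 and period=60 and await acquire() before each API call.

```python
import asyncio
import time
from collections import deque

class AsyncRateLimiter:
    """Allow at most max_calls acquisitions per rolling `period` seconds."""

    def __init__(self, max_calls: int, period: float) -> None:
        self.max_calls = max_calls
        self.period = period
        self._stamps = deque()       # monotonic times of recent acquisitions
        self._lock = asyncio.Lock()  # serializes waiters so they queue in order

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # Forget acquisitions that have already left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) >= self.max_calls:
                # Window is full: sleep until the oldest acquisition expires.
                await asyncio.sleep(self.period - (now - self._stamps[0]))
                self._stamps.popleft()
            self._stamps.append(time.monotonic())

async def demo() -> list:
    limiter = AsyncRateLimiter(max_calls=3, period=0.2)
    start = time.monotonic()
    elapsed = []

    async def worker() -> None:
        await limiter.acquire()  # a real worker would make its API call after this
        elapsed.append(time.monotonic() - start)

    await asyncio.gather(*(worker() for _ in range(5)))
    return elapsed

elapsed = asyncio.run(demo())
print([round(t, 2) for t in elapsed])
```

The first three workers pass immediately; the fourth and fifth are held until the window frees a slot. Holding the lock while sleeping is deliberate: it keeps later callers queued behind the one that is already waiting.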
Option 5: Upgrade to a Production key
Need real throughput? The Trial key won't get you there. Go to dashboard.cohere.com → API Keys → create a Production key. Production keys support thousands of calls per minute depending on your plan, and are required for anything beyond local testing.
Verify the Fix
Paste this smoke test to confirm your fix actually holds. It sends 12 calls, two over the per-minute limit, with a 6-second gap between each one:
# Smoke test: 12 calls paced at 10/min to verify the rate limiter holds
import cohere
import time
from cohere.errors import TooManyRequestsError

co = cohere.ClientV2("YOUR_API_KEY")
success = 0

for i in range(12):
    try:
        co.embed(
            texts=[f"test text {i}"],
            model="embed-english-v3.0",
            input_type="search_document",
            embedding_types=["float"],  # the v2 embed endpoint expects this
        )
        success += 1
        time.sleep(6)  # 6s gap = 10 calls/min
    except TooManyRequestsError as e:
        print(f"Still rate limited at call {i}: {e}")
        break

print(f"{success}/12 calls succeeded")
Seeing 12/12 calls succeeded means you're clear.
Tips
- Check your key type first: Head to dashboard.cohere.com → API Keys. Trial keys are labeled "Trial". If yours already shows "Production" and you're still hitting 429s, something else is causing the limit.
- Batch before you sleep: Always try batching multiple texts into one co.embed() call before reaching for time.sleep(). One call with 96 texts is far more efficient than 96 calls with one text each.
- Target 9 calls/minute, not 10: Give yourself a one-call buffer to account for clock drift and any background requests your code might trigger invisibly.
- Log your waits during development: Print something before each time.sleep(). Accidental loops become obvious: instead of the script silently hanging, you'll see exactly when it's throttling itself.
- Check whether you actually need chat: Rerank and chat endpoints cost more per call. If you're bumping the limit, see whether batch embeddings can replace what you're doing, since they handle many texts per quota unit.

