The ProblemIt usually happens when you're finally testing your app with real-world data. You feed a long PDF or a week-long chat history into the model, and the OpenAI API rejects it instantly. This context_length_exceeded error isn't a bug in your code logic; it's a hard physical limit. Every model has a fixed 'memory' capacity. For GPT-4o, that limit is 128,000 tokens—roughly 300 pages of text—including both your input and the model's generated response.
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 145823 tokens.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
Why the API Rejects Your RequestOpenAI models like GPT-4o and GPT-4-turbo use a 128k token window. If you send 145,823 tokens, the API won't try to process a partial response. It simply shuts down the request to prevent inefficient compute usage. This typically occurs because of 'context bloat'—retaining too many previous chat turns or dumping raw, uncleaned data directly into the prompt.
Step 1: Count Tokens Locally Before SendingReliable LLM applications never 'guess' if a prompt will fit. Instead, they validate the token count before the request ever leaves the server. Using the tiktoken library, you can calculate exactly how the model sees your text. Note that GPT-4o uses the o200k_base encoding, while earlier GPT-4 models use cl100k_base.
import tiktoken
def num_tokens_from_messages(messages, model="gpt-4o"):
try:
encoding = tiktoken.encoding_for_model(model)
except KeyError:
encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = 0
for message in messages:
# Every message follows <im_start>{role/name}\n{content}<im_end>\n
num_tokens += 3
for key, value in message.items():
num_tokens += len(encoding.encode(value))
if key == "name":
num_tokens += 1
num_tokens += 3 # Every reply is primed with <im_start>assistant
return num_tokens
Step 2: Apply a Sliding Window StrategyWhen the conversation gets too long, you have to decide what to forget. A 'sliding window' approach keeps the essential system instructions but drops the oldest user/assistant exchanges. This maintains the immediate context of the conversation while staying under the 128k limit.
def trim_messages(messages, max_tokens=120000, model="gpt-4o"):
"""Drops the oldest messages until the total is within the safety buffer."""
system_message = [m for m in messages if m['role'] == 'system']
chat_history = [m for m in messages if m['role'] != 'system']
while num_tokens_from_messages(system_message + chat_history, model) > max_tokens:
if len(chat_history) > 1:
chat_history.pop(0) # Remove the oldest exchange
else:
# If one message is still too big, truncate the string directly
chat_history[0]['content'] = chat_history[0]['content'][:5000]
break
return system_message + chat_history
Step 3: Move to RAG for Massive DatasetsTruncation works for chat, but it fails if you need to analyze a 1,000-page technical manual. In that scenario, don't send the manual. Use Retrieval Augmented Generation (RAG) to find the needles in the haystack. Break your data into 500-token chunks, store them in a vector database like Pinecone or ChromaDB, and only inject the top 5 most relevant snippets into your prompt. This keeps your token count low—often under 10,000—while providing the model with the exact facts it needs.
Verification and Safety BuffersNever aim for the exact limit. If a model has a 128,000-token limit and you send 127,990 tokens, the model only has 10 tokens left for its answer. This results in a 'cut off' response. Always leave a safety buffer of at least 10-20% for the output.
# Example: Setting a safe threshold
MAX_ALLOWED = 115000
current_usage = num_tokens_from_messages(messages)
if current_usage > MAX_ALLOWED:
print(f"Trimming: {current_usage} tokens exceeds safety threshold.")
messages = trim_messages(messages, max_tokens=MAX_ALLOWED)
Quick Optimization Tips- Summarize History: Instead of deleting old messages, ask GPT to summarize the first 20 turns into a single paragraph and use that as the new starting point.- Check Model Versions: If you are hitting limits at 8,000 tokens, you might be using the legacy gpt-4 or gpt-3.5-turbo. Switch to gpt-4o for the full 128k window.- Clean Your Data: Remove unnecessary whitespace, HTML tags, or metadata from your inputs to save 5-10% on token costs immediately.