Tackling Token Limits and Optimizing AI Performance
In our journey to develop an AI-driven system for analyzing bank earnings call transcripts, one of the biggest challenges was managing token limitations when using transformer models. Balancing context retention, model efficiency, and financial accuracy led us through several iterations—from Flan-T5 to BigBird-Pegasus, and ultimately to GPT-3.5-turbo. This process involved weighing the trade-offs between open-source flexibility and proprietary model constraints while also addressing the impact of token limits on a multi-agent Retrieval-Augmented Generation (RAG) system for financial analysis.
The Problem: Token Limits and Multi-Agent Financial Context
Earnings call transcripts are long and detailed, with executives discussing quarterly performance, market risks, and strategic outlooks. Financial analysts and regulators rely on these transcripts to extract insights, but manually reviewing them is time-consuming. To streamline this process, we implemented a multi-agent RAG system designed to retrieve, analyze, and summarize transcript data efficiently.
The system included a master agent that routed queries to bank-specific retrieval agents for JP Morgan and UBS. A document ingestion pipeline processed PDFs, applied sentence-transformer embeddings, and indexed the data in ChromaDB for retrieval. Initially, we used Flan-T5 for text generation due to its strong summarization capabilities, but its 512-token limit proved to be a major bottleneck. Even after retrieving relevant text chunks, we had to split and summarize them further before passing them to the model. This often meant losing critical financial context, adding processing complexity, and making it harder for agents to pass information to one another.
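As a rough sketch of that pipeline (the PDF reader, embedding model, chunk size, and collection name here are illustrative assumptions, not necessarily what we shipped):

```python
import chromadb
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # sentence-transformer embeddings
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("earnings_calls")

def ingest_transcript(pdf_path: str, bank: str) -> None:
    """Extract text from a transcript PDF, embed it, and index it in ChromaDB."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    # Naive fixed-size character chunking; the real pipeline can split more carefully.
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    collection.add(
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"bank": bank}] * len(chunks),
        ids=[f"{bank}-{i}" for i in range(len(chunks))],
    )

def retrieve(question: str, bank: str, k: int = 5) -> list[str]:
    """Bank-specific retrieval: the master agent routes by choosing `bank`."""
    result = collection.query(
        query_embeddings=embedder.encode([question]).tolist(),
        n_results=k,
        where={"bank": bank},
    )
    return result["documents"][0]
```

Filtering on the `bank` metadata field is what lets the master agent route a query to the JP Morgan or UBS retrieval agent without maintaining separate indexes.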
While Flan-T5 was open-source and easy to deploy, its inability to process longer sections of text in a single pass made it unsuitable for handling long financial documents.
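In practice, the workaround looked roughly like the sketch below: tokenize the retrieved text, slice it into windows that fit under the 512-token cap, summarize each window, and stitch the partial summaries back together. The checkpoint, window size, and prompt prefix are assumptions for illustration.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def summarize_long_text(text: str, window: int = 480) -> str:
    """Split text into windows under Flan-T5's 512-token limit and summarize each."""
    token_ids = tokenizer(text)["input_ids"]
    partials = []
    for start in range(0, len(token_ids), window):
        piece = tokenizer.decode(token_ids[start:start + window],
                                 skip_special_tokens=True)
        inputs = tokenizer("summarize: " + piece, return_tensors="pt",
                           truncation=True, max_length=512)
        output = model.generate(**inputs, max_new_tokens=128)
        partials.append(tokenizer.decode(output[0], skip_special_tokens=True))
    # Stitching partial summaries back together is exactly where
    # financial context tended to get lost.
    return " ".join(partials)
```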
Exploring Alternatives: BigBird-Pegasus and Context Expansion
To overcome these limitations, we experimented with BigBird-Pegasus, a model designed for long-context processing through sparse attention mechanisms. With a 4,096-token capacity, it offered a significant improvement over Flan-T5, allowing for better summarization without excessive truncation.
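A minimal sketch of single-pass long-context summarization with BigBird-Pegasus is below; the public checkpoint shown is a stand-in for the fine-tuned variant we actually evaluated:

```python
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

model_name = "google/bigbird-pegasus-large-arxiv"  # public checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(model_name)

def summarize(transcript_chunk: str) -> str:
    """Summarize up to 4,096 input tokens in a single pass (vs. 512 for Flan-T5)."""
    inputs = tokenizer(transcript_chunk, return_tensors="pt",
                       truncation=True, max_length=4096)
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```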
While BigBird-Pegasus improved long-form summarization and retrieval accuracy, it introduced new challenges. Its higher computational requirements made real-time querying slower, and it required extensive fine-tuning to adapt to financial language. Integrating it into the retrieval workflow also required adjusting how transcript chunks were passed through the system to ensure coherence across multiple agents.
Despite the improvements, some financial details were still lost in edge cases due to summarization artifacts. The model was a step forward but not the ideal solution for balancing token capacity, response coherence, and inference speed.
Transitioning to GPT-3.5: A Trade-Off Between Cost and Performance
After evaluating different options, we transitioned to GPT-3.5-turbo, which offered a much larger context window (up to 16k tokens in its extended-context variant) while maintaining strong retrieval-augmented generation capabilities. Its ability to process larger transcript sections in a single pass reduced the need for aggressive chunking and summarization, improving retrieval accuracy and response coherence.
The shift improved efficiency in several ways. The need for extensive pre-processing was reduced, allowing agents to pass longer excerpts without worrying about token constraints. The retrieval process became more accurate since larger context windows helped maintain the integrity of financial insights.
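To make the contrast concrete, here is a hedged sketch of an agent passing whole retrieved excerpts straight to GPT-3.5-turbo in one call; the helper name and prompt wording are illustrative rather than our exact production prompts:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_query(question: str, retrieved_chunks: list[str]) -> str:
    """Answer a financial question using retrieved transcript excerpts as context."""
    context = "\n\n".join(retrieved_chunks)  # full chunks, no pre-summarization pass
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are a financial analyst assistant. "
                        "Answer using only the provided transcript excerpts."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```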
However, the transition came with trade-offs. Unlike Flan-T5 and BigBird-Pegasus, GPT-3.5 is a proprietary model that requires API-based access, adding cost considerations. Optimizing retrieval strategies became critical to minimize unnecessary token usage, and API latency had to be managed to maintain response times. Limited fine-tuning control also meant that adaptation to finance-specific terminology relied on prompt engineering rather than direct model modifications.
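On the cost side, one simple control is to cap how much retrieved context each request may consume. The sketch below uses tiktoken to enforce such a budget; the 3,000-token figure and the assumption that chunks arrive ranked by retrieval score are ours, not fixed properties of the system.

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fit_to_budget(chunks: list[str], budget: int = 3000) -> list[str]:
    """Keep the highest-ranked retrieved chunks that fit within a token budget."""
    kept, used = [], 0
    for chunk in chunks:  # assumed to be ordered best-first by retrieval score
        cost = len(encoding.encode(chunk))
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```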
Despite these challenges, GPT-3.5 provided the best balance between efficiency, accuracy, and scalability. The cost trade-off was justified by the reduction in pre-processing complexity and the improvement in retrieval quality.
Lessons Learned and Future Considerations
Developing a financial analysis system using multi-agent RAG required careful management of token limitations, retrieval strategies, and agent communication. Open-source models like Flan-T5 offered flexibility but struggled with strict token caps, leading to excessive chunking and loss of important financial details. BigBird-Pegasus improved long-context handling but required significant computational resources and fine-tuning. GPT-3.5-turbo emerged as the most practical choice, reducing the need for aggressive summarization while improving retrieval quality, but at the cost of API fees and limited fine-tuning control.
To further improve efficiency, hybrid approaches could be explored, such as using open-source models for initial document processing and reserving GPT-based systems for more complex summarization tasks. Enhancing agent memory mechanisms could also help retain relevant context across multiple interactions. Another avenue worth exploring is adaptive query expansion, where agents dynamically adjust retrieval scope based on the complexity of the user’s question.
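A hypothetical sketch of such a hybrid router, reusing the helpers sketched earlier (`summarize` for the local model, `answer_query` for GPT-3.5); the keyword heuristic for "complexity" is purely illustrative:

```python
def route_query(question: str, chunks: list[str]) -> str:
    """Send routine summarization to a local model; escalate complex queries."""
    complex_markers = ("compare", "trend", "outlook", "risk", "why")
    if any(marker in question.lower() for marker in complex_markers):
        return answer_query(question, chunks)   # proprietary path: GPT-3.5-turbo
    return summarize(" ".join(chunks))          # local path: open-source summarizer
```

In a real deployment the routing signal could be a classifier or the retrieval scores themselves rather than keywords, but the structure stays the same: cheap local inference by default, paid API calls only when the query warrants them.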
Managing token limitations in a multi-agent RAG system is not just about choosing a model with a higher token limit—it’s about optimizing how agents retrieve, process, and share financial data in a scalable and cost-effective manner.