A Retrospective on GenAI Token Consumption and the Role of Caching

Caching is an important technique for enhancing the performance and cost efficiency of diverse cloud native applications, including modern generative AI applications. By retaining frequently accessed data or the computationally expensive results of AI model inferences, AI applications can significantly reduce latency and also lower token consumption costs. This optimization allows systems to handle larger workloads with greater cost efficiency, mitigating the often overlooked expenses associated with frequent AI model interactions. 

This retrospective discusses the emerging coding practices in software development using AI tools, their hidden costs, and various caching techniques directly applicable to reducing token generation costs.

Shift in Writing Code With the Use of Copilots

Instead of solely focusing on writing every line of code, developers are increasingly leveraging copilot tools to automate repetitive tasks. Vibe coding refers to a development approach where developers rely primarily on AI-generated code suggestions without necessarily diving deep into the underlying logic or critically evaluating the output. The perceived initial time savings from vibe coding can evaporate in the long run, as the costs of fixing issues, managing the codebase, and potentially rewriting significant portions can far outweigh the early efficiency gains. Per TechCrunch, “GitHub Copilot, an AI coding tool offered by Microsoft-owned GitHub, has now reached more than 20 million users. It is used by 90% of the Fortune 100”. A few common developer interaction patterns with copilots include:

  1. Code generation: Importing standard libraries, setting up basic class structures, or initializing variables. E.g., “Can you create a class …”, “Can you import a library”, etc.
  2. Bug fixes: Finding bugs in code or at runtime, e.g., “Can you fix the key not found error …”
  3. Code optimizations: Copilot might suggest an improvement to an existing solution, e.g., “Can you reduce the time taken by this for loop …”, etc.
  4. Documentation: Authoring comments. E.g., “Can you write an example input and output for this method …” etc.
  5. Test cases: For a given function or component, Copilot might generate a corresponding set of test cases, e.g., “Convert the Python tests to scripts …”, etc.

Hidden Costs Involved in Such Coding Practices

The cost implications of using AI tools such as copilots can be substantial, particularly in terms of token generation. Each interaction with the AI model, whether providing context (input tokens) or receiving generated code (output tokens), incurs a cost based on the number of tokens processed. While AI tools can seem to accelerate development initially, vibe coding can lead to longer, more verbose prompts if the developer isn’t precise in their requests, inflating input token consumption.

Many LLMs have context windows that define the amount of text they can process in a single request. If a developer asks for a small detail from a large document, but the system’s design necessitates sending the entire document as part of the input, then the token cost can increase. For example, extracting a single function definition from a 1500-line Python file or summarizing a short paragraph from a lengthy specification document could involve passing thousands of tokens representing the full document, even if only a small part is relevant.
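
The savings from trimming context before it reaches the model can be illustrated with a short sketch. This is a minimal example, assuming a hypothetical local file big_module.py and a hypothetical function name parse_config; it extracts only the relevant function and compares a rough token estimate (about four characters per token) for the snippet versus the full file.

```python
# A minimal sketch: send only the relevant function, not the whole file.
# "big_module.py" and "parse_config" are hypothetical placeholders, and the
# token count uses a rough ~4-characters-per-token approximation.
import ast

def extract_function(source: str, name: str) -> str:
    """Return the source of a single function from a larger module."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.get_source_segment(source, node)
    raise ValueError(f"function {name!r} not found")

with open("big_module.py") as f:
    full_source = f.read()

snippet = extract_function(full_source, "parse_config")

print("approx. input tokens, full file:", len(full_source) // 4)
print("approx. input tokens, snippet:  ", len(snippet) // 4)
```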

On another dimension, there are profound dangers in over-relying on autonomous AI systems without a thorough understanding of their inherent limitations and without robust human oversight. For example, a recent PCMag article details a significant incident where an AI agent unexpectedly deleted a live production database.

Reducing Token Generation Costs Using Caching

1. Prompt Caching

Many LLM providers charge based on the number of tokens processed. By caching frequently used prompts (typically the stable prefix of a prompt), the provider avoids reprocessing the same context repeatedly, which reduces latency and the cost billed for those tokens. As per OpenAI, “This can reduce latency by up to 80% and cost by up to 75%. Discounting for prompt caching is not available on the batch API but is available on the Scale Tier. With Scale Tier, any tokens that are spilled over to the shared API will also be eligible for caching.”
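
Because provider-side prompt caching generally applies to a stable prompt prefix, one practical step is to keep unchanging context at the front of every request. The sketch below uses the official openai Python client; the model name and the STATIC_GUIDELINES string are illustrative assumptions, not a definitive implementation.

```python
# A minimal sketch: place stable content first so the provider can cache
# the prompt prefix across requests. STATIC_GUIDELINES and the model name
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

STATIC_GUIDELINES = "...long, unchanging project context and coding standards..."

def ask_copilot(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: a cache-eligible model
        messages=[
            # Stable prefix: identical across requests, eligible for caching
            {"role": "system", "content": STATIC_GUIDELINES},
            # Variable suffix: the part that changes with each request
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content
```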

2. Request Caching

AI applications can first check their own cache for an identical request. If a match is found, the cached response can be retrieved, eliminating the need to send it to an AI model. If no match is found, the request is processed, the response is generated, and then stored in the cache for future use. Avoiding redundant model inference reduces the demand for expensive computational resources and can result in significant cost savings, particularly in environments with high API call usage.
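
A minimal sketch of exact-match request caching is shown below. It assumes a hypothetical call_model() function that performs the actual (billed) inference; only identical requests hit the cache.

```python
# A minimal sketch of request caching keyed on the exact request.
# call_model() is a hypothetical function that performs the paid inference.
import hashlib
import json

_request_cache = {}  # request hash -> cached response

def cached_completion(prompt: str, model: str = "gpt-4o-mini") -> str:
    # Hash the full request so only identical requests are considered a match
    key = hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
    ).hexdigest()
    if key in _request_cache:
        return _request_cache[key]        # cache hit: no model call, no tokens billed
    response = call_model(model, prompt)  # cache miss: pay for inference once
    _request_cache[key] = response
    return response
```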

3. Semantic Caching

With semantic caching, one can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn’t the same. Unlike traditional caching that relies on exact matches, semantic caching focuses on understanding the meaning behind developer queries. It utilizes AI embedding models to convert text into vector representations (embeddings), allowing systems to identify and reuse responses even if the phrasing differs but the intent remains the same. 
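
The sketch below illustrates the idea with a plain cosine-similarity lookup over cached embeddings. The embed() and call_model() functions and the 0.92 similarity threshold are assumptions; a production system would typically use a vector store rather than a Python list.

```python
# A minimal sketch of semantic caching: reuse an answer when a new prompt
# is close enough in meaning to a cached one. embed() and call_model()
# are hypothetical functions; the threshold is an illustrative assumption.
import math

_semantic_cache = []  # list of (embedding, response) pairs

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_completion(prompt: str, threshold: float = 0.92) -> str:
    query_vec = embed(prompt)
    best_score, best_response = 0.0, None
    for vec, response in _semantic_cache:
        score = _cosine(vec, query_vec)
        if score > best_score:
            best_score, best_response = score, response
    if best_response is not None and best_score >= threshold:
        return best_response                  # similar enough in meaning: reuse it
    response = call_model(prompt)             # otherwise pay for a fresh inference
    _semantic_cache.append((query_vec, response))
    return response
```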

4. Session-Level Caching

Each developer session can have its own dedicated cache. When a developer interacts with the AI model within a given session, relevant data and responses generated during that session can be stored in the session cache. If the developer repeats a query or asks a similar question within the same session, and if the answer is already cached, it can be served instantly, avoiding the need for the AI model to reprocess the request.
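
A minimal sketch of session-level caching follows, assuming a hypothetical call_model() function; each session keeps its own small cache that is discarded when the session ends.

```python
# A minimal sketch of session-level caching: one cache per developer session.
# call_model() is a hypothetical function that performs the paid inference.
from collections import defaultdict

_session_caches = defaultdict(dict)  # session_id -> {prompt: response}

def session_completion(session_id: str, prompt: str) -> str:
    cache = _session_caches[session_id]
    if prompt in cache:
        return cache[prompt]         # repeated question: served instantly
    response = call_model(prompt)    # first occurrence in this session
    cache[prompt] = response
    return response

def end_session(session_id: str) -> None:
    # Drop the per-session cache when the developer session closes
    _session_caches.pop(session_id, None)
```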

5. Output Caching

When an AI model receives a query or a prompt, it processes the input and generates an output or response. This generated output is then stored in a cache, along with the input that generated it. If a similar or identical query is received again, the system first checks the cache for a matching output, saving cost.
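
The sketch below shows one way to store generated outputs alongside the inputs that produced them, with a time-to-live so stale answers eventually expire. call_model() and the one-hour TTL are illustrative assumptions.

```python
# A minimal sketch of output caching with a time-to-live.
# call_model() is a hypothetical function that performs the paid inference.
import time

_output_cache = {}  # normalized input -> (response, stored_at)

def output_cached_completion(prompt: str, ttl_seconds: int = 3600) -> str:
    key = " ".join(prompt.split()).lower()   # light normalization of the input
    entry = _output_cache.get(key)
    if entry and time.time() - entry[1] < ttl_seconds:
        return entry[0]                      # fresh cached output: no cost
    response = call_model(prompt)            # regenerate and store the output
    _output_cache[key] = (response, time.time())
    return response
```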

Conclusion

By strategically implementing and managing caching in AI applications such as copilots, organizations can achieve substantial cost reductions while simultaneously enhancing performance and scalability. Developers can significantly improve response times, reduce computational overhead, and optimize resource utilization. The ideal caching approach, however, will vary based on usage requirements, workload patterns, and available infrastructure.
