Root Cause Analysis: RAG Query Latency and Timeouts
Date of Incident: January 10, 2026 – January 13, 2026
Document Date: January 13, 2026
Summary:
Between January 10 and January 13, the platform experienced intermittent periods of increased latency and timeouts affecting Retrieval-Augmented Generation (RAG) queries. This resulted in delayed responses and, in some instances, a failure to retrieve knowledge base documentation during conversations. The issue has been fully resolved following infrastructure upgrades.
Impact Assessment
User Experience: During the incident windows, some users experienced slower response times. In a subset of cases, the system could not retrieve the necessary context from the knowledge base, leading to generic responses or conversation errors.
Scope: The incident occurred in three distinct waves over the course of three days, with the most significant impact occurring on the first day.
Current Status: All systems are operational, and performance has returned to normal baselines.
Root Cause
The root cause of the latency was resource contention on the primary database cluster.
Trigger: A routine index update was initiated to accommodate recent data changes.
Contributing Factor: Due to significant growth in the volume of stored conversations and metadata, the index rebuild process consumed excessive CPU and RAM resources.
Architecture Constraint: At the time of the incident, live RAG query serving and background maintenance tasks shared the same compute pool. The index rebuild therefore starved live RAG queries of CPU and memory, causing the observed timeouts.
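The contention mechanism can be illustrated with a minimal queueing sketch (the task costs below are illustrative assumptions, not measured values): when a long-running index rebuild and live queries share one worker, query latency inherits the rebuild's runtime.

```python
def run_fifo(tasks):
    """Simulate a single shared worker draining tasks in FIFO order.

    Returns each task's completion time in arbitrary cost units.
    """
    elapsed = 0
    completion = {}
    for name, cost in tasks:
        elapsed += cost
        completion[name] = elapsed
    return completion

# Shared worker: a heavy index rebuild is queued ahead of a live RAG query,
# so the query waits for the entire rebuild to finish.
shared = run_fifo([("index_rebuild", 100), ("rag_query", 1)])

# Dedicated worker: the query runs on its own resources, unaffected.
dedicated = run_fifo([("rag_query", 1)])
```

Under the shared worker the query completes at time 101 instead of 1, which mirrors how a multi-hour rebuild pushed live query latency past client timeout limits.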
Resolution and Recovery
Our engineering team took the following steps to mitigate and resolve the issue:
Resource Upscaling: Initially, the database host capacity was increased to alleviate immediate CPU pressure.
Infrastructure Improvements: To provide a permanent fix, we configured dedicated resources to ensure live RAG search is never starved of compute.
Completion: The index rebuild completed under the new configuration by January 13, 06:03 UTC, at which point all latency issues ceased.
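One way to realize this kind of isolation at the application layer is to route live queries and background maintenance onto separate worker pools. This is a hedged sketch only; the pool names and sizes are hypothetical, not our production configuration.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools so background maintenance can never occupy query workers.
QUERY_POOL = ThreadPoolExecutor(max_workers=8, thread_name_prefix="rag-query")
MAINTENANCE_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="maintenance")

def submit_rag_query(question: str):
    # Live traffic always lands on the query pool.
    return QUERY_POOL.submit(lambda: f"retrieved context for: {question}")

def submit_index_rebuild(index_name: str):
    # Maintenance runs on its own, smaller pool and cannot block queries.
    return MAINTENANCE_POOL.submit(lambda: f"rebuilt index: {index_name}")

answer = submit_rag_query("reset password").result()
```

The same principle applies at the infrastructure level (dedicated hosts or CPU quotas for maintenance jobs); the fix deployed here was at the resource-allocation level rather than in application code.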
Corrective and Preventive Measures
Completed:
Architecture Isolation: Configured dedicated resources for RAG tasks to ensure background maintenance does not impact live traffic.
Moving Forward:
Added Redundancy: Implementing database improvements that distribute data storage and processing across additional nodes, improving scalability as data grows.
Enhanced Monitoring: Deploying granular alerts for RAG performance metrics (CPU/RAM/Disk) to identify resource contention before it impacts users.
Vendor Collaboration: Working with our database provider to ensure future index builds run without disrupting live workloads.
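The planned contention alerting can be sketched as a simple threshold check; the metric names and threshold values below are illustrative assumptions, not our actual alert policy.

```python
def contention_alerts(metrics, thresholds):
    """Return the names of metrics at or above their alert thresholds."""
    return [name for name, value in metrics.items()
            if value >= thresholds.get(name, float("inf"))]

# Hypothetical sample: CPU is saturated by a background job, RAM/disk are fine.
metrics = {"cpu_pct": 92.0, "ram_pct": 71.0, "disk_pct": 55.0}
thresholds = {"cpu_pct": 85.0, "ram_pct": 90.0, "disk_pct": 90.0}

alerts = contention_alerts(metrics, thresholds)
```

In this sample only `cpu_pct` breaches its threshold, which is exactly the early-warning signal that would have flagged the index rebuild before live queries began timing out.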