How Kumo AI and Deductive AI Are Partnering to Build Reliable Model Training Workflows

Executive Summary

Kumo AI, a platform for building predictive models from relational data, faces significant challenges in diagnosing failures within its complex model training workflows. These workflows, orchestrated by Temporal (temporal.io), are critical for delivering accurate predictions to customers. When failures occur, Kumo’s engineers can spend hours manually tracing nested activities and correlating telemetry data to find the root cause, which slows down resolution. To address this, Kumo is partnering with Deductive AI to integrate an AI-powered on-call agent that automates root cause detection. The agent is designed to trace workflow graphs, correlate data from multiple systems, and help pinpoint the root cause of failures within minutes. The goal of this collaboration is to reduce triage time and improve visibility into Kumo’s ML operations, allowing engineers to focus more on resolution than diagnosis.

Kumo and Deductive are exploring how reasoning agents can be embedded into model-training systems to automatically diagnose failures. It’s an early step toward a new class of intelligent infrastructure that helps engineers anticipate issues before they escalate.
- Hema Raghavan, VP of Engineering, Kumo AI

Diagnosing Model-Training Failures in Minutes with AI-Powered Root Cause Analysis

Kumo AI is transforming how businesses leverage their data by enabling customers to build powerful prediction and embedding models directly from relational data. Kumo users get instantaneous predictions directly from their data warehouse, without the heavy lift of feature engineering or complex ML pipeline setup. Behind the scenes, Kumo orchestrates complex, multi-step workflows to train models for its customers. These workflows manage dependencies, retries, and nested activities across multiple components. Ensuring these workflows run reliably is critical to delivering accurate predictions for customers. Even though failures are rare, tracing their root cause remains complex. This ongoing challenge is what motivated Kumo to explore how Deductive could accelerate diagnosis and recovery, ultimately strengthening reliability and trust.

The Challenge: Understanding Why Training Workflows Fail

Kumo orchestrates training jobs using Temporal (temporal.io), a platform for defining multi-step workflows that are represented as directed graphs. When a training job fails, the cause is often a failure in a deeply nested child activity, whose error cascades upward and brings down the entire workflow.
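
To illustrate the cascade, a failed workflow can be modeled as a tree and walked from the top-level job down to the deepest failed step. The sketch below is an assumption-laden illustration, not Temporal's API or Kumo's code: the `Node` class, the status strings, and the step names are all hypothetical, and it follows only the first failed child at each level.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One workflow or activity execution (hypothetical shape, not a Temporal type)."""
    name: str
    status: str = "COMPLETED"  # assumed values: "COMPLETED" or "FAILED"
    children: List["Node"] = field(default_factory=list)


def failure_path(node: Node) -> Optional[List[str]]:
    """Return names from `node` down to the deepest failed descendant, or None."""
    if node.status != "FAILED":
        return None
    for child in node.children:
        sub = failure_path(child)
        if sub is not None:
            # A failed child explains this node's failure, so keep descending.
            return [node.name] + sub
    # No failed child: the failure originated at this node.
    return [node.name]


# Hypothetical training workflow with one deeply nested failure.
training = Node("train_model", "FAILED", [
    Node("load_tables"),
    Node("build_graph", "FAILED", [
        Node("sample_subgraphs"),
        Node("connect_warehouse", "FAILED"),
    ]),
])

print(" -> ".join(failure_path(training)))
# prints: train_model -> build_graph -> connect_warehouse
```

Only the last name on the path is the origin of the failure. Every node above it failed as a consequence, which is why the top-level error alone rarely identifies the cause.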

Debugging these failures requires Kumo engineers to trace the workflows manually, correlate each step with its telemetry in Grafana/Loki, and determine whether the failure was transient, systemic, or due to invalid input. With multiple retries, nested activities, and dependencies, even identifying the real failure can take hours.
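
The three-way distinction between transient, systemic, and invalid-input failures can be approximated from an activity's retry history. The following is a minimal heuristic sketch under stated assumptions: the `Attempt` record, the error names in `INPUT_ERRORS`, and the rules themselves are hypothetical and are not how Kumo or Deductive classify failures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One retry attempt of an activity (hypothetical record, not a Temporal type)."""
    succeeded: bool
    error_type: str = ""  # e.g. "TimeoutError"; empty when the attempt succeeded


# Assumed error names that point to bad input rather than a platform fault.
INPUT_ERRORS = {"ValidationError", "SchemaMismatchError"}


def classify(attempts: List[Attempt]) -> str:
    """Label a retry history with a rough failure category."""
    failures = [a for a in attempts if not a.succeeded]
    if not failures:
        return "ok"
    if any(a.error_type in INPUT_ERRORS for a in failures):
        return "invalid_input"  # retrying cannot fix bad input
    if attempts[-1].succeeded:
        return "transient"      # failed at least once, then recovered on retry
    return "systemic"           # retries exhausted without a success


print(classify([Attempt(False, "TimeoutError"), Attempt(True)]))
print(classify([Attempt(False, "ValidationError")]))
print(classify([Attempt(False, "TimeoutError")] * 3))
```

In practice the signal would come from logs and metrics rather than a clean list of attempts, which is part of what makes the manual version slow.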

Specialized on-call engineers serve as the first line of defense, spending significant time reconstructing failure chains and routing the problem to the correct owners. This triage step has become a significant bottleneck, slowing Kumo’s response to customer-impacting issues.

The Solution: Automated Failure Analysis with Deductive AI

To address this bottleneck, Kumo is collaborating with Deductive AI to build an ML platform on-call agent that can automatically investigate failing training workflows. The joint effort explores how such an agent could trace the entire workflow graph, correlate telemetry from multiple systems, and surface the most likely points of failure — helping engineers resolve issues quickly.

Deductive promises to democratise visibility into our training infrastructure that has a steep learning curve today. We are excited to partner with Deductive and believe it can automatically connect the dots across hundreds of workflow executions and point us to the root cause of any failed training workflow.
- Virajith Jalaparti, Head of Infrastructure, Kumo AI

Through deep integrations with Temporal, Grafana, Loki, and Slack, Deductive fits naturally into Kumo’s existing operational workflow. When a training job fails and an alert is posted to Slack, Deductive is designed to automatically:

Trace the entire Temporal workflow graph recursively to map out every job executed as part of the workflow.
Correlate execution data with logs and metrics from Loki and Grafana to determine the exact job or sub-process that triggered the failure.
Cross-reference Kumo’s codebase and deployment metadata to determine if the issue originated from a code/deployment change, configuration update, or customer error.
Deliver a clear, actionable summary in Slack, along with a link to a detailed investigation report.
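
The last two steps, cross-referencing deployment metadata and producing a summary, can be sketched in a few lines. This is an illustrative sketch only: the `Deployment` fields, the six-hour lookback window, the service names, and the message wording are assumptions, and the code builds a plain string rather than calling any Slack or Deductive API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """A deployment record (hypothetical fields)."""
    service: str
    version: str
    deployed_at: datetime


def suspect_deployment(
    failed_at: datetime,
    deployments: List[Deployment],
    window: timedelta = timedelta(hours=6),  # assumed lookback window
) -> Optional[Deployment]:
    """Return the most recent deployment within `window` before the failure."""
    recent = [
        d for d in deployments
        if timedelta(0) <= failed_at - d.deployed_at <= window
    ]
    return max(recent, key=lambda d: d.deployed_at, default=None)


def summarize(workflow: str, failed_step: str, suspect: Optional[Deployment]) -> str:
    """Build a short plain-text summary suitable for a chat channel."""
    lines = [f"Workflow {workflow} failed at step {failed_step}."]
    if suspect is None:
        lines.append("No deployment in the lookback window; check configuration or input.")
    else:
        lines.append(
            f"Possible cause: {suspect.service} {suspect.version} was deployed shortly before."
        )
    return "\n".join(lines)


# Hypothetical data for illustration.
failed_at = datetime(2025, 1, 10, 12, 0)
deployments = [
    Deployment("trainer", "v1.4.2", datetime(2025, 1, 10, 9, 30)),
    Deployment("sampler", "v0.9.0", datetime(2025, 1, 8, 9, 0)),
]
suspect = suspect_deployment(failed_at, deployments)
print(summarize("train_model", "connect_warehouse", suspect))
```

A recent deployment is only a candidate cause. Timing alone does not establish that the change triggered the failure, so a summary like this is a starting point for the engineer rather than a verdict.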

Automated incident summary from Deductive AI diagnosing a failed ML training job, pinpointing missing Databricks credentials and mapping the failure cascade across workflows.

The agent continuously improves through feedback loops — learning from confirmed root causes, resolution notes/postmortems, and engineer responses. Over time, Deductive builds an internal model of Kumo’s environment and failure modes, enabling faster, more accurate diagnosis with each incident.

Deductive AI synthesizes telemetry, metrics, and code context from Kumo AI's systems to reveal cross-service dependencies and accelerate root-cause learning.

The Impact

Through this collaboration, Kumo and Deductive aim to drastically reduce triage time and improve visibility into the health of Kumo’s training system. Early results have shown promising improvements in how quickly engineers can understand and respond to failures, with clearer explanations and better contextual insights.

Deductive has helped me identify the part of our training workflow that led to the specific failure being investigated and route the failure to the right team, without deep investigation!
- Aleksandar S. Sokolovski, Technical Program Manager, Kumo AI

Beyond time savings, Deductive brings transparency and predictability to complex ML operations. Engineers can now see why a training workflow failed – not just where – turning reactive debugging into proactive reliability engineering.

Deductive’s Take: AI Debugging for AI Workloads

Kumo represents the future of AI infrastructure – where the systems training machine learning models are themselves assisted by intelligent reasoning agents. Watching Deductive diagnose failures in real time across Temporal workflows feels like seeing AI debug AI.
- Madeline Cripps, Founding Engineer, Deductive AI

Kumo’s model-training pipelines exemplify the complexity of modern AI infrastructure – multi-layered, dynamic, and data-driven. By embedding Deductive’s reasoning agent directly into these systems, Kumo is turning the debugging process into an AI-assisted investigation loop that learns continuously, helping its engineers move from firefighting to foresight.