Foursquare – Powering Reliable Data Platforms with Deductive AI

Executive Summary

Foursquare, a leader in location technology, processes billions of daily geolocation signals across a complex data ecosystem. The company faced significant challenges in diagnosing and resolving Spark job failures and performance regressions, leading to SLA breaches and escalating costs. Through a close partnership with Deductive AI, Foursquare integrated an AI-powered reasoning layer into its data platform—aligning Deductive’s reasoning capabilities with Foursquare’s deep knowledge of its data and infrastructure.

While the industry raced toward coding agents, our partnership with Deductive took a different path: bringing agentic AI directly into data operations. Together, we built a system that unifies our telemetry streams with Deductive’s reasoning engine, creating an AI-powered frontline agent that proactively manages our data pipelines—detecting anomalies, diagnosing root causes, and accelerating resolution.
- Vikram Gundeti, CTO, Foursquare

Across several key pipelines, this collaboration resulted in up to a 90% reduction in time to diagnose and resolve Spark job issues (from hours or days to under 10 minutes), a 60% reduction in EMR compute costs, and over $275,000 in annual savings. This joint case study reflects how modern AI systems can be shaped through partnership and iteration to fit the realities of production-scale environments.

Foursquare at Scale

Foursquare, a pioneer in location technology, processes billions of geolocation signals daily to fuel insights across marketing, logistics, and customer engagement. Its platform powers attribution, segmentation, and targeting pipelines for some of the world’s most data-driven enterprises.

To maintain this precision, Foursquare operates tens of thousands of data pipelines across multiple cloud environments. Each pipeline directly affects downstream business outcomes – data freshness, model accuracy, and cost efficiency. As the ecosystem continues to grow, ensuring reliability and performance across these interdependent systems has become a central challenge.

The Challenge: Diagnosing the Undiagnosable

Foursquare’s data platform team manages a deeply interconnected stack built on Apache Airflow, AWS EMR, S3, Athena, and the Hive Metastore. When Spark jobs failed or query runtimes regressed, diagnosing the cause required combing through hundreds of GBs of logs, metrics, and job histories scattered across tools. The team faced three recurring pain points:

Keeping data landing times within strict SLA windows
Maintaining cost-efficient query plans and cluster configurations
Navigating highly coupled dependencies where small inefficiencies cascaded into large downstream effects

Despite mature observability systems, the missing layer was reasoning—the ability to connect telemetry signals to the code and configuration context. Finding the why behind incidents often took hours, and sometimes, the root cause remained hidden. Recognizing this gap, Foursquare’s team worked closely with Deductive to define the signals, integrations, and workflows that would make automated reasoning practical and effective for their environment.

The Solution: Building an AI-Powered Data Platform On-Call

To overcome these challenges, Foursquare partnered with Deductive AI to build an AI-powered Data Platform On-Call. This reasoning layer acts like a seasoned data engineer embedded directly into their infrastructure. Deductive integrated deeply with:

GitHub for code, deployments, and configuration context
Apache Airflow, AWS EMR, Athena, S3, Hive Metastore, and Prometheus for real-time telemetry
Slack for on-call collaboration and feedback loops

With these integrations, Deductive unified code, data, and operational telemetry into a single reasoning layer. It could trace incidents across dependencies, interpret query plans, detect data drift, and surface actionable insights in real time, effectively serving as the AI teammate supporting Foursquare’s on-call engineers. The result was a unified reasoning layer that reflected both teams’ engineering input, combining Deductive’s AI reasoning with Foursquare’s operational context and expertise.

How It Works: From Signal to Suggestions in Minutes

When a Spark job failed or ran anomalously long, Deductive automatically correlated the surrounding context, including logs, metrics, query plans, and recent commits, to generate a precise hypothesis. As alerts fired from Foursquare's observability stack (Airflow, AWS CloudWatch, PagerDuty, Slack), Deductive got to work within seconds, automatically analyzing both telemetry and code to surface the root cause. Foursquare’s engineers actively guided this process by providing real-world feedback and refinement on the system’s early analyses, helping shape how it prioritized and interpreted telemetry.
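To make the correlation step concrete, here is a minimal sketch of gathering everything that happened shortly before a failure. It is an illustration under stated assumptions, not Deductive's implementation: the `Event` type, the `context_for_failure` function, the six-hour lookback, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One piece of context: a commit, a log finding, or a metric anomaly."""
    source: str          # illustrative labels such as "github" or "prometheus"
    timestamp: datetime
    summary: str


def context_for_failure(failed_at, events, lookback=timedelta(hours=6)):
    """Return events inside [failed_at - lookback, failed_at], newest first.

    The lookback window is an assumed tuning knob, not a documented value.
    """
    window_start = failed_at - lookback
    in_window = [e for e in events if window_start <= e.timestamp <= failed_at]
    return sorted(in_window, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data for illustration only.
failed_at = datetime(2024, 1, 10, 12, 0)
events = [
    Event("github", datetime(2024, 1, 10, 9, 30), "commit: change join key"),
    Event("prometheus", datetime(2024, 1, 10, 11, 45), "executor memory spike"),
    Event("github", datetime(2024, 1, 9, 8, 0), "commit: bump dependency"),
]

for event in context_for_failure(failed_at, events):
    print(event.source, event.timestamp.isoformat(), event.summary)
```

A time window is the simplest possible filter. A production system would presumably also weigh which code paths and tables a commit touched, but that logic is not described in this case study.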

For each investigation, Deductive’s on-call agent analyzes the Spark execution plan, the associated driver and executor logs and metrics, and the pipeline code to understand why the job is expensive, not just where it failed. In one investigation, Deductive identified a costly shuffle operation as the primary contributor to high-impact Spark stages. The system cross-verified this finding by analyzing the table schema and layout in the Hive Metastore, then calculated disk pressure, executor utilization, memory footprint, and shuffle partitions. Using this context, Deductive proposed targeted optimizations, validated them against historical runs, and ruled out configurations that could cause regressions. Working alongside Deductive, Foursquare’s engineers continuously validated these insights and verified improvements through production testing.
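As one example of the kind of calculation involved in tuning shuffle partitions, the sketch below sizes the partition count from the volume of shuffled data. It uses a common rule of thumb (aim for roughly 128 MiB per shuffle partition) rather than anything this case study attributes to Deductive. The function name, the target size, and the 512 GiB stage are assumptions for illustration. The result would typically be applied through Spark's `spark.sql.shuffle.partitions` setting.

```python
def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024**2,
                               min_partitions=1):
    """Suggest a shuffle partition count for a stage.

    Heuristic only: divide the shuffled bytes by a target partition size
    and round up. The 128 MiB default is a rule of thumb, not a measured
    optimum for any particular cluster.
    """
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    needed = -(-shuffle_bytes // target_partition_bytes)
    return max(min_partitions, needed)


# Hypothetical stage that shuffles 512 GiB of data.
partitions = suggest_shuffle_partitions(512 * 1024**3)
print(partitions)  # 4096
```

A count like this is only a starting point. As the text notes, any proposed setting still has to be checked against historical runs, since too many small partitions add scheduling overhead while too few can spill to disk.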

Engineers received this analysis directly in Slack, complete with linked traces, configuration diffs, and optimization recommendations. This feedback loop between teams enabled the system to mature quickly and adapt to Foursquare’s evolving data platform. What previously required hours of manual debugging was reduced to clear, actionable insights delivered in minutes.

Deductive’s learning loop continuously improves through feedback gathered from Slack interactions and user confirmations. Over time, it refines its understanding of Foursquare’s data patterns, query plans, and operational history, becoming a living, learning on-call partner that understands both the code and the data behind it. Across several key pipelines, this automation has led to up to a 90% reduction in time to diagnose and resolve Spark job issues (from hours or days to under 10 minutes) and a 60% reduction in EMR compute costs, resulting in over $275,000 in annual savings.

Deductive AI has completely changed how our data platform team operates. What used to take hours of guesswork now takes minutes, and in many cases, the insights are ones we wouldn’t have reached manually. We have already realized significant cost savings in the ballpark of $275,000/year.
- Changliang Cao, Data Platform Lead, Foursquare

Deductive’s Take: Why This Partnership Matters

Through reinforcement learning and human-in-the-loop design, Deductive transformed on-call firefighting into proactive optimization. Foursquare’s engineers now spend more time improving systems and less time reacting to them. The collaboration with Foursquare exemplifies the next phase of reliability engineering, where AI doesn’t just monitor systems but learns them. Deductive’s multi-agentic reasoning, fused with Foursquare’s scale, demonstrates that modern data platforms can achieve both reliability and velocity without trade-offs.

Foursquare’s collaboration elevated Deductive’s reasoning capabilities in meaningful ways, and their success shows how AI can elevate observability into understanding. Their team’s openness to embed reasoning into their workflows pushed Deductive’s capabilities further than ever.
– Pratyush Verma, Founding Engineer, Deductive AI
