Vignesh Ravichandran • Jun 23, 2025
Changing Observability in the Age of AI: How Smart Tools Are Transforming Infrastructure Monitoring

The Intelligence Revolution in Infrastructure Monitoring
The world of observability is undergoing a massive transformation, driven by the shift towards smarter, AI-powered solutions. As someone who’s spent over 10 years in infrastructure, primarily focused on databases, I’ve seen first-hand how observability challenges like alert fatigue, fragmented telemetry, and escalating costs continue to plague organizations.
After speaking with several SREs and engineering leaders, it’s clear that the tools they use are evolving—and AI is leading the charge.
This post will explore the key trends in observability, how AI is reshaping the landscape, and what it means for developers, teams, and businesses at large.
What Attracts Developers to Modern Observability Solutions?
Ease of Use and Integration (especially with OpenTelemetry)
Just add a few lines to your manifest and automatically start generating spans and metrics. Telemetry is just there by default.
Modern observability tools are drastically simplifying telemetry collection. The traditional process of instrumenting systems by hand—writing custom code to generate metrics or logs—is quickly becoming a thing of the past. With auto-instrumentation, developers no longer need to write that code themselves, which removes the mental load and ensures that telemetry is “just there by default.”
For startups, where time is incredibly expensive, this ease of use is invaluable. By removing the complexity of manual instrumentation, developers can focus on building features and delivering value to customers, rather than spending time on monitoring and debugging.
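To make the idea concrete, here is a minimal stdlib-only sketch of what auto-instrumentation does conceptually: a wrapper emits a span (name and duration) for every call, so the function author writes no telemetry code at all. Real agents such as OpenTelemetry's do this transparently at import time and export to a backend; the names here are illustrative.

```python
import functools
import time

# Collected spans; a real SDK would export these to a tracing backend.
SPANS = []

def auto_instrument(func):
    """Wrap a function so it emits a span automatically --
    the function author writes no telemetry code at all."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            SPANS.append({
                "name": func.__name__,
                "duration_ms": (time.monotonic() - start) * 1000,
            })
    return wrapper

@auto_instrument
def handle_request(user_id):
    # Hypothetical business logic -- untouched by instrumentation concerns.
    return f"profile for {user_id}"

handle_request("u-42")
print(SPANS[0]["name"])  # handle_request
```

In practice the wrapping happens in the agent, not in your source: that is the “few lines in your manifest” experience the quote describes.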
Intelligent Context and Automation
It’s not raw AI—it’s automation intelligence. Tools act as copilots, not replacements.
The shift is moving toward “automation intelligence” rather than raw AI. Tools are now designed to act as “copilots” for engineers, automating tedious tasks, providing immediate context, and narrowing down problem spaces quickly. This is a significant step up from older systems that simply reported raw data without offering any insight.
Modern observability solutions automatically present SLOs (Service Level Objectives) and analyze them when they drift. This enables teams to stay ahead of potential issues and proactively address problems before they escalate.
Cost Efficiency and Data Ownership
Store raw observability data in customer-owned S3 data lakes. Ingestion becomes effectively free.
One of the biggest draws of modern observability solutions is their cost efficiency. Traditional observability tools charge based on the volume of data ingested, which can lead to sky-high costs. However, many newer solutions allow organizations to store raw observability data in customer-owned S3 data lakes.
This model makes ingestion “effectively free” and shifts to a “pay-for-read” pricing structure where businesses only pay for value-added services. This fundamentally changes the cost structure, making observability more accessible and sustainable.
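A back-of-the-envelope sketch shows why the pay-for-read model changes the economics. All rates below are made-up illustrative numbers, not any vendor's actual pricing; the only structural assumption from the text is that ingestion into customer-owned S3 is effectively free and cost accrues on storage plus the slice of data actually queried.

```python
# Hypothetical rates for illustration -- real vendor pricing varies widely.
INGEST_RATE_PER_GB = 0.10        # traditional model: pay for every GB ingested
READ_RATE_PER_GB = 0.005         # pay-for-read: pay only for data queried
S3_STORAGE_PER_GB_MONTH = 0.023  # commodity object-storage rate

def monthly_cost_ingest_priced(ingested_gb):
    """Traditional pricing: cost scales with everything you send."""
    return ingested_gb * INGEST_RATE_PER_GB

def monthly_cost_pay_for_read(ingested_gb, read_fraction):
    """Customer-owned S3: ingestion is effectively free; you pay
    storage plus reads on the fraction of data you actually query."""
    return (ingested_gb * S3_STORAGE_PER_GB_MONTH
            + ingested_gb * read_fraction * READ_RATE_PER_GB)

ingested = 50_000  # 50 TB of telemetry per month
print(monthly_cost_ingest_priced(ingested))       # 5000.0
print(monthly_cost_pay_for_read(ingested, 0.02))  # 1155.0
```

The point is not the exact numbers but the shape of the curve: under pay-for-read, cost tracks the value you extract (queries), not the raw volume you emit.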
Modern Observability vs. Alternatives
Simplicity and Cost-Effectiveness
Many companies find Datadog or New Relic prohibitively expensive, especially for tracing, and often don’t even utilize advanced features like APM.
When companies compare modern observability solutions to traditional, expensive commercial tools, the newer approaches are clearly both more user-friendly and cost-efficient. This drives teams toward OpenTelemetry and open-source backends, which offer flexibility without the hefty price tag.
Beyond Application-Centric Views
Older tools often provide a very application-centric view, which means they focus only on the application itself and ignore the infrastructure it’s running on—such as Kubernetes clusters. This partial view leads to gaps in visibility.
Modern observability solutions, however, aim for comprehensive visibility across the entire stack, from application to infrastructure, ensuring that teams get the full picture.
Actionable Insights vs. Noise
We’re aiming for 83% true positive rate in alerts by learning from past occurrences and suppressing non-actionable noise.
One of the most pressing issues in observability today is alert fatigue. Teams are overwhelmed with countless alerts, many of which are either irrelevant or non-actionable. New observability tools tackle this by focusing on high true positive rates, learning from past occurrences and suppressing non-actionable alerts.
This helps restore trust in the alerting system, ensuring that engineers are not bombarded with noise, but are instead given alerts that are relevant and actionable.
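One simple way to implement “learning from past occurrences” is to track, per alert signature, how often engineers actually had to act, and suppress signatures that have proven noisy. The sketch below uses only the stdlib and synthetic data; the signatures and threshold are hypothetical.

```python
from collections import defaultdict

# Historical outcomes: (alert signature, did an engineer actually act?).
# Synthetic data for illustration only.
history = [
    ("disk_full:db-1", True), ("disk_full:db-1", True),
    ("cpu_spike:web-3", False), ("cpu_spike:web-3", False),
    ("cpu_spike:web-3", True), ("pod_restart:job-9", False),
]

def actionability(history):
    """Per-signature true-positive rate learned from past outcomes."""
    counts = defaultdict(lambda: [0, 0])  # signature -> [actionable, total]
    for signature, was_actionable in history:
        counts[signature][1] += 1
        counts[signature][0] += was_actionable
    return {sig: a / t for sig, (a, t) in counts.items()}

def should_page(signature, rates, threshold=0.5):
    # Unknown signatures page by default; known noisy ones are suppressed.
    return rates.get(signature, 1.0) >= threshold

rates = actionability(history)
print(should_page("disk_full:db-1", rates))     # True  (2/2 actionable)
print(should_page("pod_restart:job-9", rates))  # False (0/1 actionable)
```

Production systems would add decay for stale history and guardrails so that genuinely new failure modes always page, but the core loop—measure actionability, then filter—is this simple.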
Leveraging Institutional Knowledge
Generic AI models can’t understand your organization’s domain knowledge—historical context, Slack communications, Jira tickets, and postmortem documentation.
Modern observability tools stand apart from generic AI models by learning from an organization’s domain knowledge—including historical context, Slack communications, Jira tickets, and postmortem documentation. This enables tools to provide accurate insights based on the organization’s unique environment and processes, addressing the challenge of critical knowledge residing with specific individuals.
This also speaks to the challenges many startups face when trying to break into the market. Large incumbents like Datadog enjoy a significant moat because of how easy they are to adopt. The “nobody gets fired for buying IBM” mentality comes into play here—Datadog, while an exceptional product, can be set up by a team in just a few clicks.
Teams get started quickly and forget about it, until they scale and suddenly realize they’re locked into the platform. That fear of vendor lock-in at the CIO/CTO level is a huge barrier for startups looking to break through with an alternative, and unwinding it can be incredibly difficult.
Key Use Cases for AI in Observability
Accelerated Incident Triage and Root Cause Hinting
AI acts as a tier 0 assistant, cutting time to recovery by 80% during those 3 AM pager notifications.
AI-driven observability tools act as a “tier 0” assistant, quickly identifying what’s happening, what’s affected, and what’s causal. By providing this immediate context, AI can cut time to recovery by 80%, enabling engineers to respond quickly during high-stress situations like a 3 AM pager notification.
Automated SLO Management and Proactive Monitoring
Systems can now automatically compute SLOs from real traffic data and notify SREs when these SLOs drift. By providing analysis of potential issues before they escalate, AI-driven observability tools enable teams to stay ahead of performance degradation, offering a proactive monitoring solution rather than reactive troubleshooting.
Dynamic Playbook Generation and Knowledge Transfer
AI can also generate and update playbooks based on historical failures and informal communication channels like Slack. This ensures that critical operational knowledge is always accessible and up-to-date, helping teams quickly address recurring issues without starting from scratch each time.
Cost Optimization through Data Ownership
By allowing organizations to store telemetry in S3 data lakes, companies can significantly reduce egress costs. This flexible setup enables businesses to integrate various observability tools and AI applications without worrying about vendor lock-in, while still maintaining control over their data and costs.
Synthetic Chaos for Training
Some solutions are taking a proactive approach by introducing chaos studios. These studios generate synthetic telemetry to simulate failures in a safe environment, allowing teams to practice incident response and test observability coverage without the risk of affecting production systems.
Areas for Improvement
Usability of Raw OpenTelemetry
OpenTelemetry setup often turns a ‘5-minute setup’ into several days of work. The documentation can be daunting.
While OpenTelemetry is a powerful tool, its setup and documentation can be daunting for many teams. Productizing the usability of this tool is crucial for wider adoption.
AI Trust and Controllability
Many engineers are hesitant to trust AI with high-stakes tasks like rolling back deployments. Solutions must operate in a copilot mode, allowing for human oversight to build trust in the system before handing over full control.
Multi-Language Support for AI
For global teams, AI models need to be effective across multiple languages. Since much of an organization’s knowledge resides in informal communication channels like Slack chats (which might be in languages like Portuguese or Japanese), AI solutions need to support a broad array of languages to be truly effective.
Integration with Existing Proprietary Stacks
Despite the growing popularity of OpenTelemetry, large enterprises with existing proprietary observability stacks (e.g., Dynatrace) still require integration with these legacy systems for broader adoption. Ensuring seamless integration will be key for organizations looking to make the transition.
Bridging the Context Gap in Centralized Ops
Large organizations often have centralized ops teams that lack sufficient context when troubleshooting issues across thousands of microservices. AI can help bridge this context gap, but solving this challenge requires more than just technical solutions—it involves aligning workflows and responsibilities across teams.
The Future of Intelligent Observability
The trends outlined in this post are already shaping observability across organizations. Companies like Observe, Chronosphere, Baselime, and Better Stack are pioneering innovative approaches to solving these challenges.
The shift toward AI-powered observability isn’t just about better tools—it’s about fundamentally changing how teams interact with their infrastructure. From reducing alert fatigue to enabling proactive monitoring, these solutions are making observability more intelligent, accessible, and cost-effective.
What Are Your Thoughts?
How are you handling observability and incident management in your organization? Have you experienced similar challenges or seen other trends emerging?
The conversations we’ve had with engineering leaders reveal a clear pattern: teams want observability tools that understand their unique context, reduce operational overhead, and provide actionable insights without breaking the bank.
As AI continues to evolve, the observability landscape will likely see even more dramatic changes. The key will be finding the right balance between automation and human control, ensuring that these intelligent systems enhance rather than replace human judgment.
Big thanks to Ishan Shah, Sriram Panyam, and Mohinish Shaikh for sharing their time and insights while shaping this post.
More blogs for you
Rappo • Vignesh Ravichandran • Jun 20, 2025
How Adaptive Used Rappo to Validate Its US Expansion, Product, and Ideal Customer—With Engineering Champions Who've Built It All Before
Discover how Adaptive leveraged Rappo to validate their US market expansion, refine their ideal customer profile, and build product features that matter—through conversations with engineering champions.

Rappo • Vignesh Ravichandran • Jun 18, 2025
Why People Are Still Using Datadog in 2025: Real Conversations and Hard Truths from Engineering Teams
Discover why Datadog remains the dominant monitoring choice for engineering teams in 2025, despite its costs and limitations, through honest conversations with seasoned engineers.
