
We solved the "write code" problem with conversational AI. The "understand systems" problem is still stuck in polyglot queries and manually constructed narratives. Consider this contrast:
- Writing code (2024): "Create a payment service that handles retries and timeouts" → AI generates implementation with error handling
- Debugging code (2024): Improve latency on payment service → check Datadog for metrics → switch to Loki for logs → cross-reference deployment history → manually correlate timestamps → build a mental model of the payment sequence... and so on
At Resolve AI, we are making debugging conversational, something we like to call "Vibe Debugging". It collapses the entire hypothesis → evidence → validation loop into a conversation:
Look at this flow as an example: https://resolve.navattic.com/uuh07j2
Along the way, we realized that making Vibe Debugging possible hinges on logs: they are the most valuable evidence source, yet the most challenging to navigate.
Why are logs a valuable evidence source?
In conversational debugging, logs carry the highest value because they contain the ground truth. Metrics tell you what (latency increased), traces show where (bottleneck in service X), but logs explain why (connection pool exhausted). They're how engineers leave debugging breadcrumbs and where the actual failure reasons live. If you were to work through the same example as above manually, you would have to:
- Form a hypothesis: "Maybe it's a database issue?"
- Switch tools: Jump to database metrics
- Craft a query: Learn Grafana's query syntax
- Gather evidence: Stare at charts, look for anomalies
- Switch context: Jump to application logs
- Translate intent: Learn LogQL or whatever your log platform speaks
- Correlate manually: Try to connect timestamps across tools
- Synthesize information: Build a mental model of what happened
- Repeat: Hypothesis was wrong, start over
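The loop above can be sketched in code. Everything below is a hypothetical stand-in for human actions and hand-gathered observations, not any real tool's API:

```python
# Toy sketch of the manual hypothesis -> evidence -> validation loop.
# All names and data here are illustrative assumptions.

EVIDENCE = {  # pretend observations gathered from different tools
    "database issue": ["ERROR Max connections reached (100/100)"],
    "network issue": [],
}

def gather_evidence(hypothesis):
    """Stand-in for switching tools, crafting a query, and reading charts."""
    return EVIDENCE.get(hypothesis, [])

def investigate(hypotheses):
    """Walk the loop: each failed hypothesis means starting over."""
    for hypothesis in hypotheses:
        evidence = gather_evidence(hypothesis)
        if evidence:                      # evidence validates the hypothesis
            return hypothesis, evidence
    return None, []                       # every hypothesis was wrong: repeat

cause, proof = investigate(["network issue", "database issue"])
print(cause)  # -> database issue
```

The expensive part in practice is not the control flow, it's that every call to `gather_evidence` is a human context switch.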
Why is navigating logs a hard technical problem to solve?
Building AI agents for log investigation isn't just "ChatGPT for logs." Logs are fundamentally unstructured. Unlike metrics (time series) or traces (structured events), logs are free-form text with infinite variety. Every service logs differently. This creates a paradox: the most valuable debugging information is trapped in the least structured format.
Making logs conversational requires solving problems that traditional log analysis tools sidestep entirely.
- You need to translate intent to query: Every log investigation starts with a human question: "Why did checkout break?" But to answer that question, you need to become a translator:
  - Datadog: `service:payment-service status:error @timestamp:[now-1h TO now]`
  - Loki: `{app="payment"} |= "error" | json | __error__=""`
  - Splunk: `index=production source=payment earliest=-1h@h | search level=ERROR`
  And you're fighting semantic differences along the way. Is it `service.name`, `app`, or `component`? Does "error" mean log level, HTTP status, or exception presence?
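To make the translation burden concrete, here is a minimal sketch that renders one debugging intent as three platform-specific query strings. The field mappings below are illustrative assumptions, not a complete translator; a real one must learn each platform's actual schema:

```python
# Hypothetical intent-to-query translation for Datadog, Loki, and Splunk.
# Field names and syntax mirror the examples in the text.

def to_queries(service, level="error", window="1h"):
    """Render one debugging intent as three platform-specific queries."""
    return {
        # Datadog uses service:/status: facets plus a timestamp range
        "datadog": f"service:{service} status:{level} @timestamp:[now-{window} TO now]",
        # Loki selects a stream label, greps for the level, then parses JSON
        "loki": f'{{app="{service.split("-")[0]}"}} |= "{level}" | json | __error__=""',
        # Splunk scopes by index/source, then filters on log level
        "splunk": f"index=production source={service} earliest=-1h@h | search level={level.upper()}",
    }

queries = to_queries("payment-service")
print(queries["loki"])  # -> {app="payment"} |= "error" | json | __error__=""
```

Even this toy hard-codes a guess (`app` is the service name's first token) that breaks the moment one team names things differently.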
- Querying logs is not as simple as dumping them as context to an LLM. To understand the scale of this challenge, consider that even Text2SQL (converting natural language to database queries) remains largely unsolved despite years of research. The best GPT-4-based systems today achieve only ~60% accuracy¹ on structured database queries with well-defined schemas, and AI agents that layer iteration and error correction on top of such data can require up to 10 attempts² per query while still struggling with complex joins.
- If AI can't reliably handle structured database queries, debugging distributed systems with unstructured logs becomes exponentially harder. You're essentially doing Text2SQL on millions of unstructured text entries, with inconsistent formats across services, no predefined schema, and temporal correlations spanning hours or days.
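The iterate-and-correct workflow those agentic systems use can be sketched as a small loop. `generate` and `execute` below are toy stand-ins for an LLM and a query engine, and the 10-attempt budget mirrors the figure cited above:

```python
# Hypothetical generate -> execute -> correct loop, as used by agentic
# Text2SQL systems: run the query, feed errors back, retry up to a budget.

MAX_ATTEMPTS = 10  # the research cited above reports up to 10 refinement rounds

def refine_query(question, generate, execute):
    """Retry until a query succeeds or the attempt budget runs out."""
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        query = generate(question, feedback)   # propose (or repair) a query
        ok, result = execute(query)            # run against the real engine
        if ok:
            return query, result, attempt
        feedback = result                      # error message drives the next round
    return None, None, MAX_ATTEMPTS

# Toy stubs: the first attempt misspells a column, the error message fixes it.
def generate(q, feedback):
    return "SELECT cnt FROM orders" if feedback is None else "SELECT count FROM orders"

def execute(query):
    return ("count" in query, "no such column: cnt" if "cnt" in query else [42])

query, result, attempts = refine_query("how many orders?", generate, execute)
print(attempts)  # -> 2
```

With unstructured logs there is no schema to link against and no crisp "no such column" error to learn from, which is why the same loop degrades so badly there.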
- Real incidents span services and time: When you ask "Why did checkout break?", you are correlating error patterns across multiple services and building temporal causality. Consider this debugging scenario:
14:23:15 payment-service: ERROR Connection timeout to auth-db
14:23:15 auth-service: INFO Processing token validation
14:23:16 payment-service: WARN Retrying connection to auth-db
14:23:17 database-pool: ERROR Max connections reached (100/100)
14:23:18 payment-service: ERROR Transaction failed: unable to validate auth
You immediately see the story: auth service is overwhelming the database, causing payment failures. But extracting that narrative requires:
- Temporal correlation across multiple services
- Causal reasoning about database connection limits
- Domain knowledge about how payment validation works
- Pattern recognition that "connection timeout" + "max connections" = resource exhaustion
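A toy version of the correlation step, using the log lines above, might look like this. Real systems must handle clock skew, mixed formats, and volume that this sketch ignores:

```python
# Group error/warn entries from different services into one short time
# window to expose the cascade. Data is the scenario from the text.
from datetime import datetime, timedelta

LOGS = [
    ("14:23:15", "payment-service", "ERROR Connection timeout to auth-db"),
    ("14:23:15", "auth-service",    "INFO Processing token validation"),
    ("14:23:16", "payment-service", "WARN Retrying connection to auth-db"),
    ("14:23:17", "database-pool",   "ERROR Max connections reached (100/100)"),
    ("14:23:18", "payment-service", "ERROR Transaction failed: unable to validate auth"),
]

def cascade(logs, window_seconds=5):
    """Return error/warn entries, across services, inside one time window."""
    parsed = sorted((datetime.strptime(t, "%H:%M:%S"), svc, msg) for t, svc, msg in logs)
    start = parsed[0][0]
    return [(svc, msg) for ts, svc, msg in parsed
            if ts - start <= timedelta(seconds=window_seconds)
            and msg.split()[0] in ("ERROR", "WARN")]

for svc, msg in cascade(LOGS):
    print(f"{svc}: {msg}")
```

Note what this cannot do: the root cause (max connections) sits mid-cascade, so a naive "first error wins" heuristic would blame payment-service instead of the database pool. That last inference step is exactly the causal reasoning listed above.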
- Every organization, and sometimes even different teams within the same organization, has unique logging patterns: field names, error formats, service naming. Learning your specific patterns well enough to construct a coherent narrative is challenging: knowing, for example, that your "payment-svc" and "billing-service" are related, that connection timeouts in your auth service usually indicate database issues, or that certain error codes are noise while others are critical.
- Production systems generate millions of log entries. Feed everything to an LLM and you risk hallucination and will quickly exhaust the context window; sample instead and you risk missing the one critical error that explains everything.
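One common mitigation is to rank entries by signal before they reach the model, rather than dumping or blindly sampling. This sketch keeps errors and rare lines inside a fixed character budget; the scoring heuristic is an illustrative assumption, not a production design:

```python
# Hypothetical log triage: score each entry by severity and rarity,
# then keep the highest-signal lines that fit a context budget.
from collections import Counter

def budget_logs(entries, max_chars=500):
    """Keep the highest-scoring entries whose total length fits the budget."""
    templates = Counter(" ".join(e.split()[:3]) for e in entries)  # crude template key
    def score(e):
        severity = 2 if "ERROR" in e else 1 if "WARN" in e else 0
        rarity = 1.0 / templates[" ".join(e.split()[:3])]          # rare lines matter more
        return severity + rarity
    kept, used = [], 0
    for e in sorted(entries, key=score, reverse=True):  # highest signal first
        if used + len(e) <= max_chars:
            kept.append(e)
            used += len(e)
    return kept

noise = ["INFO request handled ok"] * 50
entries = noise + ["ERROR connection pool exhausted"]
print(budget_logs(entries, max_chars=80)[0])  # the lone error outranks the noise
```

The rarity term is the point: the one entry that explains everything is usually the one that looks like nothing else in the stream.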
- Lastly, the hardest part isn't finding errors; it's understanding which error caused the cascade. A memory leak in service A might trigger timeouts in service B, which overwhelms service C. Traditional correlation fails here: you need systems that understand how failures propagate through your specific architecture.
How are we using logs to make debugging conversational?
At Resolve AI, we are building a multi-agent system that provides a conversational interface to your production systems. The goal is to reimagine how we as engineers interact with our complex production systems.
![][image1]
Our multi-agent system uses:
- Knowledge agents, which learn how your production systems operate. As part of their objective, they combine structured metrics, semi-structured traces, unstructured logs, and even visual dashboards into a coherent understanding.
- Reasoning agents, which learn your specific system's patterns, not just general debugging heuristics. When you ask a question, they pursue multiple hypotheses and correlate evidence from the right data sources, showing the evidence trail so you can validate conclusions and learn from investigations.
- Action agents, which understand and operate your stack based on the conversation you are having with Resolve AI. They can interpret Grafana dashboards and follow your conventions to generate code suggestions, Git commits, or PRs in Cursor.
- Learning agents, which constantly learn from every interaction, investigation, and outcome to build institutional memory. Resolve AI is designed to continuously evolve as you use it.
If building multi-agent systems that can reason about distributed infrastructure at scale excites you, we'd love to talk. We're looking for engineers who want to define what the next generation of agentic AI looks like for software engineering.
About Resolve AI
Resolve AI is the agentic AI company for software engineering founded by the co-creators of OpenTelemetry. By combining our deep expertise in building developer tools and observability with state-of-the-art agentic AI, our mission is to increase engineering velocity by transforming the way engineers build, deploy, and maintain real-world software systems.
With Resolve AI, customers like Datastax, Tubi, and Rappi have increased engineering velocity and systems reliability by putting machines on-call for humans and letting engineers just code. Interested in learning more about our agentic AI approach to production systems? Say hello.
¹ DIN-SQL + GPT-4 achieves 60.0% exact set match accuracy on the Spider benchmark (Yale Spider Leaderboard, 2024). While execution accuracy can reach 85%+, exact match—getting the SQL query precisely right—remains challenging even for state-of-the-art models on structured database queries.
² Agentic Text2SQL systems can require "up to 10 iterations to refine a SQL query" with multi-stage workflows involving schema linking, candidate generation, self-correction, and evaluation (Hexgen-Text2SQL research, 2024). This iterative approach increases computational complexity while still struggling with accuracy on complex queries.
