The Qive team came to share here

mansplainer
João Brito

TL;DR: We took a PHP monolith (Zend), instrumented it with OpenTelemetry with a focus on tracing, standardized span names and attributes, integrated it with Grafana Tempo and Loki, and saw measurable gains (lower p95 and less instability). Here are the practical decisions, the pitfalls, and a checklist you can reuse, without panic and without rewriting the system.
Context: why stir up a hornet's nest?
Every team has that service that "works... until it doesn't". In our case, a PHP monolith with legacy dependencies, critical endpoints, and erratic behavior under load. We needed end-to-end visibility to answer three simple questions: what is slow, why is it slow, and how much does it cost to speed it up. Instead of a closed APM, we bet on OpenTelemetry (OTel) to capture traces, metrics, and logs in an open and portable way.
Why OpenTelemetry in PHP (and legacy systems)?
Market standard: SDKs and shared semantics with other languages.
Neutrality: no vendor lock-in; we could send data to Grafana Tempo without friction.
Controllable cost: sampling, filtering, and enrichment gave us the autonomy to balance noise, latency, and the storage bill.
“But it's Zend, not Laravel…”
Yes, and that's fine. We started with auto-instrumentation for HTTP, DB, and some common frameworks; where there was no ready hook, we added manual spans in the most critical flows. The secret was to standardize names/attributes so that the charts made sense to both devs and SREs.
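To make the naming point concrete, here is a minimal, language-neutral sketch in Python (not the PHP SDK, and not the team's actual code). It assumes a common convention of naming HTTP server spans by method plus route template, so that `/orders/123` and `/orders/456` end up in the same chart series. The `span_name` helper and the `{id}` placeholder are hypothetical names chosen for illustration.

```python
import re

# Matches canonical UUIDs such as 3f2b8c1e-1a2b-4c3d-8e9f-0123456789ab.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name like 'GET /orders/{id}/items'."""
    # Drop the query string, then inspect each path segment.
    segments = []
    for seg in path.split("?")[0].strip("/").split("/"):
        if seg.isdigit() or _UUID.match(seg):
            # Variable identifiers collapse into one placeholder.
            segments.append("{id}")
        else:
            segments.append(seg)
    return f"{method.upper()} /" + "/".join(segments)


print(span_name("get", "/orders/123/items?page=2"))
```

When the framework router already exposes the matched route template, that value is a better source than pattern-matching on the raw path; the sketch only covers the case where no such hook exists.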
Costs, overhead, and what really matters
Overhead: acceptable when sampling is well-calibrated. "Hot" endpoints require lower sampling.
Storage: retention by criticality. Traces tied to incidents are tagged so they get longer retention.
CPU/memory: watch out for SDK serialization and flushing; we preferred gRPC due to its footprint and native backoff.
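The sampling trade-off can be sketched as a per-route ratio applied deterministically to the trace ID, similar in spirit to trace-ID-ratio samplers. This is an illustration in plain Python, not the PHP SDK's sampler; the routes and ratios in `ROUTE_RATIOS` and the `should_sample` helper are assumptions, since the episode gives no concrete numbers.

```python
# Hypothetical ratios: hot endpoints get a lower ratio, health checks are dropped.
ROUTE_RATIOS = {
    "/health": 0.0,
    "/checkout": 0.05,
}
DEFAULT_RATIO = 0.25


def should_sample(trace_id_hex: str, route: str) -> bool:
    """Decide whether to keep a trace, given its 32-hex-char trace ID and route."""
    ratio = ROUTE_RATIOS.get(route, DEFAULT_RATIO)
    # Treat the low 64 bits of the trace ID as a uniformly distributed number,
    # so every service that sees the same trace ID reaches the same decision.
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * 2**64


print(should_sample("0" * 32, "/checkout"))
```

Because the decision depends only on the trace ID and the route, it needs no shared state and stays consistent across processes.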
Results: “almost APM”, with an open stack
Lower p95 on critical routes after we identified two N+1 queries and a misconfigured cache.
Fewer ghost incidents: log↔trace correlation reduced MTTR.
Better DX: devs run the service with tracing in a local environment and reproduce problems.
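As an illustration of how trace data exposes an N+1 query, the sketch below counts near-identical database statements inside a single trace. It is a hypothetical offline check in Python over exported span dictionaries, not the way the team found their queries. The `trace_id` and `db.statement` keys, the `find_n_plus_one` helper, and the threshold of 5 are all assumptions for the example.

```python
import re
from collections import Counter


def normalize(statement: str) -> str:
    """Replace literals so the same query with different IDs compares equal."""
    statement = re.sub(r"'[^']*'", "?", statement)   # quoted strings
    statement = re.sub(r"\b\d+\b", "?", statement)   # standalone numbers
    return " ".join(statement.split())               # collapse whitespace


def find_n_plus_one(spans, threshold=5):
    """Return (trace_id, statement, count) for statements repeated in one trace."""
    counts = Counter()
    for span in spans:
        stmt = span.get("db.statement")
        if stmt:
            counts[(span["trace_id"], normalize(stmt))] += 1
    return [(tid, stmt, n) for (tid, stmt), n in counts.items() if n >= threshold]


spans = [{"trace_id": "t1", "db.statement": "SELECT * FROM orders WHERE customer_id = 7"}]
spans += [
    {"trace_id": "t1", "db.statement": f"SELECT * FROM items WHERE order_id = {i}"}
    for i in range(6)
]
print(find_n_plus_one(spans))
```

In a trace viewer the same pattern shows up visually as a long staircase of short, similar DB spans under one parent span.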
Pitfalls and how we avoided them
Noise: too many spans quickly become overwhelming. Use route filters and limit attributes.
Cardinality: beware of highly variable attributes (e.g., long IDs). Use hashing or bucketization.
Data leaks: apply redactors in the Collector; treat sensitive headers as PII.
Legacy debt: don't try to "fix the world" in one sprint; attack the highest-impact flows first.
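The cardinality and data-leak items can be sketched as one attribute-sanitizing step. The article applies redaction in the Collector; the Python below only illustrates the transformation itself and is not Collector configuration. Note that hashing alone keeps the number of distinct values unchanged, so the sketch hashes IDs into a fixed number of buckets. The key names, the bucket count of 64, and the `sanitize` helper are assumptions.

```python
import hashlib

# Hypothetical attribute keys; adjust to whatever the instrumentation emits.
SENSITIVE_KEYS = {"http.request.header.authorization", "http.request.header.cookie"}
HIGH_CARDINALITY_KEYS = {"customer.id"}


def id_bucket(value: str, buckets: int = 64) -> int:
    """Map an arbitrary ID to one of `buckets` stable buckets."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def size_bucket(n_bytes: int) -> str:
    """Bucketize a numeric value (here a payload size) into a few labels."""
    for upper, label in ((1_024, "<1KiB"), (65_536, "<64KiB"), (1_048_576, "<1MiB")):
        if n_bytes < upper:
            return label
    return ">=1MiB"


def sanitize(attributes: dict) -> dict:
    """Redact sensitive values and replace high-cardinality IDs with buckets."""
    clean = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif key in HIGH_CARDINALITY_KEYS:
            clean[key + ".bucket"] = id_bucket(str(value))
        else:
            clean[key] = value
    return clean


out = sanitize({
    "http.request.header.authorization": "Bearer abc123",
    "customer.id": "9f1c-000042",
    "http.route": "/checkout",
})
print(out)
```

Bucketing trades precision for cost: a single customer can no longer be found by attribute, so the raw ID belongs in logs or in a sampled subset rather than on every span.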
Important Links:
- Marcelo França - https://www.linkedin.com/in/marceloluizfranca
- Francisco Rodrigues - https://www.linkedin.com/in/fcoedno
- Inspiring article - https://medium.com/engenharia-arquivei/instrumente-sua-aplica%C3%A7%C3%A3o-php-com-opentelemetry-cb3460a64d04
- Learn about Qive - https://qive.com.br/institucional/
- OpenTelemetry PHP - https://opentelemetry.io/docs/languages/php/
- João Brito - https://www.linkedin.com/in/juniorjbn/
Conclusion
The moral of the story is simple: observability is a product, not a project. When we treat tracing, metrics, and logs as part of the development lifecycle, difficult decisions become cheaper. And yes, it is possible to modernize a legacy system without rewriting everything — one span at a time.
If you enjoyed this, listen to the full episode of Kubicast #191 and tell us how you are instrumenting your services. We want your war stories and, of course, your victories!
