
KUBICAST #191 - OpenTelemetry in PHP with QIVE

The Qive team joined us to share their experience.

mansplainer

João Brito


TL;DR: We took a PHP monolith (Zend), applied OpenTelemetry focusing on tracing, standardized spans/attributes, integrated with Grafana Tempo and Loki, and collected measurable gains (p95 reduction and less instability). Here are the practical decisions, pitfalls, and checklist for you to repeat — without panic and without rewriting the system.

Context: why stir up a hornet's nest?

Every team has that service that "works... until it doesn't". In our case, a PHP monolith with legacy dependencies, critical endpoints, and erratic behavior under load. We needed end-to-end visibility to answer three simple questions: what is slow, why is it slow, and how much does it cost to speed it up. Instead of a closed APM, we bet on OpenTelemetry (OTel) to capture traces, metrics, and logs in an open and portable way.

Why OpenTelemetry in PHP (and legacy systems)?

  1. Market standard: SDKs and semantic conventions shared across languages.

  2. Neutrality: it didn't lock us in to a vendor; we could send data to Grafana Tempo without friction.

  3. Controllable cost: sampling, filtering, and enrichment gave us the autonomy to balance noise, latency, and the storage bill.

“But it's Zend, not Laravel…”

Yes, and that's fine. We started with auto-instrumentation for HTTP, DB, and some common frameworks; where there was no ready hook, we added manual spans in the most critical flows. The secret was to standardize names/attributes so that the charts made sense to both devs and SREs.
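
To make the naming idea concrete, here is a minimal sketch of a shared convention, written in Python only to keep it language-neutral. The helper names (span_name, normalize_attributes) and the allow-list are hypothetical and not from the episode; the "METHOD route-template" span name follows the usual OpenTelemetry guidance for HTTP server spans.

```python
# Hypothetical allow-list: only these attribute keys are kept on spans.
ALLOWED_KEYS = {"http.request.method", "http.route", "db.system"}


def span_name(method: str, route_template: str) -> str:
    """Name the span after the route template, never the raw URL."""
    return f"{method.upper()} {route_template}"


def normalize_attributes(attrs: dict) -> dict:
    """Lower-case and trim keys, then drop anything outside the allow-list."""
    cleaned = {}
    for key, value in attrs.items():
        key = key.strip().lower()
        if key in ALLOWED_KEYS:
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    print(span_name("get", "/invoices/{id}"))
    print(normalize_attributes({" HTTP.Route ": "/invoices/{id}", "debug": 1}))
```

Using the route template rather than the concrete path keeps span names stable, so dashboards can group requests without one series per ID.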

Costs, overhead, and what really matters

  • Overhead: acceptable when sampling is well calibrated. "Hot" endpoints need a lower sampling rate.

  • Storage: retention by criticality. Incidents get longer retention via tags.

  • CPU/memory: watch out for SDK serialization and flushing; we preferred gRPC due to its footprint and native backoff.
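
A per-route sampling rate like the one described for "hot" endpoints can be sketched as follows. This is a simplified, deterministic ratio check in Python, not the sampler shipped by any SDK; the routes, the ratios, and the function name should_sample are assumptions for illustration.

```python
# Hypothetical ratios: hot or noisy routes get a lower sampling rate.
ROUTE_RATIOS = {"/health": 0.0, "/checkout": 0.05}
DEFAULT_RATIO = 0.25


def should_sample(trace_id_hex: str, route: str) -> bool:
    """Decide from the trace ID itself, so a given trace always gets the same answer."""
    ratio = ROUTE_RATIOS.get(route, DEFAULT_RATIO)
    # Use the low 64 bits of the 128-bit trace ID as a number in [0, 2**64).
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < ratio * 2**64


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(should_sample(trace_id, "/checkout"))
    print(should_sample(trace_id, "/reports"))
```

Deriving the decision from the trace ID, instead of calling a random generator per request, keeps the choice consistent when several components evaluate the same trace.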

Results: “almost APM”, with an open stack

  • Drop in p95 latency on critical routes after identifying two N+1 queries and a misconfigured cache.

  • Fewer ghost incidents: log↔trace correlation reduced MTTR.

  • Better DX: devs run the service with tracing in a local environment and replicate problems.
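
The log↔trace correlation mentioned above depends on every log line carrying the identifiers of the active trace. A minimal sketch in Python, assuming structured JSON logs; the field names (trace_id, span_id) and the format_log helper are hypothetical, not something the episode specifies.

```python
import json
from datetime import datetime, timezone


def format_log(level: str, message: str, trace_id: str, span_id: str) -> str:
    """Render one JSON log line that carries the active trace and span IDs."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "trace_id": trace_id,  # lets the log backend link to the trace
        "span_id": span_id,
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(format_log("error", "invoice lookup failed",
                     "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
```

With the trace ID present in each line, a log query can jump straight to the matching trace instead of relying on timestamps.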

Pitfalls and how we avoided them

  • Noise: too many spans quickly become overwhelming. Use route filters and limit attributes.

  • Cardinality: beware of highly variable attributes (e.g., long IDs). Use hashing or bucketization.

  • Data leaks: apply redactors in the Collector; treat sensitive headers as PII.

  • Old debt: don't try to "fix the world" in one sprint; attack the highest-impact flows first.
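
The cardinality and data-leak advice above can be illustrated with a small sketch. It is written in Python and runs in-process purely to show the logic; the post applies redaction in the Collector, and the bucket count, size limits, header list, and function names here are all assumptions.

```python
import hashlib

# Hypothetical list of headers treated as sensitive.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}


def hash_bucket(value: str, buckets: int = 64) -> str:
    """Map an unbounded ID onto a fixed number of buckets to cap cardinality."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"bucket-{int(digest[:8], 16) % buckets}"


def size_bucket(n: int) -> str:
    """Replace an exact count with a coarse range label."""
    for limit in (10, 100, 1000):
        if n <= limit:
            return f"<={limit}"
    return ">1000"


def redact_headers(headers: dict) -> dict:
    """Mask sensitive header values; header names are matched case-insensitively."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


if __name__ == "__main__":
    print(hash_bucket("customer-8f1c2a9e-77d0"))
    print(size_bucket(250))
    print(redact_headers({"Authorization": "Bearer abc", "Accept": "application/json"}))
```

Hashing into a fixed number of buckets trades exact lookup for bounded cardinality; if the exact ID is needed for debugging, it is better kept in logs than on span attributes.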



Important Links:

- Marcelo França - https://www.linkedin.com/in/marceloluizfranca

- Francisco Rodrigues - https://www.linkedin.com/in/fcoedno

- Inspiring article - https://medium.com/engenharia-arquivei/instrumente-sua-aplica%C3%A7%C3%A3o-php-com-opentelemetry-cb3460a64d04

- Learn about Qive - https://qive.com.br/institucional/

- OpenTelemetry PHP - https://opentelemetry.io/docs/languages/php/

- João Brito - https://www.linkedin.com/in/juniorjbn/

Conclusion

The moral of the story is simple: observability is a product, not a project. When we treat tracing, metrics, and logs as part of the development lifecycle, difficult decisions become cheaper. And yes, it is possible to modernize a legacy system without rewriting everything — one span at a time.

If you enjoyed this, listen to the full episode of Kubicast #191 and tell us how you are instrumenting your services. We want your war stories and, of course, your victories!
