Observability Beyond Metrics: Logging, Tracing, and Context

Samalan Team

April 5, 2026

You have great metrics dashboards. CPU, memory, requests per second. But when something goes wrong, you're blind.

Observability isn't just metrics. It's metrics + logs + traces + context. This guide shows you how to implement real observability.

The Three Pillars of Observability

Pillar 1: Metrics (What's Happening?)

System resource usage

Application throughput

Business metrics

Error rates

**Example:** "Error rate went from 0.1% to 2.5%"

Pillar 2: Logs (What Details?)

Structured logs with context

Error stack traces

Transaction logs

Audit logs

**Example:** "User 42 received error 'payment_failed' at 2026-04-05 14:23:15.234 UTC"

Pillar 3: Traces (Why Did This Happen?)

Request journey through services

Latency distribution

Service dependencies

Bottlenecks

**Example:** "Request spent 800ms in database service, 100ms in cache service, 50ms in API service"

Implementing Effective Logging

Most teams log too much or too little.

What to Log

**DO log:**

Errors and exceptions (with stack trace)

Important state changes

User actions (if appropriate)

External API calls and responses

Targeted debugging information (at DEBUG level, so it can be switched off)

**DON'T log:**

Every function entry/exit

Passwords or PII

Irrelevant debug spam

Info you already have in metrics
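One way to enforce the "no passwords or PII" rule is to mask sensitive values before an event reaches the logger. The following is a minimal sketch, not a complete PII scrubber: the key names in `SENSITIVE_KEYS`, the simplified email pattern, and the `redact` helper are all assumptions made for illustration.

```python
import re

# Hypothetical field names; adjust to whatever your services actually log.
SENSITIVE_KEYS = {"password", "token", "card_number", "email"}

# Deliberately simple email pattern; real-world PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(event: dict) -> dict:
    """Return a copy of a log event with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Mask email addresses embedded in free-text fields.
            clean[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean


event = {"order_id": 12345, "password": "hunter2",
         "message": "Failed to process order from john@example.com"}
print(redact(event))
```

Calling `redact` at the single place where events are emitted keeps the rule enforceable, rather than relying on every call site to remember it.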

Structured Logging

Don't log strings. Log structured data.

**Bad:**

ERROR: Failed to process order 12345 from user john@example.com

**Good:**

    {
      "level": "error",
      "timestamp": "2026-04-05T14:23:15.234Z",
      "service": "payment-service",
      "request_id": "req-98765",
      "user_id": 42,
      "order_id": 12345,
      "error": "payment_provider_timeout",
      "error_message": "Stripe API timeout after 30s",
      "status_code": 502,
      "retry_count": 2
    }

Structured logs are searchable, filterable, and analyzable.
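As a rough illustration, here is how a service could emit events shaped like the "Good" example using only Python's standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields through `extra={"fields": ...}` are invented for this example and are not a standard API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "level": record.levelname.lower(),
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Assumed convention: structured fields arrive via extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment-service")
logger.error(
    "Stripe API timeout after 30s",
    extra={"fields": {"request_id": "req-98765", "user_id": 42, "order_id": 12345,
                      "error": "payment_provider_timeout", "retry_count": 2}},
)
```

Because every event is a single JSON object, a log platform can index `request_id` or `error` as fields instead of parsing free text.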

Log Levels

**ERROR:** Something failed that shouldn't have

**WARN:** Something unexpected but potentially recoverable

**INFO:** Important application events

**DEBUG:** Detailed diagnostic information

Implementing Distributed Tracing

Traces show a request's journey through your system.

A single user request might touch:

1. API gateway

2. Authentication service

3. User service

4. Order service

5. Payment service

6. Notification service

Each hop takes time. Each hop can fail.

What to Trace

Every service should:

Accept a trace ID from upstream

Add its own span with processing details

Pass trace ID downstream

Report back timing and status

**Result:** You can see the entire request flow with microsecond-level detail.
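The four responsibilities above can be sketched in plain Python. This is a hand-rolled illustration, not the OpenTelemetry API: `Span`, `start_span`, `outgoing_headers`, `handle`, and the in-memory `REPORTED` list are hypothetical names, and services are simulated as nested function calls. The header layout follows the W3C Trace Context `traceparent` format (version, 32-hex trace ID, 16-hex parent span ID, flags).

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start: float = field(default_factory=time.perf_counter)
    duration_ms: Optional[float] = None
    status: str = "unset"

    def finish(self, status: str = "ok") -> None:
        self.duration_ms = (time.perf_counter() - self.start) * 1000
        self.status = status


def start_span(service: str, headers: dict) -> Span:
    """Accept the trace ID from upstream, or start a new trace at the edge."""
    parent = headers.get("traceparent")
    if parent:
        _version, trace_id, parent_id, _flags = parent.split("-")
    else:
        trace_id, parent_id = secrets.token_hex(16), None
    return Span(trace_id, secrets.token_hex(8), parent_id, service)


def outgoing_headers(span: Span) -> dict:
    """Pass the trace ID downstream, with this span as the new parent."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}


REPORTED: List[Span] = []  # stand-in for a tracing backend


def handle(service: str, headers: dict, downstream: List[str]) -> None:
    """Simulate one service: open a span, call the next hop, report the result."""
    span = start_span(service, headers)
    try:
        if downstream:
            handle(downstream[0], outgoing_headers(span), downstream[1:])
        span.finish("ok")
    except Exception:
        span.finish("error")
        raise
    finally:
        REPORTED.append(span)  # report timing and status


handle("api-gateway", {}, ["auth-service", "order-service"])
for span in REPORTED:
    print(span.service, span.trace_id, span.parent_id, span.status)
```

In a real system you would most likely let an OpenTelemetry SDK or your APM agent do this; the sketch only shows what those libraries do on your behalf.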

Tools

**OpenTelemetry:** Language-agnostic, vendor-neutral standard for instrumentation

**Jaeger:** Open source distributed tracing

**Datadog APM:** Commercial solution

**New Relic:** Commercial solution

Building Context

The best debugging information is context. When something goes wrong:

What was the user doing?

What was the system state?

What had changed recently?

What was this service trying to do?

Context Information to Capture

Request ID (link everything to this request)

User ID

Feature flag state

Environment (staging vs. production)

Service version/deployment

Recent configuration changes

System resource usage

Putting It Together: The Debugging Experience

**Old way:** Incident happens at 3am

1. Page goes off

2. Run to computer

3. Check metrics dashboard

4. See error rate spike but little detail

5. SSH into server

6. Grep logs (10,000 matching lines)

7. Give up, restart service

8. Problem goes away

9. Never figure out root cause

**New way:** Incident happens at 3am

1. Page goes off

2. Open observability platform

3. Error rate spike shows in metrics

4. Click to see recent errors

5. Find relevant logs with request ID

6. Follow trace to see where latency occurred

7. Find the slow database query

8. See that the query is using the wrong index

9. Fix index

10. Problem solved

11. Postmortem shows how to prevent a recurrence
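Steps 5 through 7 of the new way boil down to two lookups: filter logs by request ID, then find the slowest span in that request's trace. An observability platform does this for you, but a minimal sketch makes the mechanics concrete. The helper names and the sample data are assumptions; the span durations reuse the figures from the Pillar 3 example.

```python
import json


def logs_for_request(lines, request_id):
    """Yield parsed JSON log events that belong to one request."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than fail mid-incident
        if isinstance(event, dict) and event.get("request_id") == request_id:
            yield event


def slowest_span(spans):
    """Return the span dict with the largest duration_ms."""
    return max(spans, key=lambda s: s["duration_ms"])


raw_lines = [
    '{"level": "info", "request_id": "req-11111", "message": "order created"}',
    '{"level": "error", "request_id": "req-98765", "error": "payment_provider_timeout"}',
    "plain text line from an old library",
]
spans = [
    {"service": "database", "duration_ms": 800},
    {"service": "cache", "duration_ms": 100},
    {"service": "api", "duration_ms": 50},
]

matches = list(logs_for_request(raw_lines, "req-98765"))
print(matches[0]["error"])             # payment_provider_timeout
print(slowest_span(spans)["service"])  # database
```

Neither lookup is possible without a shared request ID and structured events, which is why the earlier steps come first.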

Implementation Timeline

Week 1: Structured Logging

Choose logging format (JSON)

Add logging to critical paths

Deploy to staging

Test log analysis

Week 2: Basic Tracing

Instrument major services

Add trace ID propagation

Deploy tracing backend

Create diagnostic dashboards

Week 3: Context and Correlation

Add request context to logs

Link logs to traces

Add business context

Create runbooks with examples

Week 4: Alerting on Patterns

Set up alerts on trace patterns

Create custom dashboards

Document troubleshooting procedures

Train team on new tools

Metrics to Track

Trace completion rate (% of traces that contain spans from every service the request touched)

Trace latency (p50, p95, p99)

Log ingest rate and storage costs

Query response time for diagnostics
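Two of these metrics are simple to compute from exported trace data. The sketch below assumes each trace is available as a list of service names and each latency as a number of milliseconds; `percentile` uses the nearest-rank method, which is one of several common definitions, and both function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def completion_rate(traces, expected_services):
    """Percent of traces containing a span from every expected service."""
    if not traces:
        return 0.0
    expected = set(expected_services)
    complete = sum(1 for services in traces if expected <= set(services))
    return 100.0 * complete / len(traces)


latencies_ms = list(range(1, 101))  # made-up data: 1 ms through 100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))

traces = [["api", "auth", "order"], ["api", "auth"], ["api", "auth", "order"], ["api"]]
print(completion_rate(traces, ["api", "auth", "order"]))
```

A falling completion rate usually means a service dropped the trace ID somewhere, so it is worth alerting on alongside latency.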

Common Pitfalls

**"Observability is too expensive"**

→ It's not if you're selective. Log errors, not everything.

**"We don't have time to add tracing"**

→ Add it incrementally. Start with critical paths.

**"Our logs are already massive"**

→ You're logging too much. Fewer, better-structured logs are more valuable.
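Being selective can be enforced in code rather than by convention. One possible approach, sketched here, is a `logging.Filter` that always keeps warnings and errors and keeps only a fraction of lower-level records. Sampling is an assumption on our part, not the only way to cut volume, and `SelectiveFilter` and its `sample_rate` parameter are hypothetical names.

```python
import logging
import random


class SelectiveFilter(logging.Filter):
    """Keep every WARNING-or-above record; keep only a fraction of the rest."""

    def __init__(self, sample_rate: float, rng=None):
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return self.rng.random() < self.sample_rate


logger = logging.getLogger("checkout-service")
logger.addFilter(SelectiveFilter(sample_rate=0.05))  # keep roughly 5% of INFO/DEBUG
```

The trade-off is that sampled-out records are gone for good, so keep the rate high enough that routine events are still represented.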

Your Next Step

Start with one service. Add structured logging. Deploy to production. See what you learn.

Observability is a practice, not a destination. Every improvement helps.

Ready to see into your systems? [Let's talk](/contact).

About the Author

Samalan Team is a platform reliability specialist with 15+ years of experience helping companies build scalable, reliable systems, specializing in Kubernetes, platform engineering, and operational excellence.
