Observability Beyond Metrics: Logging, Tracing, and Context

Samalan Team
April 5, 2026
10 min read
Observability

You have great metrics dashboards. CPU, memory, requests per second. But when something goes wrong, you're blind.

Observability isn't just metrics. It's metrics + logs + traces + context. This guide shows you how to implement real observability.

The Three Pillars of Observability

Pillar 1: Metrics (What's Happening?)

  • System resource usage
  • Application throughput
  • Business metrics
  • Error rates
  • **Example:** "Error rate went from 0.1% to 2.5%"

Pillar 2: Logs (What Details?)

  • Structured logs with context
  • Error stack traces
  • Transaction logs
  • Audit logs
  • **Example:** "User 42 received error 'payment_failed' at 2026-04-05 14:23:15.234 UTC"

Pillar 3: Traces (Why Did This Happen?)

  • Request journey through services
  • Latency distribution
  • Service dependencies
  • Bottlenecks
  • **Example:** "Request spent 800ms in database service, 100ms in cache service, 50ms in API service"
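
As a rough illustration of how a trace answers "why", the sketch below takes the three span durations from the example above and reports which service dominated the request. The `Span` class and the `slowest_span` helper are hypothetical names made up for this illustration; they are not part of any tracing library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Span:
    """One service's share of a request (simplified, hypothetical model)."""
    service: str
    duration_ms: float


def slowest_span(spans: List[Span]) -> Tuple[Span, float]:
    """Return the longest span and its share of the summed span time."""
    total = sum(s.duration_ms for s in spans)
    worst = max(spans, key=lambda s: s.duration_ms)
    return worst, worst.duration_ms / total


# Durations taken from the example above.
trace = [
    Span("database", 800),
    Span("cache", 100),
    Span("api", 50),
]

worst, share = slowest_span(trace)
print(f"{worst.service}: {worst.duration_ms:.0f}ms ({share:.0%} of span time)")
```

A metrics dashboard would only show that the request was slow; the per-span breakdown points at the database as the place to look first.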

Implementing Effective Logging

Most teams log too much or too little.

What to Log

**DO log:**

  • Errors and exceptions (with stack trace)
  • Important state changes
  • User actions (if appropriate)
  • External API calls and responses
  • Debugging information

**DON'T log:**

  • Every function entry/exit
  • Passwords or PII
  • Irrelevant debug spam
  • Info you already have in metrics
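
One possible way to keep passwords and PII out of logs is a filter that masks sensitive fields before a record is emitted. The sketch below uses Python's standard `logging` module. The `SENSITIVE_KEYS` set, the `RedactingFilter` class, and the convention of passing structured data as a `fields` dict through `extra` are all assumptions made for this example.

```python
import logging

# Hypothetical list of field names to mask; adjust to your own schema.
SENSITIVE_KEYS = {"password", "email", "card_number"}


class RedactingFilter(logging.Filter):
    """Mask sensitive values in a record's `fields` dict before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            record.fields = {
                key: "[REDACTED]" if key in SENSITIVE_KEYS else value
                for key, value in fields.items()
            }
        return True  # never drop the record, only scrub it


logger = logging.getLogger("payment-service")
logger.addFilter(RedactingFilter())
logger.warning(
    "login failed",
    extra={"fields": {"user_id": 42, "email": "john@example.com"}},
)
```

Masking by field name only works when logs are structured, which is one more reason to avoid building log messages by string concatenation.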

Structured Logging

Don't log strings. Log structured data.

**Bad:**

    ERROR: Failed to process order 12345 from user john@example.com

**Good:**

    {
      "level": "error",
      "timestamp": "2026-04-05T14:23:15.234Z",
      "service": "payment-service",
      "request_id": "req-98765",
      "user_id": 42,
      "order_id": 12345,
      "error": "payment_provider_timeout",
      "error_message": "Stripe API timeout after 30s",
      "status_code": 502,
      "retry_count": 2
    }

Structured logs are searchable, filterable, and analyzable.
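
A minimal sketch of producing logs in that shape with Python's standard `logging` module follows. The `JsonFormatter` class and the `extra={"fields": {...}}` convention are assumptions for illustration; many teams use an existing JSON logging library instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": record.levelname.lower(),
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Structured fields passed via `extra={"fields": {...}}` (a convention
        # assumed for this sketch, not something the logging module defines).
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Stripe API timeout after 30s",
    extra={"fields": {
        "request_id": "req-98765",
        "user_id": 42,
        "order_id": 12345,
        "error": "payment_provider_timeout",
        "status_code": 502,
        "retry_count": 2,
    }},
)
```

Because every field is a key rather than part of a sentence, a log platform can filter on `request_id` or aggregate by `error` without regular expressions.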

Log Levels

  • **ERROR:** Something failed that shouldn't have
  • **WARN:** Something unexpected but potentially recoverable
  • **INFO:** Important application events
  • **DEBUG:** Detailed diagnostic information
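
To make the levels concrete, here is a small sketch using Python's standard `logging` module. The `order-service` logger name and the choice of INFO as the threshold are assumptions; the messages are invented examples.

```python
import logging

logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)  # assumed production threshold: DEBUG is dropped

logger.debug("cart contents: %s", ["sku-1", "sku-2"])  # suppressed at INFO
logger.info("order created")                           # important event
logger.warning("payment retry %d of %d", 2, 3)         # unexpected, recoverable
logger.error("payment failed after retries")           # should not have happened

# Guard expensive diagnostics so they cost nothing when DEBUG is off.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("full request dump: %r", {"large": "payload"})
```

Raising or lowering the threshold per environment is one way to keep detailed diagnostics available in staging without paying for them in production.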

Implementing Distributed Tracing

Traces show a request's journey through your system.

A single user request might touch:

1. API gateway
2. Authentication service
3. User service
4. Order service
5. Payment service
6. Notification service

Each hop takes time. Each hop can fail.

What to Trace

Every service should:

  • Accept a trace ID from upstream
  • Add its own span with processing details
  • Pass trace ID downstream
  • Report back timing and status

**Result:** You can see the entire request flow, with timing and status for every hop.
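
The four responsibilities above can be sketched with the W3C Trace Context `traceparent` header, whose format is `version-traceid-parentid-flags`. This example uses only the standard library to show the mechanics; `parse_traceparent` and `handle_request` are hypothetical helpers, and a real service would normally rely on a tracing SDK rather than hand-rolled code.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


def parse_traceparent(header: Optional[str]) -> str:
    """Return the upstream trace ID, or start a new trace if none was sent."""
    if header:
        parts = header.split("-")
        # Format: version-traceid-parentid-flags, e.g. 00-<32 hex>-<16 hex>-01
        if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
            return parts[1]
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def handle_request(headers: Dict[str, str], service: str) -> Tuple[dict, dict]:
    """Accept the trace ID, record a span, and build headers for the next hop."""
    trace_id = parse_traceparent(headers.get("traceparent"))
    span_id = secrets.token_hex(8)  # this service's own span
    start = time.perf_counter()
    # ... the service's real work would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000
    span = {
        "trace_id": trace_id,
        "span_id": span_id,
        "service": service,
        "duration_ms": duration_ms,
        "status": "ok",
    }
    # Same trace ID, this span as the new parent, sampled flag set.
    downstream = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    return span, downstream


gateway_span, headers = handle_request({}, "api-gateway")
auth_span, headers = handle_request(headers, "auth-service")
print(gateway_span["trace_id"] == auth_span["trace_id"])
```

Both spans carry the same trace ID, which is what lets a tracing backend stitch the hops back into one request.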

Tools

  • **OpenTelemetry:** Language-agnostic standard
  • **Jaeger:** Open source distributed tracing
  • **Datadog APM:** Commercial solution
  • **New Relic:** Commercial solution

Building Context

The best debugging information is context. When something goes wrong, you need to know:

  • What was the user doing?
  • What was the system state?
  • What had changed recently?
  • What was this service trying to do?

Context Information to Capture

  • Request ID (link everything to this request)
  • User ID
  • Feature flag state
  • Environment (staging vs. production)
  • Service version/deployment
  • Recent configuration changes
  • System resource usage
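
One way to attach this context to every log line without threading it through each function call is Python's `contextvars` module combined with a logging filter. This is a sketch under assumptions: `request_context`, `bind_context`, and `ContextFilter` are hypothetical names, and the version string and feature flag are invented values.

```python
import contextvars
import logging

# Per-request context; contextvars keeps it separate per thread and asyncio task.
request_context = contextvars.ContextVar("request_context", default={})


class ContextFilter(logging.Filter):
    """Copy the current request context onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.context = dict(request_context.get())
        return True


def bind_context(**fields) -> contextvars.Token:
    """Merge fields into the current context; undo with the returned token."""
    merged = {**request_context.get(), **fields}
    return request_context.set(merged)


logger = logging.getLogger("order-service")
logger.addFilter(ContextFilter())

token = bind_context(
    request_id="req-98765",
    user_id=42,
    environment="production",
    service_version="1.4.2",               # hypothetical version string
    feature_flags={"new_checkout": True},  # hypothetical flag name
)
try:
    logger.warning("inventory check slow")  # record.context carries the fields
finally:
    request_context.reset(token)  # do not leak context into the next request
```

Binding the context once at the edge of the request means code deeper in the call stack can log normally and still produce records that link back to the request ID.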

Putting It Together: The Debugging Experience

**Old way:** Incident happens at 3am

1. Page goes off
2. Run to computer
3. Check metrics dashboard
4. See error rate spike but little detail
5. SSH into server
6. Grep logs (10,000 matching lines)
7. Give up, restart service
8. Problem goes away
9. Never figure out root cause

**New way:** Incident happens at 3am

1. Page goes off
2. Open observability platform
3. Error rate spike shows in metrics
4. Click to see recent errors
5. Find relevant logs with request ID
6. Follow trace to see where latency occurred
7. Find the slow database query
8. Query is using the wrong index
9. Fix index
10. Problem solved
11. Postmortem shows how to prevent a recurrence
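
The difference between grepping 10,000 lines and finding the relevant logs by request ID comes down to structured fields. The sketch below shows the idea on a handful of JSON log lines; `logs_for_request` is a hypothetical helper and the sample lines are invented. An observability platform would run the equivalent query for you.

```python
import json


def logs_for_request(lines, request_id):
    """Yield parsed log entries that belong to one request."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any unstructured lines
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            yield entry


# Hypothetical sample; in practice this would be a log file or a query result.
sample = [
    '{"level": "info", "request_id": "req-11111", "message": "order created"}',
    '{"level": "error", "request_id": "req-98765", "error": "payment_provider_timeout"}',
    "plain text line that is not JSON",
    '{"level": "warn", "request_id": "req-98765", "message": "retrying payment"}',
]

for entry in logs_for_request(sample, "req-98765"):
    print(entry["level"], entry.get("error") or entry.get("message"))
```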

Implementation Timeline

Week 1: Structured Logging

  • Choose logging format (JSON)
  • Add logging to critical paths
  • Deploy to staging
  • Test log analysis

Week 2: Basic Tracing

  • Instrument major services
  • Add trace ID propagation
  • Deploy tracing backend
  • Create diagnostic dashboards

Week 3: Context and Correlation

  • Add request context to logs
  • Link logs to traces
  • Add business context
  • Create runbooks with examples

Week 4: Alerting on Patterns

  • Set up alerts on trace patterns
  • Create custom dashboards
  • Document troubleshooting procedures
  • Train team on new tools

Metrics to Track

  • Trace completion rate (% of traces that include spans from every service on the request path)
  • Trace latency (p50, p95, p99)
  • Log ingest rate and storage costs
  • Query response time for diagnostics
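
As a sketch of how the first two metrics could be computed from exported trace data, the example below uses Python's `statistics` module. The function names, the sample durations, and the service names are assumptions for illustration; a tracing backend would typically provide these numbers directly.

```python
import statistics


def latency_percentiles(durations_ms):
    """Return p50, p95 and p99 of trace durations (needs at least two values)."""
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def completion_rate(traces, expected_services):
    """Share of traces with a span from every expected service (non-empty input)."""
    expected = set(expected_services)
    complete = sum(1 for services in traces if expected <= set(services))
    return complete / len(traces)


# Hypothetical sample data: trace durations in ms and services seen per trace.
durations = [120, 135, 150, 180, 210, 240, 300, 450, 800, 950]
traces = [
    {"api-gateway", "auth", "order", "payment"},
    {"api-gateway", "auth", "order"},  # payment span missing
    {"api-gateway", "auth", "order", "payment"},
    {"api-gateway", "auth", "order", "payment"},
]

print(latency_percentiles(durations))
print(completion_rate(traces, ["api-gateway", "auth", "order", "payment"]))
```

A falling completion rate usually means a service stopped propagating the trace ID or stopped reporting spans, which is worth knowing before the next incident.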

Common Pitfalls

**"Observability is too expensive"**

→ It's not if you're selective. Log errors, not everything.

**"We don't have time to add tracing"**

→ Add it incrementally. Start with critical paths.

**"Our logs are already massive"**

→ You're logging too much. Fewer, better-structured logs are more valuable.

Your Next Step

Start with one service. Add structured logging. Deploy to production. See what you learn.

Observability is a practice, not a destination. Every improvement helps.

Ready to see into your systems? [Let's talk](/contact).

About the Author

Samalan Team is a platform reliability specialist with 15+ years of experience helping companies build scalable, reliable systems. Specializing in Kubernetes, platform engineering, and operational excellence.
