Monitoring & Observability¶

Table of Contents¶

Metrics - Prometheus + Grafana
Tracing - OpenTelemetry
Health Checks - /healthz, /readyz, /metrics
Alerting - PagerDuty, Slack, Grafana Alerts
Dashboards - Grafana Example

flowchart LR
    subgraph App["Your Application"]
        AppCode["Node.js App<br/>Express / Nest / Next"]
        Logs["JSON Logs<br/>pino / winston"]
        MetricsApp["Metrics Endpoint<br/>/metrics (prom-client)"]
        TracesApp["Tracing SDK<br/>@opentelemetry"]
    end

    subgraph Collection["Collection Layer"]
        Prom["Prometheus<br/>pulls metrics"]
        Loki["Loki<br/>stores logs"]
        Tempo["Tempo<br/>stores traces"]
        OTEL["OpenTelemetry<br/>Collector"]
    end

    subgraph Visualize["Visualization & Alerting"]
        Grafana["Grafana<br/>dashboards"]
        Alerts["Alertmanager<br/>PagerDuty / Slack"]
    end

    MetricsApp -->|"scrape /metrics"| Prom
    Logs -->|"push"| Loki
    TracesApp -->|"OTLP"| OTEL --> Tempo
    Prom --> Grafana
    Loki --> Grafana
    Tempo --> Grafana
    Prom --> Alerts

    style AppCode fill:#3498db,color:#fff
    style Logs fill:#f1c40f,color:#000
    style MetricsApp fill:#2ecc71,color:#fff
    style TracesApp fill:#9b59b6,color:#fff
    style Prom fill:#e74c3c,color:#fff
    style Loki fill:#e67e22,color:#fff
    style Tempo fill:#9b59b6,color:#fff
    style Grafana fill:#f1c40f,color:#000
    style Alerts fill:#e74c3c,color:#fff

Logging everything with console.log and praying you find the bug in 10k lines of mixed text isn't monitoring - it's a cry for help. I've seen production apps running exactly like this, because teams don't invest in observability until after the first post-mortem.

Observability means you can understand what your system is doing without deploying new code. It rests on three pillars: logs, metrics, and traces.

Structured Logging - JSON or GTFO¶

// Bad - 2007 called, they want their debugging back
console.log(`User ${userId} logged in from ${ip}`)
// grep "logged in from" /var/log/app.log | awk '{print $3}'
// This is why parsing logs is a nightmare

// Good - structured JSON
const logger = require('pino')()    // Faster than Winston and Bunyan

logger.info({
  event: 'user.login',
  userId: userId,
  ip: ip,
  userAgent: req.headers['user-agent'],
  duration: Date.now() - startTime
}, 'User login successful')

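// Log output - one JSON object per line, parseable by any log aggregation tool
// (sample values; the message string ends up in the "msg" field)
{"level":30,"time":1712345678901,"pid":1234,"hostname":"web-1","event":"user.login","userId":"abc123","ip":"203.0.113.42","userAgent":"Mozilla/5.0","duration":42,"msg":"User login successful"}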
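Because every line is a standalone JSON object, analysis becomes a parse-and-filter job instead of a regex exercise. A minimal sketch using only Node's standard library; the sample lines and the `averageDuration` helper are made up for illustration:

```javascript
// Sample NDJSON log lines (hypothetical data, shaped like the pino output above)
const lines = [
  '{"level":30,"event":"user.login","userId":"abc123","duration":42}',
  '{"level":30,"event":"user.login","userId":"def456","duration":58}',
  '{"level":50,"event":"payment.failed","userId":"abc123","duration":310}',
  'not json at all'
]

// Parse one line; return null instead of throwing on malformed input
function parseLine(line) {
  try {
    return JSON.parse(line)
  } catch {
    return null
  }
}

// Average the `duration` field over all entries with the given event name
function averageDuration(rawLines, eventName) {
  const durations = rawLines
    .map(parseLine)
    .filter((entry) => entry && entry.event === eventName)
    .map((entry) => entry.duration)
  if (durations.length === 0) return null
  return durations.reduce((sum, d) => sum + d, 0) / durations.length
}

console.log(averageDuration(lines, 'user.login')) // 50
```

The same question asked of free-text logs would need a pattern per message format, and would break the first time someone rewords a message.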

Pino setup for production:

const pino = require('pino')

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label }
    }
  },
  // Redact sensitive fields
  redact: {
    paths: ['req.headers.authorization', 'req.headers.cookie', 'password', 'token'],
    censor: '[REDACTED]'
  },
  serializers: {
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
    err: pino.stdSerializers.err
  },
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' }     // Pretty print in dev
    : undefined                       // JSON in production
})

// Express middleware
app.use(require('pino-http')({ logger }))

What to log:

* Request metadata (method, path, status, duration, user ID)
* Errors with full context (stack trace, request ID, relevant state)
* Business events (user signup, purchase, data export)
* External service calls (database queries, API requests - timing and status)

What NOT to log:

* Passwords, tokens, API keys
* Personal data (emails, credit cards, SSNs)
* Request/response bodies larger than 10KB
* Binary data or base64-encoded files
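For code paths that don't go through pino's `redact` option, the rules above can be applied by hand before an object reaches the logger. A minimal sketch, assuming plain JSON-like data with no circular references; the key list, the handling of the 10KB limit, and the `sanitizeForLog` name are illustrative assumptions, not part of pino:

```javascript
// Keys whose values should never be logged (illustrative list, compared in lowercase)
const SENSITIVE_KEYS = new Set(['password', 'token', 'apikey', 'authorization', 'cookie'])
const MAX_STRING_BYTES = 10 * 1024

// Recursively copy a value, masking sensitive keys and replacing
// binary or oversized payloads with a short placeholder
function sanitizeForLog(value) {
  if (Buffer.isBuffer(value)) return `[BINARY ${value.length} bytes]`
  if (typeof value === 'string') {
    const bytes = Buffer.byteLength(value)
    return bytes > MAX_STRING_BYTES ? `[TRUNCATED ${bytes} bytes]` : value
  }
  if (Array.isArray(value)) return value.map(sanitizeForLog)
  if (value && typeof value === 'object') {
    const out = {}
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : sanitizeForLog(inner)
    }
    return out
  }
  return value
}

console.log(sanitizeForLog({ user: 'abc123', password: 'hunter2' }))
// { user: 'abc123', password: '[REDACTED]' }
```

This only catches keys you have listed. Personal data hidden inside free-form strings still needs care at the call site.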

Metrics - Prometheus + Grafana¶

Metrics are numbers over time: request rates, error rates, latency percentiles.

const prometheus = require('prom-client')

// Create a registry
const register = new prometheus.Registry()
prometheus.collectDefaultMetrics({ register })

// Custom metrics - pass `registers` so they land in the registry that
// /metrics serves; otherwise they go to prom-client's global registry
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],  // In seconds
  registers: [register]
})

const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
})

// Middleware to record metrics
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer()

  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode
    }

    httpRequestsTotal.inc(labels)
    end(labels)
  })

  next()
})

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
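One caveat with the `route` label in the middleware above: falling back to `req.path` for unmatched requests puts raw URLs into a label, and every distinct label value becomes a separate time series in Prometheus. A possible guard, sketched with a hypothetical `routeLabel` helper that collapses anything Express did not match into a single value (it assumes string route paths, not regex or array routes):

```javascript
// Return the matched route pattern (e.g. '/api/users/:id'), never the raw URL
function routeLabel(req) {
  if (req.route && typeof req.route.path === 'string') {
    // baseUrl is the mount point when the route lives on a sub-router
    return (req.baseUrl || '') + req.route.path
  }
  // 404s, scanners, and typos all share one label value
  return 'unmatched'
}

console.log(routeLabel({ baseUrl: '/api', route: { path: '/users/:id' } })) // /api/users/:id
console.log(routeLabel({ path: '/wp-admin/setup.php' }))                    // unmatched
```

In the middleware this would replace the `req.route?.path || req.path` expression.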

Prometheus config (prometheus.yml):

scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
        labels:
          service: 'backend'
          environment: 'production'

Key metrics to track:

* Request rate (requests/second) - are you doing more work?
* Error rate (% of 5xx responses) - is the app healthy?
* Latency (p50, p95, p99 response time) - is it fast enough?
* Active users - how many are using the system?
* Database query time - is the DB the bottleneck?
* Memory usage - memory leak detection
* CPU usage - compute-bound or not?
* GC pressure (Node.js) - garbage collector frequency and duration
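Latency percentiles from a Prometheus histogram are estimates: only per-bucket counts are stored, and the quantile is interpolated inside the bucket it falls into. The sketch below mimics that linear interpolation in plain JavaScript to show why bucket boundaries matter. `estimateQuantile` and the sample counts are illustrative, and this is a simplification, not Prometheus's actual implementation:

```javascript
// Cumulative bucket counts, in the shape of http_request_duration_seconds_bucket
// (hypothetical numbers: 60 requests took <= 0.1s, 90 took <= 0.5s, and so on)
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 }
]

function estimateQuantile(q, cumulative) {
  const total = cumulative[cumulative.length - 1].count
  if (total === 0) return NaN
  const rank = q * total          // how many observations lie at or below the quantile
  let lowerBound = 0
  let countBelow = 0
  for (const { le, count } of cumulative) {
    if (count >= rank) {
      // Nothing is known beyond the last finite boundary, so report that boundary
      if (le === Infinity) return lowerBound
      const inBucket = count - countBelow
      if (inBucket === 0) return lowerBound
      // Assume observations are spread evenly inside the bucket
      return lowerBound + (le - lowerBound) * ((rank - countBelow) / inBucket)
    }
    lowerBound = le
    countBelow = count
  }
  return lowerBound
}

console.log(estimateQuantile(0.95, buckets)) // 0.8125
console.log(estimateQuantile(0.99, buckets)) // 1
```

The p99 result shows the limitation: anything that lands in the `+Inf` bucket is reported as the largest finite boundary, so buckets should extend past the latencies you intend to alert on.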

Tracing - OpenTelemetry¶

Traces follow a single request through multiple services - you see where the time goes
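Between services, a trace is stitched together by passing the trace ID along with each request. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header for this. The sketch below builds and parses that header by hand, purely to show what travels over the wire. The SDK does this for you, the helper names are made up, and defaulting a new trace to "sampled" is an assumption of this example:

```javascript
const { randomBytes } = require('crypto')

// traceparent = version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '')
  if (!match) return null
  const [, version, traceId, parentId, flags] = match
  // All-zero IDs are invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 }
}

// Header for an outgoing call: keep the trace ID, mint a new span ID
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader)
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex')
  const spanId = randomBytes(8).toString('hex')
  const flags = parent && !parent.sampled ? '00' : '01'
  return `00-${traceId}-${spanId}-${flags}`
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
console.log(parseTraceparent(incoming).traceId) // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(childTraceparent(incoming))         // same trace ID, fresh span ID
```

Every service that forwards the header this way contributes spans to the same trace, which is what lets the backend draw one timeline for the whole request.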

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node')
const { Resource } = require('@opentelemetry/resources')
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions')
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express')
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')

const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'backend-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0'
  })
})

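// SimpleSpanProcessor exports each span as soon as it ends - fine for a demo,
// but BatchSpanProcessor (same package) is the usual choice in production
provider.addSpanProcessor(new SimpleSpanProcessor(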
  new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  })
))

provider.register()

// Auto-instrument Express and HTTP
// (run this before require('express') so the modules can be patched on load)
require('@opentelemetry/instrumentation').registerInstrumentations({
  instrumentations: [
    new ExpressInstrumentation(),
    new HttpInstrumentation()
  ]
})

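// Manual tracing for specific operations
const { SpanStatusCode } = require('@opentelemetry/api')
const tracer = provider.getTracer('backend-api')

app.post('/api/orders', async (req, res, next) => {
  // startActiveSpan makes this span the parent of every span created inside the callback
  await tracer.startActiveSpan('create-order', async (span) => {
    try {
      const user = await getUser(req.userId)     // Auto-traced

      const order = await tracer.startActiveSpan('db.insert-order', async (dbSpan) => {
        try {
          const result = await db.query('INSERT INTO orders ...')
          dbSpan.setAttribute('order.id', result.id)
          return result
        } finally {
          dbSpan.end()                           // A span that is never ended is never exported
        }
      })

      await sendEmail(order)                     // Auto-traced
      span.setStatus({ code: SpanStatusCode.OK })
      res.status(201).json(order)
    } catch (error) {
      span.recordException(error)
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
      next(error)                                // Hand off to the Express error handler
    } finally {
      span.end()
    }
  })
})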

Tracing tools:

* Jaeger - open-source trace visualization
* Zipkin - alternative to Jaeger
* Grafana Tempo - Grafana's tracing backend
* Honeycomb - SaaS, expensive but powerful

Health Checks - /healthz, /readyz, /metrics¶

Load balancers and orchestrators need to know when your app is actually ready

// Liveness - is the process alive?
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'ok' })
})

// Readiness - is the app ready to serve traffic?
app.get('/readyz', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
    status: 'not_ready'
  }

  // Check database
  try {
    await db.query('SELECT 1')
    checks.database = true
  } catch (err) {
    // Doesn't throw - just reports failure
  }

  // Check Redis
  try {
    await redisClient.ping()
    checks.redis = true
  } catch (err) {
    // Doesn't throw
  }

  checks.status = checks.database && checks.redis ? 'ok' : 'degraded'
  const statusCode = checks.status === 'ok' ? 200 : 503

  res.status(statusCode).json(checks)
})

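// Startup - has the app initialized?
// Express does not emit 'ready' on its own: call app.emit('ready')
// once your own initialization (DB connection, cache warm-up, ...) is done
let startupComplete = false
app.on('ready', () => { startupComplete = true })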

app.get('/startupz', (req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'ok' })
  } else {
    res.status(503).json({ status: 'starting' })
  }
})
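A dependency that hangs instead of failing would leave `/readyz` hanging too, and the orchestrator's probe would time out with no useful answer. One way to bound each check, sketched with hypothetical `withTimeout` and `runChecks` helpers and an assumed 2-second default budget:

```javascript
// Reject if the promise has not settled within `ms` milliseconds
function withTimeout(promise, ms) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// checks: { name: () => Promise }. All checks run in parallel;
// a check passes only if it resolves before the timeout.
async function runChecks(checks, ms = 2000) {
  const names = Object.keys(checks)
  const results = await Promise.allSettled(
    // Promise.resolve().then(fn) also turns a synchronous throw into a rejection
    names.map((name) => withTimeout(Promise.resolve().then(checks[name]), ms))
  )
  const report = {}
  names.forEach((name, i) => { report[name] = results[i].status === 'fulfilled' })
  report.status = names.every((name) => report[name]) ? 'ok' : 'degraded'
  return report
}

// Possible use inside the /readyz handler (db and redisClient as in the code above):
// const report = await runChecks({
//   database: () => db.query('SELECT 1'),
//   redis: () => redisClient.ping()
// })
// res.status(report.status === 'ok' ? 200 : 503).json(report)
```

Running the checks in parallel also keeps the probe's worst-case latency at one timeout, not the sum of all of them.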

Alerting - PagerDuty, Slack, Grafana Alerts¶

Metrics without alerts are just graphs - you need automated notifications

# Prometheus alert rules
groups:
  - name: backend-alerts
    rules:
      - alert: HighErrorRate
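        # sum() both sides first: without it the division matches series label-by-label,
        # so each 5xx series is divided by itself and the ratio is always 1
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05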
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 5 minutes"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        # Aggregate by le to get one service-wide p95, not one per method/route/status
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency > 2 seconds"

      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
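The `for:` clause is what keeps a single bad evaluation from paging anyone: a rule whose expression becomes true is first pending, and it only fires once the expression has stayed true for the whole duration. A rough model of that behavior in plain JavaScript; `createAlertState` is an illustrative name, and this simplifies what Prometheus does internally:

```javascript
// forMs: how long the condition must hold before the alert fires
function createAlertState(forMs) {
  let activeSince = null   // time of the first evaluation in the current "true" streak

  return function evaluate(conditionTrue, nowMs) {
    if (!conditionTrue) {
      activeSince = null   // any false evaluation resets the streak
      return 'inactive'
    }
    if (activeSince === null) activeSince = nowMs
    return nowMs - activeSince >= forMs ? 'firing' : 'pending'
  }
}

const MINUTE = 60 * 1000
const highErrorRate = createAlertState(5 * MINUTE)

console.log(highErrorRate(true, 0 * MINUTE))  // pending
console.log(highErrorRate(true, 4 * MINUTE))  // pending
console.log(highErrorRate(true, 5 * MINUTE))  // firing
console.log(highErrorRate(false, 6 * MINUTE)) // inactive
console.log(highErrorRate(true, 7 * MINUTE))  // pending
```

A longer `for:` means fewer false alarms but a later page, which is why the critical `InstanceDown` rule above waits only one minute.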

Dashboards - Grafana Example¶

Create a Grafana dashboard with:

* RPS panel - requests per second (rate of http_requests_total)
* Error rate panel - % of 5xx errors
* Latency panel - p50, p95, p99 lines
* Active users panel - unique users over time
* Database panel - query time, pool utilization
* System panel - CPU, memory, Node event loop lag
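Event loop lag is the one item in that list Node has to measure for itself. prom-client's default metrics should already include event loop lag gauges, so the sketch below is only for cases where a custom reading is wanted. It uses `monitorEventLoopDelay` from Node's built-in `perf_hooks` module; the `startLagMonitor` and `report` names and the 5-second interval are assumptions:

```javascript
const { monitorEventLoopDelay } = require('perf_hooks')

// The histogram reports nanoseconds; convert to milliseconds for dashboards
function lagSnapshotMs(histogram) {
  const toMs = (ns) => ns / 1e6
  return {
    mean: toMs(histogram.mean),
    p99: toMs(histogram.percentile(99)),
    max: toMs(histogram.max)
  }
}

// Calls report({ mean, p99, max }) every intervalMs; returns a stop function
function startLagMonitor(report, intervalMs = 5000) {
  const histogram = monitorEventLoopDelay({ resolution: 20 })
  histogram.enable()
  const timer = setInterval(() => {
    report(lagSnapshotMs(histogram))
    histogram.reset()      // start a fresh window for the next report
  }, intervalMs)
  timer.unref()            // do not keep the process alive just for monitoring
  return () => {
    clearInterval(timer)
    histogram.disable()
  }
}

// Example: log the readings (or feed them into a prom-client Gauge)
// const stop = startLagMonitor((lag) => console.log(lag))
```

A p99 that climbs while CPU stays flat usually points at synchronous work blocking the loop instead of at overall load.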

next => devops_11_kubernetes