Monitoring and Logging

You don't know your app is down until a user tells you. That's not monitoring, that's ignorance. Real monitoring tells you before users notice, pinpoints the failing component, and ideally routes the alert to someone who can fix it before it becomes an incident. Logging without alerting is just a story you tell yourself after the fire.

Application Monitoring with Prometheus

Prometheus scrapes a metrics endpoint exposed by your Node app and stores the results as time-series data.

# Install prom-client
npm install prom-client
// metrics.js - set up Prometheus metrics
const prometheus = require('prom-client')

// Create registry
const registry = new prometheus.Registry()
prometheus.collectDefaultMetrics({ register: registry })

// HTTP request duration histogram
// registers: [registry] attaches each metric to the custom registry;
// without it the metric lands in the default global registry and
// registry.metrics() would not expose it
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [registry]
})

// Active requests gauge
const activeRequests = new prometheus.Gauge({
  name: 'http_requests_active',
  help: 'Number of currently active HTTP requests',
  registers: [registry]
})

// Error counter
const errorCounter = new prometheus.Counter({
  name: 'http_errors_total',
  help: 'Total number of HTTP errors',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry]
})

module.exports = {
  registry,
  httpRequestDuration,
  activeRequests,
  errorCounter
}
// app.js - instrument your Express app
const express = require('express')
const { registry, httpRequestDuration, activeRequests, errorCounter } = require('./metrics')

const app = express()

// Track request duration and active count
app.use((req, res, next) => {
  activeRequests.inc()
  const end = httpRequestDuration.startTimer()

  res.on('finish', () => {
    // Label with the matched route pattern, not the raw URL, to keep label cardinality low
    const labels = { method: req.method, route: req.route?.path || 'unknown', status_code: res.statusCode }
    end(labels)
    activeRequests.dec()

    if (res.statusCode >= 400) {
      errorCounter.inc(labels)
    }
  })
  next()
})

// Metrics endpoint - Prometheus scrapes this
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType)
  res.end(await registry.metrics())
})

// Health check - liveness probe
app.get('/healthz', (req, res) => {
  res.json({ status: 'ok', timestamp: Date.now() })
})

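// Readiness check - is app ready to serve traffic?
// Assumes db.ping() and redis.ping() return promises that reject on failure
app.get('/readyz', async (req, res) => {
  // Await each ping: an un-awaited promise is always truthy
  const [dbHealthy, redisHealthy] = await Promise.all([
    db.ping().then(() => true, () => false),
    redis.ping().then(() => true, () => false)
  ])
  if (dbHealthy && redisHealthy) {
    return res.json({ status: 'ready', checks: { db: dbHealthy, redis: redisHealthy } })
  }
  res.status(503).json({ status: 'not ready', checks: { db: dbHealthy, redis: redisHealthy } })
})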
# Prometheus configuration - scrape your app
# prometheus.yml
scrape_configs:
  - job_name: 'node_app'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

Key metrics to track:
- Request rate: requests per second, by route and method
- Error rate: 4xx and 5xx responses, as a percentage of total
- Latency percentiles: p50, p95, p99 response times
- Active connections: gauge of concurrent requests
- Event loop lag: how long the event loop is blocked
- Garbage collection: GC pauses and frequency
- Memory usage: heap used, heap total, external memory
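
The latency percentiles in the list above are not stored directly; Prometheus estimates them from the histogram's cumulative bucket counts. The sketch below is a simplified illustration of that interpolation, not Prometheus's actual implementation. `estimateQuantile` and the sample bucket counts are made up for this example, and `q` is assumed to be between 0 and 1.

```javascript
// Simplified quantile estimate from cumulative histogram buckets.
// `buckets` must be sorted by `le` (upper bound) and end with le = Infinity,
// mirroring the shape of http_request_duration_seconds_bucket.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN

  const rank = q * total
  // First bucket whose cumulative count reaches the target rank
  const i = buckets.findIndex(b => b.count >= rank)

  // Rank falls in the +Inf bucket: the best available answer is the
  // highest finite upper bound
  if (buckets[i].le === Infinity) {
    return i > 0 ? buckets[i - 1].le : NaN
  }

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count
  const inBucket = buckets[i].count - lowerCount
  if (inBucket === 0) return lowerBound

  // Assume observations are spread evenly inside the bucket
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket)
}

// Hypothetical sample: 100 requests, cumulative counts per bucket
const sample = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 }
]

console.log('p50 ~', estimateQuantile(0.5, sample))
console.log('p95 ~', estimateQuantile(0.95, sample))
```

Because the estimate interpolates inside a bucket, its accuracy depends on where the bucket boundaries sit; it helps to place boundaries close to the latency thresholds you intend to alert on.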

Health Check Endpoints

Kubernetes and Docker rely on health checks to decide if your app is alive.

// liveness - is the process running?
// Kubernetes: if this fails, the kubelet restarts the container
app.get('/healthz', (req, res) => {
  res.json({ status: 'ok' })
})

// readiness - is the app ready to serve traffic?
// Kubernetes: if this fails, the pod is removed from service endpoints
app.get('/readyz', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis()
  }

  // uptime is informational, so it stays out of `checks`: it has no
  // `status` field and would make the every() test below always fail
  const uptime = process.uptime()
  const allHealthy = Object.values(checks).every(c => c.status === 'ok')
  if (!allHealthy) {
    res.status(503).json({ status: 'degraded', checks, uptime })
    return
  }
  res.json({ status: 'ready', checks, uptime })
})

async function checkDatabase() {
  try {
    await db.raw('SELECT 1')
    return { status: 'ok' }
  } catch (err) {
    return { status: 'error', message: 'cannot connect to database' }
  }
}

async function checkRedis() {
  try {
    await redis.ping()
    return { status: 'ok' }
  } catch (err) {
    return { status: 'error', message: 'cannot connect to redis' }
  }
}

Health checks should be:
- Fast: time out in under 5 seconds
- Lightweight: don't run complex queries, just ping
- Separate from authenticated endpoints: monitoring systems won't have auth tokens
- Distinct: liveness and readiness are different things

Structured Logging with Winston/Pino

console.log is not logging; it's unstructured noise.
Structured logging emits JSON objects that log aggregators can parse, index, and alert on.

// logger.js - Winston with structured JSON output
const winston = require('winston')
const { combine, timestamp, json, errors, printf } = winston.format

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: combine(
    errors({ stack: true }),
    timestamp(),
    json()
  ),
  defaultMeta: { service: 'myapp-api', environment: process.env.NODE_ENV },
  transports: [
    new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
    new winston.transports.File({ filename: 'logs/combined.log' })
  ]
})

// In development, also log a human-readable line to the console
if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: combine(
      printf(({ level, message, timestamp, ...meta }) => {
        return `${timestamp} [${level}]: ${message} ${Object.keys(meta).length ? JSON.stringify(meta) : ''}`
      })
    )
  }))
}

module.exports = logger
// Pino - faster than Winston, less overhead per log line
const pino = require('pino')

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label })
  },
  redact: ['req.headers.authorization', 'req.body.password', 'req.body.token'],
  transport: process.env.NODE_ENV !== 'production'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined
})

// Express middleware for pino - auto-logs every request
const pinoHttp = require('pino-http')({ logger })
app.use(pinoHttp)

// Usage:
// logger.info({ userId: 123 }, 'user action')
// logger.error({ err }, 'operation failed')

What to log:
- Request method, path, status code, duration
- User ID for authenticated requests (for audit trails)
- Error stack traces (at error level only)
- Database query times (at debug level)
- Third-party API response times

What NEVER to log:
- Passwords, tokens, API keys (redact them)
- Full request bodies that might contain PII
- Database connection strings
- Session tokens or JWT values

// Express middleware - redact sensitive fields
app.use((req, res, next) => {
  const safeBody = { ...req.body }
  delete safeBody.password
  delete safeBody.token
  delete safeBody.secret
  logger.info({ method: req.method, path: req.path, body: safeBody }, 'incoming request')
  next()
})
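
Deleting a fixed set of top-level keys misses secrets nested deeper in the body. One possible approach, sketched below, walks the object and masks any key on a deny list. `redactDeep` and `SENSITIVE_KEYS` are hypothetical names, and the key list is an assumption you would adapt to your own payloads. The sketch assumes plain JSON-like input without circular references.

```javascript
// Keys to mask, compared case-insensitively (assumed list, adjust as needed)
const SENSITIVE_KEYS = new Set(['password', 'token', 'secret', 'authorization', 'apikey'])

// Return a copy of `value` with sensitive keys masked at any depth.
// The input is never mutated.
function redactDeep(value) {
  if (Array.isArray(value)) return value.map(redactDeep)
  if (value === null || typeof value !== 'object') return value

  const out = {}
  for (const [key, val] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redactDeep(val)
  }
  return out
}

// Made-up request body for illustration
const body = {
  email: 'a@example.com',
  password: 'hunter2',
  profile: { name: 'A', apiKey: 'abc123' },
  devices: [{ id: 1, token: 'xyz' }]
}
console.log(JSON.stringify(redactDeep(body)))
```

This still logs non-sensitive fields verbatim, so for bodies that may contain PII an allow list of known-safe fields is the more conservative choice.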

APM Tools - New Relic, Datadog, Sentry

Application Performance Monitoring (APM) catches what Prometheus metrics miss: slow database queries, memory leaks, and transaction traces.

# New Relic
npm install newrelic

// newrelic.js
exports.config = {
  app_name: ['MyApp'],
  license_key: process.env.NEW_RELIC_LICENSE_KEY,
  logging: { level: 'info' },
  allow_all_headers: true,
  attributes: { exclude: ['request.headers.cookie', 'request.headers.authorization'] }
}

// In your app - require at the very top, before anything else
require('newrelic')

# Sentry - error tracking with context
npm install @sentry/node @sentry/profiling-node

const Sentry = require('@sentry/node')
const { nodeProfilingIntegration } = require('@sentry/profiling-node')

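Sentry.init({
  dsn: process.env.SENTRY_DSN,
  integrations: [nodeProfilingIntegration()],
  // 1.0 traces every request; lower this in production to limit overhead
  tracesSampleRate: 1.0,
  // Applied on top of tracesSampleRate: only traced requests can be profiled
  profilesSampleRate: 1.0,
  environment: process.env.NODE_ENV
})

// Sentry.Handlers is the v7 SDK API; it was removed in v8
// The request handler must be the first middleware on the app
app.use(Sentry.Handlers.requestHandler())
app.use(Sentry.Handlers.tracingHandler())

// Routes here...

// The error handler goes after all routes but before any other error middleware
app.use(Sentry.Handlers.errorHandler())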

APM tools add overhead, about 2-5% CPU depending on sampling rate.
That is worth it for catching the 0.1% of slow requests that break your p99. Disable them in development.
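
Since overhead scales with the sampling rate, one option is to pick the rate per request rather than tracing everything. The sketch below is a plain function with made-up rates and paths. Sentry's `tracesSampler` option accepts a function that returns a rate, but how the path is extracted from its sampling context depends on the SDK version, so that wiring is left out here.

```javascript
// Decide what fraction of requests to trace.
// `chooseSampleRate`, the rates, and the paths are illustrative
// assumptions, not recommendations.
function chooseSampleRate({ environment, path }) {
  // Probes and scrapes are frequent and uninteresting: never trace them
  if (path === '/healthz' || path === '/readyz' || path === '/metrics') return 0
  // Tracing is switched off outside production
  if (environment !== 'production') return 0
  // Trace more of the login path than ordinary traffic
  if (path.startsWith('/api/login')) return 0.5
  return 0.1
}

console.log(chooseSampleRate({ environment: 'production', path: '/api/users' }))
console.log(chooseSampleRate({ environment: 'development', path: '/api/users' }))
```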

Uptime Monitoring

Health checks are only useful if something checks them.

# External monitoring with curl
while true; do
  # --max-time keeps a hung server from stalling the loop; curl reports 000 when no response arrives
  status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" https://api.myapp.com/healthz)
  if [ "$status" != "200" ]; then
    echo "Health check failed: HTTP $status"
    # Send alert
    curl -s -X POST -H "Content-Type: application/json" \
      -d '{"text": "Health check failed for api.myapp.com"}' \
      "$SLACK_WEBHOOK_URL"
  fi
  sleep 30
done
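
The loop above posts to Slack on every failed check, which means a message every 30 seconds for the length of an outage. A common refinement is to alert only on state changes, after a few consecutive failures. The following sketch shows that logic in isolation; `createAlertState` and the threshold of 3 are assumptions for illustration.

```javascript
// Track consecutive failures and report only transitions:
// 'down' once the threshold is reached, 'recovered' on the first success
// after that, and null when there is nothing new to announce.
function createAlertState(threshold = 3) {
  let failures = 0
  let alerting = false

  return function record(ok) {
    if (ok) {
      failures = 0
      if (alerting) {
        alerting = false
        return 'recovered'
      }
      return null
    }
    failures += 1
    if (!alerting && failures >= threshold) {
      alerting = true
      return 'down'
    }
    return null
  }
}

// Feed in a made-up sequence of check results
const record = createAlertState(3)
const transitions = [true, false, false, false, false, true].map(ok => record(ok))
console.log(transitions)
```

Only the 'down' and 'recovered' transitions would trigger a webhook call, so a long outage produces two messages instead of one per check.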

External monitoring services:
- Pingdom: checks every minute from 3 locations
- UptimeRobot: free tier checks every 5 minutes
- Better Uptime: status pages with incident management
- Checkly: Playwright-based monitoring, tests real browser flows

External monitoring catches ISP outages, DNS failures, and CDN problems. Internal health checks can't detect issues between your server and the internet.

Alerting

Metrics without alerting are data hoarding.

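# Prometheus alerting rules - evaluated by Prometheus, which sends firing alerts to Alertmanager
# alertmanager-rules.yml
groups:
  - name: node_app
    rules:
      - alert: HighErrorRate
        # The denominator uses the histogram's _count series, since metrics.js
        # defines no separate http_requests_total counter
        expr: |
          sum(rate(http_errors_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m])) > 0.05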
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 5 minutes"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
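        # sum by (le) aggregates all routes into one service-wide p95;
        # without it the rule evaluates each label combination separately
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2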
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency > 2 seconds"
          description: "p95 latency is {{ $value }} seconds"

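      - alert: HighMemoryUsage
        # Resident / virtual memory is not a usage percentage, so compare RSS
        # against a known limit instead. The 512 MiB limit is an assumed
        # example value; substitute your container or host limit
        expr: |
          process_resident_memory_bytes > 0.8 * 512 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage > 80% of limit for 10 minutes"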

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
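
The `HighErrorRate` expression divides the per-second rate of error responses by the per-second rate of all requests. As a worked illustration with made-up counter samples, and ignoring the extrapolation and counter-reset handling that Prometheus's `rate()` performs, the arithmetic looks roughly like this (`perSecondRate` is a hypothetical helper):

```javascript
// Per-second increase of a counter between two samples.
// Simplified: assumes the counter did not reset within the window.
function perSecondRate(first, last) {
  const seconds = (last.t - first.t) / 1000
  return (last.value - first.value) / seconds
}

// Hypothetical samples taken 5 minutes apart (t in milliseconds)
const errors = [{ t: 0, value: 120 }, { t: 300000, value: 180 }]
const requests = [{ t: 0, value: 9000 }, { t: 300000, value: 9600 }]

const errorRate = perSecondRate(errors[0], errors[1])       // 60 errors over 300 s
const requestRate = perSecondRate(requests[0], requests[1]) // 600 requests over 300 s
const ratio = errorRate / requestRate                       // share of requests that failed

console.log(ratio > 0.05 ? 'condition met' : 'condition not met')
```

With these numbers one request in ten failed, so the 5% condition holds. The alert itself would fire only if the condition stayed true for the whole `for: 5m` period.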

Route alerts to the right channel: critical goes to PagerDuty/OpsGenie, warnings to Slack/Teams, info to email.

# alertmanager.yml
route:
  receiver: 'slack-warnings'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

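# Alertmanager does not expand environment variables in this file;
# $PAGERDUTY_KEY and $SLACK_WEBHOOK are placeholders to substitute at deploy time
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: $PAGERDUTY_KEY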

  - name: 'slack-warnings'
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

Security Monitoring

Watch for attackers before they escalate from probe to breach.

// Log authentication failures
app.post('/api/login', async (req, res) => {
  const user = await User.findOne({ email: req.body.email })
  // bcrypt.compare is async, so hashing does not block the event loop
  if (!user || !(await bcrypt.compare(req.body.password, user.password))) {
    logger.warn({
      event: 'auth_failed',
      email: req.body.email,
      ip: req.ip,
      userAgent: req.get('User-Agent'),
      timestamp: Date.now()
    }, 'failed login attempt')

    // Track consecutive failures for this IP; INCR returns the updated count
    const attempts = await redis.incr(`auth_fail:${req.ip}`)
    await redis.expire(`auth_fail:${req.ip}`, 900)  // 15 min window, refreshed on each failure

    if (attempts > 20) {
      logger.error({
        event: 'brute_force_detected',
        ip: req.ip,
        attempts
      }, 'possible brute force attack')
      // Alert: block IP or notify security team
    }

    return res.status(401).json({ error: 'invalid credentials' })
  }

  // Reset failure count on success
  await redis.del(`auth_fail:${req.ip}`)
  logger.info({ event: 'auth_success', email: req.body.email }, 'successful login')
  // Issue the session or token here, then respond so the request does not hang
  res.json({ status: 'ok' })
})

Security events to monitor:
- Failed login attempts: threshold-based alerts for brute force
- Unauthorized access attempts: 403 responses to protected routes
- Suspicious payload patterns: SQL injection attempts, XSS probes
- API key usage spikes: possible credential stuffing
- Rate limit exceeded: possible DDoS or misconfigured client

# Prometheus alert for brute force
- alert: BruteForceDetected
  expr: |
    rate(http_errors_total{route="/api/login",status_code="401"}[5m]) > 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Possible brute force attack on login endpoint"
    description: "401 errors on /api/login: {{ $value }} per second"

next -> adv_01_child_process.md