Monitoring & Observability
Logging everything with console.log and praying you find the bug in 10k lines of mixed text isn't monitoring; it's a cry for help. I've seen production apps running exactly like this, because teams don't invest in observability until after the first post-mortem. Observability means you can understand what your system is doing without deploying new code. It rests on three pillars: logs, metrics, and traces.
Structured Logging: JSON or GTFO
// Bad: 2007 called, they want their debugging back
console.log(`User ${userId} logged in from ${ip}`)
// grep "logged in from" /var/log/app.log | awk '{print $3}'
// This is why parsing logs is a nightmare

// Good: structured JSON
const logger = require('pino')() // Faster than Winston and Bunyan
logger.info({
  event: 'user.login',
  userId: userId,
  ip: ip,
  userAgent: req.headers['user-agent'],
  duration: Date.now() - startTime
}, 'User login successful')

// Log output (abridged): parseable by any log aggregation tool
{"level":30,"time":1712345678901,"pid":1234,"hostname":"web-1","event":"user.login","userId":"abc123","ip":"203.0.113.42","duration":42}
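One payoff of JSON logs is that a query becomes "parse, then filter" instead of a regex. The following is a minimal sketch, assuming one JSON object per line as in the output above. `parseLogLines` and `slowLogins` are hypothetical helpers written for illustration; they are not part of pino.

```javascript
// Parse newline-delimited JSON logs, skipping lines that are not valid JSON
// (for example a stray stack trace printed by a dependency).
function parseLogLines(text) {
  const entries = []
  for (const line of text.split('\n')) {
    if (!line.trim()) continue
    try {
      entries.push(JSON.parse(line))
    } catch {
      // Not JSON: ignore the line rather than crash the query
    }
  }
  return entries
}

// Hypothetical query: login events slower than a threshold (milliseconds)
function slowLogins(entries, thresholdMs) {
  return entries.filter(
    (e) => e.event === 'user.login' && e.duration > thresholdMs
  )
}

const sample = [
  '{"level":30,"event":"user.login","userId":"abc123","duration":42}',
  '{"level":30,"event":"user.login","userId":"def456","duration":950}',
  'plain text line that would break a naive parser',
  '{"level":50,"event":"db.error","duration":12}'
].join('\n')

const entries = parseLogLines(sample)
console.log(slowLogins(entries, 500).map((e) => e.userId)) // [ 'def456' ]
```

In practice a log aggregation tool runs this kind of filter for you; the point is that every field is addressable by name.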
Pino setup for production:
const pino = require('pino')

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label }
    }
  },
  // Redact sensitive fields
  redact: {
    paths: ['req.headers.authorization', 'req.headers.cookie', 'password', 'token'],
    censor: '[REDACTED]'
  },
  serializers: {
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
    err: pino.stdSerializers.err
  },
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' } // Pretty print in dev
    : undefined // JSON in production
})

// Express middleware
app.use(require('pino-http')({ logger }))
What to log:
* Request metadata (method, path, status, duration, user ID)
* Errors with full context (stack trace, request ID, relevant state)
* Business events (user signup, purchase, data export)
* External service calls (database queries, API requests), with timing and status

What NOT to log:
* Passwords, tokens, API keys
* Personal data (emails, credit cards, SSNs)
* Request/response bodies larger than 10KB
* Binary data or base64-encoded files
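The size and binary rules above are easy to enforce before a payload ever reaches the logger. Here is a minimal sketch, assuming the 10KB limit from the list and a JSON-serializable body. `summarizeBody` and `MAX_LOGGED_BODY_BYTES` are hypothetical names, and this does not replace field-level redaction of secrets.

```javascript
const MAX_LOGGED_BODY_BYTES = 10 * 1024 // the 10KB limit from the list above

// Return something safe to attach to a log entry: the body itself when it is
// small, otherwise only its size so the log line stays bounded.
function summarizeBody(body) {
  if (body === undefined || body === null) {
    return { omitted: true, bytes: 0 }
  }
  if (Buffer.isBuffer(body)) {
    // Binary payloads are never logged, whatever their size
    return { omitted: true, bytes: body.length, reason: 'binary' }
  }
  // Assumes the body is a string or JSON-serializable (no circular references)
  const text = typeof body === 'string' ? body : JSON.stringify(body) ?? ''
  const bytes = Buffer.byteLength(text, 'utf8')
  if (bytes > MAX_LOGGED_BODY_BYTES) {
    return { omitted: true, bytes, reason: 'too_large' }
  }
  return { omitted: false, bytes, body }
}

console.log(summarizeBody({ a: 1 }).bytes)              // 7
console.log(summarizeBody('x'.repeat(20000)).reason)    // too_large
console.log(summarizeBody(Buffer.from([1, 2, 3])).reason) // binary
```

A call such as `logger.info({ body: summarizeBody(req.body) }, 'request received')` would then log either the small body or just its size.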
Metrics: Prometheus + Grafana
Metrics are numbers over time: request rates, error rates, latency percentiles.
const prometheus = require('prom-client')

// Create a registry
const register = new prometheus.Registry()
prometheus.collectDefaultMetrics({ register })

// Custom metrics. Pass `registers` so they land in the registry that /metrics
// serves; without it they go to prom-client's global default registry and
// never show up in the scrape.
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10], // In seconds
  registers: [register]
})
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
})

// Middleware to record metrics
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    const labels = {
      method: req.method,
      // The route pattern (/users/:id) keeps label cardinality low; the raw
      // path is only a fallback for requests that matched no route
      route: req.route?.path || req.path,
      status_code: res.statusCode
    }
    httpRequestsTotal.inc(labels)
    end(labels)
  })
  next()
})

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
Prometheus config (prometheus.yml):
scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
        labels:
          service: 'backend'
          environment: 'production'
Key metrics to track:
* Request rate (requests/second): are you doing more work?
* Error rate (% of 5xx responses): is the app healthy?
* Latency (p50, p95, p99 response time): is it fast enough?
* Active users: how many are using the system?
* Database query time: is the DB the bottleneck?
* Memory usage: memory leak detection
* CPU usage: compute-bound or not?
* GC pressure (Node.js): garbage collector frequency and duration
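To see why the list asks for p50, p95, and p99 rather than an average, here is a minimal sketch of the nearest-rank percentile over raw samples. The sample durations are made up for illustration, and Prometheus itself estimates percentiles from histogram buckets rather than from raw samples, so treat this as an explanation of the concept only.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length))
  return sorted[rank - 1]
}

// Hypothetical response times in milliseconds: mostly fast, one slow outlier
const durationsMs = [12, 15, 11, 14, 13, 16, 12, 15, 14, 900]

const mean = durationsMs.reduce((sum, d) => sum + d, 0) / durationsMs.length
console.log(mean)                        // 102.2
console.log(percentile(durationsMs, 50)) // 14
console.log(percentile(durationsMs, 99)) // 900
```

The mean (102.2 ms) describes no real request: the typical one took 14 ms and the worst took 900 ms. Percentiles keep those two facts separate.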
Tracing: OpenTelemetry
Traces follow a single request through multiple services, so you can see where the time goes.
// Load this file before requiring express or http so the modules can be patched.
// Note: this uses the 1.x SDK API (new Resource, addSpanProcessor); later major
// versions changed it, so check the version you install.
const { SpanStatusCode } = require('@opentelemetry/api')
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node')
const { Resource } = require('@opentelemetry/resources')
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions')
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express')
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')

const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'backend-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0'
  })
})
// SimpleSpanProcessor exports each span as it ends; BatchSpanProcessor (same
// package) is usually the better fit for production traffic
provider.addSpanProcessor(new SimpleSpanProcessor(
  new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  })
))
provider.register()

// Auto-instrument Express and HTTP
require('@opentelemetry/instrumentation').registerInstrumentations({
  instrumentations: [
    new ExpressInstrumentation(),
    new HttpInstrumentation()
  ]
})

// Manual tracing for specific operations
const tracer = provider.getTracer('backend-api')
app.post('/api/orders', async (req, res, next) => {
  // startActiveSpan makes this span the parent of every span created inside
  await tracer.startActiveSpan('create-order', async (span) => {
    try {
      const user = await getUser(req.userId) // Auto-traced
      const order = await tracer.startActiveSpan('db.insert-order', async (dbSpan) => {
        try {
          const result = await db.query('INSERT INTO orders ...')
          dbSpan.setAttribute('order.id', result.id)
          return result
        } finally {
          dbSpan.end() // A span is only exported once it has ended
        }
      })
      await sendEmail(order) // Auto-traced
      span.setStatus({ code: SpanStatusCode.OK })
      res.status(201).json(order)
    } catch (error) {
      span.recordException(error)
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
      next(error) // Hand off to the Express error handler
    } finally {
      span.end()
    }
  })
})
Tracing tools:
* Jaeger: open-source trace visualization
* Zipkin: alternative to Jaeger
* Grafana Tempo: Grafana's tracing backend
* Honeycomb: SaaS, expensive but powerful
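A trace can only follow a request across services if each outgoing call carries the trace ID along. OpenTelemetry's HTTP instrumentation does this for you, by default through the W3C Trace Context `traceparent` header. The sketch below shows the idea with the standard library only. `parseTraceparent` and `childTraceparent` are hypothetical helpers, and the handling is simplified (it does not cover `tracestate` or version negotiation).

```javascript
const { randomBytes } = require('node:crypto')

// traceparent layout: version-traceId-parentId-flags, all lowercase hex
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '')
  if (!match) return null
  const [, version, traceId, parentId, flags] = match
  // All-zero IDs are invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 }
}

// Header for an outgoing call: keep the trace ID, mint a new span ID.
// Assumption: a request with no valid incoming header starts a sampled trace.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader)
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex')
  const spanId = randomBytes(8).toString('hex')
  const flags = parent && !parent.sampled ? '00' : '01'
  return `00-${traceId}-${spanId}-${flags}`
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
console.log(parseTraceparent(incoming).traceId) // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(childTraceparent(incoming)) // same trace ID, fresh random span ID
```

Because the trace ID survives every hop, the tracing backend can stitch spans from different services into one timeline.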
Health Checks: /healthz, /readyz, /metrics
Load balancers and orchestrators need to know when your app is actually ready.
// Liveness: is the process alive?
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'ok' })
})

// Readiness: is the app ready to serve traffic?
app.get('/readyz', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
    status: 'not_ready'
  }
  // Check database
  try {
    await db.query('SELECT 1')
    checks.database = true
  } catch (err) {
    // Swallow the error: the probe reports failure instead of throwing
  }
  // Check Redis
  try {
    await redisClient.ping()
    checks.redis = true
  } catch (err) {
    // Swallow the error here as well
  }
  checks.status = checks.database && checks.redis ? 'ok' : 'degraded'
  const statusCode = checks.status === 'ok' ? 200 : 503
  res.status(statusCode).json(checks)
})

// Startup: has the app initialized?
// Express does not emit 'ready' on its own; call app.emit('ready') once your
// own initialization has finished.
let startupComplete = false
app.on('ready', () => { startupComplete = true })
app.get('/startupz', (req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'ok' })
  } else {
    res.status(503).json({ status: 'starting' })
  }
})
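One thing the readiness handler above does not cover is a dependency that hangs instead of failing: the probe would then hang too. A minimal sketch of one way to bound each check with a timeout and run the checks in parallel follows. `withTimeout` and `runChecks` are hypothetical helpers, and the two check functions are stand-ins for `db.query('SELECT 1')` and `redisClient.ping()`.

```javascript
// Reject if the wrapped promise has not settled within `ms` milliseconds
function withTimeout(promise, ms, label) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// Run every check in parallel; a check passes only if it resolves in time
async function runChecks(checks, timeoutMs = 1000) {
  const names = Object.keys(checks)
  const settled = await Promise.allSettled(
    names.map((name) =>
      // Promise.resolve().then(...) also turns a synchronous throw into a rejection
      withTimeout(Promise.resolve().then(checks[name]), timeoutMs, name)
    )
  )
  const results = {}
  names.forEach((name, i) => {
    results[name] = settled[i].status === 'fulfilled'
  })
  results.status = Object.values(results).every(Boolean) ? 'ok' : 'degraded'
  return results
}

// Usage with stand-in checks: one healthy, one that never answers
runChecks({
  database: () => Promise.resolve(),
  redis: () => new Promise(() => {}) // never settles, simulates a hung connection
}, 50).then((result) => console.log(result))
// { database: true, redis: false, status: 'degraded' }
```

Inside `/readyz` the result maps directly onto the existing response: 200 when `status` is `'ok'`, 503 otherwise.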
Alerting: PagerDuty, Slack, Grafana Alerts
Metrics without alerts are just graphs; you need automated notifications.
# Prometheus alert rules
groups:
  - name: backend-alerts
    rules:
      - alert: HighErrorRate
        # sum() on both sides: without it the division matches series label by
        # label, so each 5xx series is divided by itself and the ratio is always 1
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 5 minutes"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        # Aggregate buckets across routes and methods, keeping only the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency > 2 seconds"
      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
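The error-rate rule is a ratio of two counter rates. As a rough illustration of the arithmetic, here is a simplified sketch that works from two samples of each counter. The counter values are made up, and the real `rate()` function also handles counter resets and extrapolation, which this deliberately ignores.

```javascript
// Per-second rate of a monotonically increasing counter between two samples.
// Simplified: ignores counter resets and the extrapolation Prometheus applies.
function ratePerSecond(first, last) {
  const seconds = (last.timestampMs - first.timestampMs) / 1000
  if (seconds <= 0) return 0
  return (last.value - first.value) / seconds
}

// Share of requests that failed over the window covered by the samples
function errorRatio(totalSamples, errorSamples) {
  const totalRate = ratePerSecond(totalSamples[0], totalSamples[totalSamples.length - 1])
  const errorRate = ratePerSecond(errorSamples[0], errorSamples[errorSamples.length - 1])
  if (totalRate === 0) return 0 // no traffic, nothing to alert on
  return errorRate / totalRate
}

// Hypothetical counter values scraped five minutes (300 000 ms) apart
const total = [{ timestampMs: 0, value: 10000 }, { timestampMs: 300000, value: 13000 }]
const errors = [{ timestampMs: 0, value: 40 }, { timestampMs: 300000, value: 250 }]

const ratio = errorRatio(total, errors)
console.log((ratio * 100).toFixed(1) + '%') // 7.0%
console.log(ratio > 0.05)                   // true
```

With 210 errors out of 3000 requests the ratio is about 7%, above the 5% threshold; the alert would fire once that stays true for the whole `for: 5m` period.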
Dashboards: Grafana Example
Create a Grafana dashboard with:
* RPS panel: requests per second (rate of http_requests_total)
* Error rate panel: % of 5xx errors
* Latency panel: p50, p95, p99 lines
* Active users panel: unique users over time
* Database panel: query time, pool utilization
* System panel: CPU, memory, Node event loop lag
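For the system panel, Node can report memory use and event loop lag from its standard library. The sketch below shows one way to read both. `snapshotRuntime` is a hypothetical helper, and the 20 ms sampling resolution is an arbitrary choice for illustration.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks')

// Samples event loop delay in the background; values are in nanoseconds
const loopDelay = monitorEventLoopDelay({ resolution: 20 })
loopDelay.enable()

const nsToMs = (ns) => ns / 1e6

// One point-in-time reading, suitable for feeding into gauges
function snapshotRuntime() {
  const mem = process.memoryUsage()
  return {
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    loopLagMeanMs: nsToMs(loopDelay.mean), // NaN until the first sample arrives
    loopLagP99Ms: nsToMs(loopDelay.percentile(99)),
    loopLagMaxMs: nsToMs(loopDelay.max)
  }
}

console.log(snapshotRuntime())
```

A lag figure that climbs while CPU stays busy usually points at synchronous work blocking the event loop.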