Monitoring & Observability¶
Table of Contents¶
- Structured Logging - JSON or GTFO
- Metrics - Prometheus + Grafana
- Tracing - OpenTelemetry
- Health Checks - /healthz, /readyz, /metrics
- Alerting - PagerDuty, Slack, Grafana Alerts
- Dashboards - Grafana Example
flowchart LR
subgraph App["Your Application"]
AppCode["Node.js App
Express / Nest / Next"]
Logs["JSON Logs
pino / winston"]
MetricsApp["Metrics Endpoint
/metrics (prom-client)"]
TracesApp["Tracing SDK
@opentelemetry"]
end
subgraph Collection["Collection Layer"]
Prom["Prometheus
pulls metrics"]
Loki["Loki
stores logs"]
Tempo["Tempo
stores traces"]
OTEL["OpenTelemetry
Collector"]
end
subgraph Visualize["Visualization & Alerting"]
Grafana["Grafana
dashboards"]
Alerts["Alertmanager
PagerDuty / Slack"]
end
MetricsApp -->|"scrape /metrics"| Prom
Logs -->|"push"| Loki
TracesApp -->|"OTLP"| OTEL --> Tempo
Prom --> Grafana
Loki --> Grafana
Tempo --> Grafana
Prom --> Alerts
style AppCode fill:#3498db,color:#fff
style Logs fill:#f1c40f,color:#000
style MetricsApp fill:#2ecc71,color:#fff
style TracesApp fill:#9b59b6,color:#fff
style Prom fill:#e74c3c,color:#fff
style Loki fill:#e67e22,color:#fff
style Tempo fill:#9b59b6,color:#fff
style Grafana fill:#f1c40f,color:#000
style Alerts fill:#e74c3c,color:#fff
Logging everything with console.log and praying you find the bug in 10k lines of mixed text isn't monitoring - it's a cry for help. I've seen production apps running exactly like this, because teams don't invest in observability until after the first post-mortem.
Observability means you can understand what your system is doing without deploying new code. Three pillars: logs, metrics, traces.
Structured Logging - JSON or GTFO¶
// Bad - 2007 called, they want their debugging back
console.log(`User ${userId} logged in from ${ip}`)
// grep "logged in from" /var/log/app.log | awk '{print $3}'
// This is why parsing logs is a nightmare
// Good - structured JSON
const logger = require('pino')() // Faster than winston and bunyan
logger.info({
  event: 'user.login',
  userId,
  ip,
  userAgent: req.headers['user-agent'],
  duration: Date.now() - startTime // milliseconds
}, 'User login successful')
// Log output - one JSON object per line, parseable by any log aggregation tool
{"level":30,"time":1712345678901,"pid":1234,"hostname":"web-1","event":"user.login","userId":"abc123","ip":"203.0.113.42","userAgent":"Mozilla/5.0","duration":42,"msg":"User login successful"}
Pino setup for production:
const pino = require('pino')
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    // Emit the level as a label ("info") instead of a number (30)
    level(label) {
      return { level: label }
    }
  },
  // Redact sensitive fields
  redact: {
    paths: ['req.headers.authorization', 'req.headers.cookie', 'password', 'token'],
    censor: '[REDACTED]'
  },
  serializers: {
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
    err: pino.stdSerializers.err
  },
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' } // Pretty print in dev
    : undefined // JSON in production
})
// Express middleware - logs each request/response and attaches a logger as req.log
// (`app` is your Express application)
app.use(require('pino-http')({ logger }))
What to log:
* Request metadata (method, path, status, duration, user ID)
* Errors with full context (stack trace, request ID, relevant state)
* Business events (user signup, purchase, data export)
* External service calls (database queries, API requests - timing and status)
What NOT to log:
* Passwords, tokens, API keys
* Personal data (emails, credit cards, SSNs)
* Request/response bodies larger than 10KB
* Binary data or base64-encoded files
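The "what NOT to log" rules are easier to follow when they are enforced in code before a payload ever reaches the logger. Below is a minimal sketch in plain Node.js. `sanitizeForLog`, `bodyForLog`, `SENSITIVE_KEYS` and `MAX_BODY_BYTES` are hypothetical names made up for this example and are not part of pino; for known field paths, pino's own `redact` option already does the masking.

```javascript
// Hypothetical helpers (not a pino API): mask sensitive keys and drop oversized bodies.
const SENSITIVE_KEYS = new Set(['password', 'token', 'apikey', 'authorization', 'cookie'])
const MAX_BODY_BYTES = 10 * 1024 // the 10KB limit from the list above

// Recursively copy a value, replacing sensitive fields and binary data.
function sanitizeForLog(value) {
  if (Buffer.isBuffer(value)) return '[BINARY]'
  if (Array.isArray(value)) return value.map(sanitizeForLog)
  if (value === null || typeof value !== 'object') return value
  const out = {}
  for (const [key, val] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : sanitizeForLog(val)
  }
  return out
}

// Log only a size marker when the serialized body exceeds the limit.
function bodyForLog(body) {
  const bytes = Buffer.byteLength(JSON.stringify(body) ?? '')
  if (bytes > MAX_BODY_BYTES) return { omitted: true, bytes }
  return sanitizeForLog(body)
}
```

A call such as `logger.info({ body: bodyForLog(req.body) }, 'order received')` would then log either a masked copy of the body or just its size. Key matching here is by exact (case-insensitive) name, so extend `SENSITIVE_KEYS` to fit your own payloads.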
Metrics - Prometheus + Grafana¶
Metrics are numbers over time: request rates, error rates, latency percentiles.
const prometheus = require('prom-client')
// Create a registry
const register = new prometheus.Registry()
prometheus.collectDefaultMetrics({ register })
// Custom metrics - pass `registers` so they end up in the registry that /metrics serves
// (without it they go to prom-client's global default registry and never show up here)
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10], // In seconds
  registers: [register]
})
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
})
// Middleware to record metrics
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Use the route pattern (/users/:id), never the raw path, to keep label cardinality bounded
      route: req.route ? req.baseUrl + req.route.path : 'unmatched',
      status_code: res.statusCode
    }
    httpRequestsTotal.inc(labels)
    end(labels) // Observes the elapsed time in seconds
  })
  next()
})
// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
Prometheus config (prometheus.yml):
scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
        labels:
          service: 'backend'
          environment: 'production'
Key metrics to track:
* Request rate (requests/second) - are you doing more work?
* Error rate (% of 5xx responses) - is the app healthy?
* Latency (p50, p95, p99 response time) - is it fast enough?
* Active users - how many are using the system?
* Database query time - is the DB the bottleneck?
* Memory usage - memory leak detection
* CPU usage - compute-bound or not?
* GC pressure (Node.js) - garbage collector frequency and duration
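The latency percentiles in that list are estimates, not exact values: a histogram only stores how many requests fell under each bucket boundary. The sketch below is a simplified re-implementation of the linear interpolation that `histogram_quantile` performs, to show why bucket boundaries matter. It is not Prometheus's actual code, it assumes `0 < q <= 1` and non-negative bounds, and `histogramQuantile` and `sample` are names invented for this example.

```javascript
// buckets: cumulative counts sorted by upper bound `le`; the last bucket must be +Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN
  const rank = q * total
  // First bucket whose cumulative count reaches the requested rank
  const i = buckets.findIndex((b) => b.count >= rank)
  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count
  const inBucket = buckets[i].count - lowerCount
  // Assume observations are spread evenly inside the bucket
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket)
}

// Made-up sample: 100 requests, 50 under 0.1s, 90 under 0.5s, 98 under 1s
const sample = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 }
]
console.log(histogramQuantile(0.95, sample)) // lands inside the 0.5s-1s bucket
console.log(histogramQuantile(0.99, sample)) // lands in +Inf, so it is capped at 1
```

The p99 result shows the main pitfall: once the percentile lands in the `+Inf` bucket, the estimate is capped at the largest finite boundary, so the bucket list should extend past the latencies you actually care about.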
Tracing - OpenTelemetry¶
Traces follow a single request through multiple services - you see where the time goes
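// Load this setup before express/http are required, otherwise auto-instrumentation cannot patch them.
// Written against the 1.x OpenTelemetry JS SDK; later major versions changed parts of this setup API.
const { SpanStatusCode } = require('@opentelemetry/api')
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node')
const { Resource } = require('@opentelemetry/resources')
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions')
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
const { registerInstrumentations } = require('@opentelemetry/instrumentation')
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express')
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'backend-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0'
  })
})
// Batch spans before export; SimpleSpanProcessor sends one span at a time and is better suited to debugging
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces' // OTLP/HTTP endpoint of the collector
  })
))
provider.register()
// Auto-instrument Express and HTTP
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation()
  ]
})
// Manual tracing for specific operations
const tracer = provider.getTracer('backend-api')
app.post('/api/orders', (req, res, next) => {
  // startActiveSpan makes this span the parent of every span created inside the callback
  return tracer.startActiveSpan('create-order', async (span) => {
    try {
      const user = await getUser(req.userId) // Auto-traced if it goes through an instrumented client
      const order = await tracer.startActiveSpan('db.insert-order', async (dbSpan) => {
        try {
          const result = await db.query('INSERT INTO orders ...')
          dbSpan.setAttribute('order.id', result.id)
          return result
        } finally {
          dbSpan.end() // A span that is never ended is never exported
        }
      })
      await sendEmail(order) // Auto-traced if it goes through an instrumented client
      span.setStatus({ code: SpanStatusCode.OK })
      res.status(201).json(order)
    } catch (error) {
      span.recordException(error)
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
      next(error) // Hand off to the Express error handler instead of throwing from an async handler
    } finally {
      span.end()
    }
  })
})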
Tracing tools:
* Jaeger - open-source trace visualization
* Zipkin - alternative to Jaeger
* Grafana Tempo - Grafana's tracing backend
* Honeycomb - SaaS, expensive but powerful
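What lets a trace follow a request across services is a small piece of context passed along with each call. With the W3C Trace Context format that is the `traceparent` HTTP header: version, trace ID, parent span ID and flags. The sketch below parses and produces that header by hand using only Node's standard library. In a setup like the one above the OpenTelemetry instrumentation is expected to handle this for you, so treat it as an illustration of what travels over the wire; `parseTraceparent` and `childTraceparent` are names invented here, and sampling new traces by default is an assumption of the example.

```javascript
const { randomBytes } = require('node:crypto')

// traceparent: version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

// Returns { traceId, parentId, sampled } or null when the header is missing or invalid.
function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '')
  if (!match) return null
  const [, traceId, parentId, flags] = match
  // All-zero IDs are invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 }
}

// Build the header for an outgoing call: keep the trace ID, mint a new span ID.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader)
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex') // start a new trace
  const spanId = randomBytes(8).toString('hex')
  const flags = parent && !parent.sampled ? '00' : '01' // assumption: new traces are sampled
  return `00-${traceId}-${spanId}-${flags}`
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
console.log(parseTraceparent(incoming))
console.log(childTraceparent(incoming)) // same trace ID, fresh span ID
```

Every service that forwards the header with the same trace ID ends up in the same trace, which is what lets a backend such as Tempo or Jaeger stitch the spans together.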
Health Checks - /healthz, /readyz, /metrics¶
Load balancers and orchestrators need to know when your app is actually ready
// Liveness - is the process alive?
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'ok' })
})
// Readiness - is the app ready to serve traffic?
app.get('/readyz', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
    status: 'not_ready'
  }
  // Check database
  try {
    await db.query('SELECT 1')
    checks.database = true
  } catch (err) {
    // Swallow the error - the failure is reported through checks.database
  }
  // Check Redis
  try {
    await redisClient.ping()
    checks.redis = true
  } catch (err) {
    // Swallow the error - the failure is reported through checks.redis
  }
  checks.status = checks.database && checks.redis ? 'ok' : 'degraded'
  const statusCode = checks.status === 'ok' ? 200 : 503
  res.status(statusCode).json(checks)
})
// Startup - has the app initialized?
let startupComplete = false
// Express does not emit 'ready' on its own: call app.emit('ready') once initialization is done
app.on('ready', () => { startupComplete = true })
app.get('/startupz', (req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'ok' })
  } else {
    res.status(503).json({ status: 'starting' })
  }
})
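One thing the `/readyz` handler above does not guard against is a dependency that hangs instead of failing: the probe would then hang with it. A possible refinement, sketched here with the standard library only, runs the checks in parallel and gives each one a deadline. `withTimeout`, `runReadinessChecks` and the 1000 ms default are assumptions made for this example, not part of Express or of any client library.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// deps maps a name to a function returning a promise; a check passes if it resolves in time.
async function runReadinessChecks(deps, timeoutMs = 1000) {
  const names = Object.keys(deps)
  const results = await Promise.allSettled(
    // Promise.resolve().then(...) also turns a synchronous throw into a rejection
    names.map((name) => withTimeout(Promise.resolve().then(() => deps[name]()), timeoutMs))
  )
  const checks = {}
  results.forEach((result, i) => { checks[names[i]] = result.status === 'fulfilled' })
  const ok = Object.values(checks).every(Boolean)
  return { status: ok ? 'ok' : 'degraded', checks }
}

// Possible use inside the handler (db and redisClient as in the snippet above):
// const report = await runReadinessChecks({
//   database: () => db.query('SELECT 1'),
//   redis: () => redisClient.ping()
// })
// res.status(report.status === 'ok' ? 200 : 503).json(report)
```

Note that the timeout only stops waiting; it does not cancel the underlying query, so the per-check deadline should stay below the probe timeout configured in your orchestrator.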
Alerting - PagerDuty, Slack, Grafana Alerts¶
Metrics without alerts are just graphs - you need automated notifications
# Prometheus alert rules
groups:
  - name: backend-alerts
    rules:
      - alert: HighErrorRate
        # sum() both sides first: dividing the raw series would only match 5xx series against themselves
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 5 minutes"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        # Aggregate buckets across routes and instances, keeping the le label histogram_quantile needs
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency > 2 seconds"
      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
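The `for:` clause in these rules is what keeps a brief spike from paging anyone: the expression has to stay true for the whole duration before the alert moves from pending to firing. The toy state machine below illustrates that behaviour under simplifying assumptions (one series, evaluations at arbitrary timestamps). `createAlert` and `evaluate` are invented for this example and are not a Prometheus API.

```javascript
// Returns an evaluate(conditionTrue, nowMs) function that reports the alert state.
function createAlert(forMs) {
  let activeSince = null // timestamp at which the condition first became true
  return function evaluate(conditionTrue, nowMs) {
    if (!conditionTrue) {
      activeSince = null // any false evaluation resets the clock
      return 'inactive'
    }
    if (activeSince === null) activeSince = nowMs
    return nowMs - activeSince >= forMs ? 'firing' : 'pending'
  }
}

const MINUTE = 60 * 1000
const highErrorRate = createAlert(5 * MINUTE) // mirrors "for: 5m"
console.log(highErrorRate(true, 0))           // condition just became true
console.log(highErrorRate(true, 4 * MINUTE))  // still within the 5 minute window
console.log(highErrorRate(true, 5 * MINUTE))  // held for the full duration
console.log(highErrorRate(false, 6 * MINUTE)) // recovered, clock resets
```

A shorter `for:` pages faster but is noisier; a longer one trades detection time for fewer false alarms, which is why `InstanceDown` above uses 1m while the rate-based alerts use 5m.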
Dashboards - Grafana Example¶
Create a Grafana dashboard with:
* RPS panel - requests per second (rate of http_requests_total)
* Error rate panel - % of 5xx errors
* Latency panel - p50, p95, p99 lines
* Active users panel - unique users over time
* Database panel - query time, pool utilization
* System panel - CPU, memory, Node event loop lag
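Event loop lag, the last item on the system panel, is the one Node-specific number in that list, so it helps to see where it comes from. Node ships a sampler for it in `perf_hooks`. The sketch below wraps it in a small helper; `startEventLoopLagMonitor` and the field names are made up for this example. As far as I know, prom-client's default metrics already export an event loop lag gauge, so check that before wiring in a custom one.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks')

// Samples event loop delay in the background; snapshot() reports and resets the window.
function startEventLoopLagMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs })
  histogram.enable()
  return {
    snapshot() {
      // The histogram records nanoseconds; convert to milliseconds for dashboards
      const snap = {
        meanMs: histogram.mean / 1e6,
        p99Ms: histogram.percentile(99) / 1e6,
        maxMs: histogram.max / 1e6
      }
      histogram.reset()
      return snap
    },
    stop() {
      histogram.disable()
    }
  }
}
```

Calling `snapshot()` on a fixed interval, for example once per scrape, gives you a per-window mean, p99 and max that can be logged or copied into gauges. A p99 that climbs while CPU stays flat usually points at synchronous work blocking the loop.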