Monitoring and Logging
If you don't know your app is down until a user tells you, that's not monitoring, that's ignorance. Real monitoring tells you before users notice, pinpoints the failing component, and ideally routes the alert to someone who can fix it before it becomes an incident. Logging without alerting is just a story you tell yourself after the fire.
Application Monitoring with Prometheus
Prometheus scrapes a metrics endpoint exposed by your Node app and stores the results as time-series data.
# Install prom-client
npm install prom-client
// metrics.js - set up Prometheus metrics
const prometheus = require('prom-client')
// Create registry
const registry = new prometheus.Registry()
prometheus.collectDefaultMetrics({ register: registry })
// Each metric is registered on the custom registry via `registers`;
// without it the metric lands on the global default registry and
// never shows up in registry.metrics()
// HTTP request duration histogram
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [registry]
})
// Active requests gauge
const activeRequests = new prometheus.Gauge({
  name: 'http_requests_active',
  help: 'Number of currently active HTTP requests',
  registers: [registry]
})
// Error counter
const errorCounter = new prometheus.Counter({
  name: 'http_errors_total',
  help: 'Total number of HTTP errors',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry]
})
module.exports = {
  registry,
  httpRequestDuration,
  activeRequests,
  errorCounter
}
// app.js - instrument your Express app
const express = require('express')
const { registry, httpRequestDuration, activeRequests, errorCounter } = require('./metrics')
const app = express()
// Track request duration and active count
app.use((req, res, next) => {
  activeRequests.inc()
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    // Label by route pattern, not raw URL, to keep label cardinality bounded
    const labels = { method: req.method, route: req.route?.path || 'unknown', status_code: res.statusCode }
    end(labels)
    activeRequests.dec()
    if (res.statusCode >= 400) {
      errorCounter.inc(labels)
    }
  })
  next()
})
// Metrics endpoint - Prometheus scrapes this
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType)
  res.end(await registry.metrics())
})
// Health check - liveness probe
app.get('/healthz', (req, res) => {
  res.json({ status: 'ok', timestamp: Date.now() })
})
// Readiness check - is app ready to serve traffic?
// db and redis are assumed to be your already-connected clients
app.get('/readyz', async (req, res) => {
  // ping() returns a promise in most clients; an un-awaited promise is
  // always truthy, so the results must be awaited before they are checked
  const [dbResult, redisResult] = await Promise.allSettled([db.ping(), redis.ping()])
  const checks = {
    db: dbResult.status === 'fulfilled',
    redis: redisResult.status === 'fulfilled'
  }
  const ready = checks.db && checks.redis
  res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not ready', checks })
})
# Prometheus configuration - scrape your app
# prometheus.yml
scrape_configs:
  - job_name: 'node_app'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s
Key metrics to track:
- Request rate - requests per second, by route and method
- Error rate - 4xx and 5xx responses, as percentage of total
- Latency percentiles - p50, p95, p99 response times
- Active connections - gauge of concurrent requests
- Event loop lag - how long the event loop is blocked
- Garbage collection - GC pauses and frequency
- Memory usage - heap used, heap total, external memory
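The latency percentiles in the list above are not stored directly: Prometheus estimates them from the cumulative bucket counts of a histogram. The sketch below is a simplified, hypothetical re-implementation of that linear interpolation, written only to show why bucket boundaries limit the precision of p95 and p99. The function name `histogramQuantile` and the `{ le, count }` input shape are assumptions made for this example, not part of prom-client.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
// with the last entry using le = Infinity (the "+Inf" bucket).
// Assumes 0 < q <= 1.
function histogramQuantile(q, buckets) {
  if (buckets.length === 0) return NaN
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN

  const rank = q * total
  // First bucket whose cumulative count reaches the target rank
  const i = buckets.findIndex(b => b.count >= rank)

  // The quantile falls beyond the last finite bound: the best available
  // answer is the highest finite bucket boundary
  if (buckets[i].le === Infinity) {
    return i > 0 ? buckets[i - 1].le : NaN
  }

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count
  const inBucket = buckets[i].count - lowerCount

  // Assume observations are spread evenly inside the bucket
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket)
}

// Example: 100 requests, most of them under 100 ms
const sample = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 }
]
console.log('estimated p95 (seconds):', histogramQuantile(0.95, sample))
```

Because the estimate is interpolated inside one bucket, a p95 that falls in a wide bucket such as 0.5 to 1 second can be off by a large fraction of that width. Adding buckets around your latency target tightens the estimate.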
Health Check Endpoints
Kubernetes and Docker rely on health checks to decide whether your app is alive and whether it should receive traffic.
// liveness - is the process running?
// Kubernetes: if this fails , kubelet restarts the pod
app.get('/healthz', (req, res) => {
res.json({ status: 'ok' })
})
// readiness - is the app ready to serve traffic?
// Kubernetes: if this fails, pod is removed from service endpoints
app.get('/readyz', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis()
  }
  // uptime is reported next to the checks, not inside them: it is a number
  // with no status field, so it would make every() fail and return 503 forever
  const uptime = process.uptime()
  const allHealthy = Object.values(checks).every(c => c.status === 'ok')
  if (!allHealthy) {
    res.status(503).json({ status: 'degraded', uptime, checks })
    return
  }
  res.json({ status: 'ready', uptime, checks })
})
async function checkDatabase() {
try {
await db.raw('SELECT 1')
return { status: 'ok' }
} catch (err) {
return { status: 'error', message: 'cannot connect to database' }
}
}
async function checkRedis() {
try {
await redis.ping()
return { status: 'ok' }
} catch (err) {
return { status: 'error', message: 'cannot connect to redis' }
}
}
Health checks should be:
- Fast - time out in under 5 seconds
- Lightweight - don't run complex queries, just ping
- Separate from authenticated endpoints - monitoring systems won't have auth tokens
- Distinct - liveness and readiness are different things
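The "fast" requirement is easy to break: if the database hangs, a plain `await` inside the readiness handler hangs with it and the probe never answers. One possible way to bound each check is sketched below. `withTimeout` is a hypothetical helper introduced for this example; it returns the same `{ status, message }` shape used by the check functions above.

```javascript
// Run a health check, but give up after `ms` milliseconds.
// `check` is any function returning a promise of { status: 'ok' | 'error' }.
function withTimeout(check, ms) {
  let timer
  const timeout = new Promise(resolve => {
    timer = setTimeout(
      () => resolve({ status: 'error', message: `timed out after ${ms}ms` }),
      ms
    )
  })

  const guarded = Promise.resolve()
    .then(check)
    // A throwing check becomes an error result instead of a rejected promise
    .catch(err => ({ status: 'error', message: err.message }))

  // Whichever settles first wins; the timer is always cleaned up
  return Promise.race([guarded, timeout]).finally(() => clearTimeout(timer))
}

// Example: a check that never answers is reported as an error after 100 ms
withTimeout(() => new Promise(() => {}), 100).then(result => {
  console.log(result.status, '-', result.message)
})
```

In the readiness handler this would wrap the existing checks, for example `database: await withTimeout(checkDatabase, 2000)`. The 2000 ms figure is only an illustration; pick a value below the probe timeout configured in your orchestrator.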
Structured Logging with Winston/Pino
console.log is not logging - it's unstructured noise.
Structured logging emits JSON objects that log aggregators can parse, index, and alert on.
// logger.js - Winston with structured JSON output
const winston = require('winston')
const { combine, timestamp, json, errors, printf } = winston.format
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: combine(
errors({ stack: true }),
timestamp(),
json()
),
defaultMeta: { service: 'myapp-api', environment: process.env.NODE_ENV },
transports: [
new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
new winston.transports.File({ filename: 'logs/combined.log' })
]
})
// In development, also log to console with colors
if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: combine(
      // colorize() is what actually adds the colors to the level
      winston.format.colorize(),
      printf(({ level, message, timestamp, ...meta }) => {
        return `${timestamp} [${level}]: ${message} ${Object.keys(meta).length ? JSON.stringify(meta) : ''}`
      })
    )
  }))
}
module.exports = logger
// Pino - faster than Winston, less overhead per log line
const pino = require('pino')
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label })
},
redact: ['req.headers.authorization', 'req.body.password', 'req.body.token'],
transport: process.env.NODE_ENV !== 'production'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined
})
// Express middleware for pino
const pinoHttp = require('pino-http')({ logger })
app.use(pinoHttp)
// Usage:
// app.use(pinoHttp) // auto-logs every request
// logger.info({ userId: 123 }, 'user action')
// logger.error({ err }, 'operation failed')
What to log:
- Request method, path, status code, duration
- User ID for authenticated requests (for audit trails)
- Error stack traces (at error level only)
- Database query times (at debug level)
- Third-party API response times
What NEVER to log:
- Passwords, tokens, API keys - redact them
- Full request bodies that might contain PII
- Database connection strings
- Session tokens or JWT values
// Express middleware - redact sensitive fields
app.use((req, res, next) => {
const safeBody = { ...req.body }
delete safeBody.password
delete safeBody.token
delete safeBody.secret
logger.info({ method: req.method, path: req.path, body: safeBody }, 'incoming request')
next()
})
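The middleware above only removes top-level fields, so a secret nested inside `req.body.user` or inside an array element would still reach the log. A recursive variant is sketched below as one possible approach. `redactDeep` and the `SENSITIVE_KEYS` list are hypothetical names chosen for this example, and the sketch assumes plain JSON-like input without circular references.

```javascript
// Keys whose values should never reach the logs (compared case-insensitively)
const SENSITIVE_KEYS = new Set(['password', 'token', 'secret', 'authorization'])

// Return a copy of `value` with sensitive fields replaced at any depth.
// The input is not mutated.
function redactDeep(value, sensitive = SENSITIVE_KEYS) {
  if (Array.isArray(value)) {
    return value.map(item => redactDeep(item, sensitive))
  }
  if (value === null || typeof value !== 'object') {
    return value
  }
  const copy = {}
  for (const [key, inner] of Object.entries(value)) {
    copy[key] = sensitive.has(key.toLowerCase())
      ? '[REDACTED]'
      : redactDeep(inner, sensitive)
  }
  return copy
}

// Example: nested and array-contained secrets are both masked
const body = {
  user: { email: 'a@example.com', password: 'hunter2' },
  devices: [{ id: 1, token: 'abc' }]
}
console.log(JSON.stringify(redactDeep(body)))
```

A key-name list cannot catch PII stored under innocent-looking keys, so logging an allowlist of known-safe fields remains the safer default where it is practical.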
APM Tools - New Relic, Datadog, Sentry
Application Performance Monitoring catches what Prometheus metrics miss - slow database queries, memory leaks, and transaction traces.
# New Relic
npm install newrelic
// newrelic.js
exports.config = {
  app_name: ['MyApp'],
  license_key: process.env.NEW_RELIC_LICENSE_KEY,
  logging: { level: 'info' },
  allow_all_headers: true,
  attributes: { exclude: ['request.headers.cookie', 'request.headers.authorization'] }
}
// In your app - require at the very top, before anything else
require('newrelic')
# Sentry - error tracking with context
npm install @sentry/node @sentry/profiling-node
const Sentry = require('@sentry/node')
const { nodeProfilingIntegration } = require('@sentry/profiling-node')
Sentry.init({
dsn: process.env.SENTRY_DSN,
integrations: [nodeProfilingIntegration()],
tracesSampleRate: 1.0,
profilesSampleRate: 1.0,
environment: process.env.NODE_ENV
})
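// Sentry.Handlers is the v7 SDK API; v8 replaces these calls with Sentry.setupExpressErrorHandler(app)
// The request and tracing handlers must come before your routes
app.use(Sentry.Handlers.requestHandler())
app.use(Sentry.Handlers.tracingHandler())
// Routes here...
// The error handler goes after the routes and before any other error middleware
app.use(Sentry.Handlers.errorHandler())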
APM tools add overhead - roughly 2-5% CPU, depending on sampling rate.
That is worth it for catching the 0.1% of slow requests that break your p99. Disable them in development.
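Since the overhead scales with the sampling rate, tracing every request is rarely needed in production. The sketch below shows one hypothetical policy as a pure function: no tracing outside production, none for probe and metrics endpoints, full tracing for one critical route, and 10% elsewhere. The function name, the `/api/checkout` route, and the rates are all assumptions for illustration. Sentry's `init` accepts a `tracesSampler` function in place of `tracesSampleRate`, but the shape of its argument differs between SDK versions, so the wiring is left out here.

```javascript
// Endpoints hit constantly by probes and scrapers; tracing them is noise
const UNTRACED_ROUTES = new Set(['/healthz', '/readyz', '/metrics'])

// Decide what fraction of requests to trace for a given route.
// Returns a number between 0 (never) and 1 (always).
function traceSampleRate(routeName, env = process.env.NODE_ENV) {
  if (env !== 'production') return 0          // no APM overhead in development
  if (UNTRACED_ROUTES.has(routeName)) return 0
  if (routeName.startsWith('/api/checkout')) return 1 // hypothetical critical path
  return 0.1                                  // sample 10% of everything else
}

console.log(traceSampleRate('/api/users', 'production'))
console.log(traceSampleRate('/healthz', 'production'))
```

Keeping the policy in a plain function makes it easy to unit test and to reuse if you switch APM vendors.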
Uptime Monitoring
Health checks are only useful if something checks them.
# External monitoring with curl
while true; do
  # --max-time keeps a hung server from stalling the loop
  status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" https://api.myapp.com/healthz)
  if [ "$status" != "200" ]; then
    echo "Health check failed: HTTP $status"
    # Send alert
    curl -s -X POST -H "Content-Type: application/json" \
      -d '{"text": "Health check failed for api.myapp.com"}' \
      "$SLACK_WEBHOOK_URL"
  fi
  sleep 30
done
External monitoring services:
- Pingdom - checks every minute from 3 locations
- UptimeRobot - free tier checks every 5 minutes
- Better Uptime - status pages with incident management
- Checkly - Playwright-based monitoring, tests real browser flows
External monitoring catches ISP outages, DNS failures, and CDN problems - internal health checks won't detect issues between your server and the internet.
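A polling loop that alerts on every failed check will post a message every 30 seconds for the length of an outage, and will also fire on a single dropped packet. Hosted services usually handle this by alerting on state changes only. The sketch below shows that idea as a small state machine; `createCheckTracker` and the default threshold of 3 are hypothetical choices for this example.

```javascript
// Track consecutive check results and report only state transitions.
// record(ok) returns 'alert' when the failure threshold is first reached,
// 'recovered' on the first success after an alert, and 'none' otherwise.
function createCheckTracker(failureThreshold = 3) {
  let consecutiveFailures = 0
  let alerting = false

  return function record(ok) {
    if (ok) {
      consecutiveFailures = 0
      if (alerting) {
        alerting = false
        return 'recovered'
      }
      return 'none'
    }

    consecutiveFailures += 1
    if (!alerting && consecutiveFailures >= failureThreshold) {
      alerting = true
      return 'alert'
    }
    return 'none'
  }
}

// Example: four failures followed by two successes
const record = createCheckTracker(3)
const results = [false, false, false, false, true, true].map(ok => record(ok))
console.log(results.join(', '))
```

With a 30 second interval and a threshold of 3, the first notification arrives about 90 seconds into an outage, which trades a little detection speed for far fewer false alarms.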
Alerting
Metrics without alerting are data hoarding.
# Prometheus alerting rules - evaluated by Prometheus, which sends firing alerts to Alertmanager
# alertmanager-rules.yml (list this file under rule_files in prometheus.yml)
groups:
  - name: node_app
    rules:
      - alert: HighErrorRate
        # The app defines no http_requests_total counter; the histogram's
        # _count series already counts every request
        expr: |
          sum(rate(http_errors_total{status_code=~"5.."}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 5 minutes"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        # Aggregate by le so one p95 is computed across all routes and instances
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency > 2 seconds"
          description: "p95 latency is {{ $value }} seconds"
      - alert: HighMemoryUsage
        # Resident / virtual memory is not a usage ratio; compare heap used
        # to heap total, both exported by collectDefaultMetrics
        expr: |
          nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Heap usage > 80% for 10 minutes"
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
Route alerts to the right channel - critical goes to PagerDuty/OpsGenie, warnings to Slack/Teams, info to email.
# alertmanager.yml
# Alertmanager does not expand environment variables: replace the $PAGERDUTY_KEY
# and $SLACK_WEBHOOK placeholders with real values when you render this file
route:
  receiver: 'slack-warnings'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: $PAGERDUTY_KEY
  - name: 'slack-warnings'
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
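The routing tree above can be read as: try each child route in order, use the first one whose labels all match, and otherwise fall back to the top-level receiver. The sketch below models just that rule in plain JavaScript so the behavior can be checked before touching a live Alertmanager. It is a simplified illustration, not Alertmanager's implementation: nested routes, regex matchers, and `continue` are deliberately left out, and `pickReceiver` is a hypothetical name.

```javascript
// A JavaScript mirror of the route section shown above
const routing = {
  receiver: 'slack-warnings',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pagerduty-critical' }
  ]
}

// Return the receiver name an alert with these labels would be sent to
function pickReceiver(labels, config = routing) {
  for (const route of config.routes) {
    const allLabelsMatch = Object.entries(route.match)
      .every(([name, value]) => labels[name] === value)
    if (allLabelsMatch) return route.receiver
  }
  // No child route matched: the default receiver handles the alert
  return config.receiver
}

console.log(pickReceiver({ alertname: 'InstanceDown', severity: 'critical' }))
console.log(pickReceiver({ alertname: 'HighLatency', severity: 'warning' }))
```

Note that an alert with an unexpected severity, or none at all, ends up in the default Slack channel, so every rule should set a `severity` label deliberately.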
Security Monitoring
Watch for attackers before they escalate from probe to breach.
// Log authentication failures
app.post('/api/login', async (req, res) => {
  const user = await User.findOne({ email: req.body.email })
  // bcrypt.compare is async, so hashing does not block the event loop like compareSync
  const passwordOk = user ? await bcrypt.compare(req.body.password, user.password) : false
  if (!passwordOk) {
    logger.warn({
      event: 'auth_failed',
      email: req.body.email,
      ip: req.ip,
      userAgent: req.get('User-Agent'),
      timestamp: Date.now()
    }, 'failed login attempt')
    // Track consecutive failures for this IP; INCR returns the new count,
    // so no separate GET (which would return a string) is needed
    const key = `auth_fail:${req.ip}`
    const attempts = await redis.incr(key)
    await redis.expire(key, 900) // 15 min window
    if (attempts > 20) {
      logger.error({
        event: 'brute_force_detected',
        ip: req.ip,
        attempts
      }, 'possible brute force attack')
      // Alert: block IP or notify security team
    }
    return res.status(401).json({ error: 'invalid credentials' })
  }
  // Reset failure count on success
  await redis.del(`auth_fail:${req.ip}`)
  logger.info({ event: 'auth_success', email: req.body.email }, 'successful login')
  // Issue the session or token here; without a response the request hangs
  res.json({ status: 'ok' })
})
Security events to monitor:
- Failed login attempts - threshold-based alerts for brute force
- Unauthorized access attempts - 403 responses to protected routes
- Suspicious payload patterns - SQL injection attempts, XSS probes
- API key usage spikes - possible credential stuffing
- Rate limit exceeded - possible DDoS or misconfigured client
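The threshold-based alerts in the list all rest on the same mechanism: count events per key inside a time window and flag the key once the count passes a limit. The sketch below shows that mechanism in memory, with an injectable clock so it can be tested without waiting. `FailureWindow` and its option names are hypothetical, and an in-process Map only works for a single instance; a shared store such as Redis is needed once you run several.

```javascript
// Sliding-window failure counter keyed by IP, user, or API key
class FailureWindow {
  constructor({ windowMs = 15 * 60 * 1000, threshold = 20, now = Date.now } = {}) {
    this.windowMs = windowMs
    this.threshold = threshold
    this.now = now               // injectable clock, useful in tests
    this.failures = new Map()    // key -> timestamps of recent failures
  }

  // Record one failure and report the current count for this key
  recordFailure(key) {
    const current = this.now()
    const recent = (this.failures.get(key) || [])
      .filter(timestamp => current - timestamp < this.windowMs)
    recent.push(current)
    this.failures.set(key, recent)
    return { attempts: recent.length, suspicious: recent.length > this.threshold }
  }

  // Forget a key, for example after a successful login
  reset(key) {
    this.failures.delete(key)
  }
}

// Example: flag an IP after more than 2 failures within one second
let fakeTime = 0
const tracker = new FailureWindow({ windowMs: 1000, threshold: 2, now: () => fakeTime })
for (const time of [0, 100, 200]) {
  fakeTime = time
  console.log(tracker.recordFailure('203.0.113.7'))
}
```

Unlike a counter with a fixed expiry, a sliding window only counts failures that are actually recent, so a slow trickle of typos from a shared office IP is less likely to trip the alert.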
# Prometheus alert for brute force
- alert: BruteForceDetected
  expr: |
    rate(http_errors_total{route="/api/login",status_code="401"}[5m]) > 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Possible brute force attack on login endpoint"
    description: "401 errors on /api/login: {{ $value }} per second"
Prerequisites
- deploy_04_reverse_proxy.md - reverse proxy before monitoring deployment
next -> adv_01_child_process.md