Indexes & Performance¶

Without indexes , your queries scan every row in the table That's fine for 100 users. Awful for 100,000. Catastrophic for 10 million and you might as well just SELECT * FROM pain because your users will timeout waiting for pages to load while Postgres grinds through sequential scans like it's 1985 and nobody told it about the internet yet

B-tree indexes - the default for a reason¶

B-tree (balanced tree) is the default index type in Postgres and handles most scenarios: equality checks (WHERE id = 5), range queries (WHERE created_at > '2024-01-01'), sorting (ORDER BY email), and pattern matching with prefix wildcards (WHERE email LIKE 'ali@%'). It maintains sorted order in a tree structure where each node points to child nodes, reducing search from O(n) sequential scan to O(log n) tree traversal

-- Create a B-tree index (implicit - this is the default)
CREATE INDEX idx_users_email ON app.users (email);

-- Explicit B-tree (same thing , just spelling it out)
CREATE INDEX idx_users_created_at ON app.users USING BTREE (created_at);

-- Multi-column (composite) index - column order matters
CREATE INDEX idx_users_role_created ON app.users (role, created_at);
-- This index helps: WHERE role = 'admin' AND created_at > '2024-01-01'
-- This index helps: WHERE role = 'admin' ORDER BY created_at
-- This index does NOT help: WHERE created_at > '2024-01-01' (role not in query)

The column order in composite indexes is critical: put the column used in equality conditions first, then range/order columns. Postgres can use the index for the leading columns and skip to the next ones - but if the first column isn't in your WHERE clause, the index is useless for that query

Hash indexes - exact matches only¶

Hash indexes are optimized for equality comparisons (WHERE id = hash_value) and are smaller than B-tree for that specific use case. They don't support range queries or sorting. Since Postgres 10, hash indexes are WAL-logged and safe for production - before that they were crash-prone and nobody used them

CREATE INDEX idx_sessions_token_hash ON app.sessions USING HASH (token_hash);
-- Only useful for: WHERE token_hash = 'abc123...'
-- Useless for: WHERE token_hash > 'abc', ORDER BY token_hash

Hash indexes are niche. B-tree handles equality just fine for most workloads. Only consider hash when the index is frequently written to and you need the smaller storage footprint for equality-only lookups

GIN indexes - for JSONB and arrays¶

GIN (Generalized Inverted Index) is designed for composite types where each row contains multiple values: JSONB documents, arrays, full-text search vectors. GIN maps each element of the composite value back to the row, enabling fast searches within documents

-- Index JSONB data for fast key lookups
CREATE INDEX idx_users_metadata ON app.users USING GIN (metadata);
-- Now this is fast: SELECT * FROM users WHERE metadata @> '{"verified": true}'

-- Index array columns
CREATE INDEX idx_tags ON app.articles USING GIN (tags);
-- Fast array contains: SELECT * FROM articles WHERE tags @> ARRAY['security']

-- Full-text search index
CREATE INDEX idx_articles_search ON app.articles USING GIN (to_tsvector('english', body));
-- Enables: SELECT * FROM articles WHERE to_tsvector('english', body) @@ to_tsquery('sql & injection');

GiST indexes - geospatial and ranges¶

GiST (Generalized Search Tree) is for geographic coordinates, geometric data, and range types where you need "contains", "overlaps", or "nearest neighbor" operations

-- Requires PostGIS extension for geospatial
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE INDEX idx_locations_geo ON app.locations USING GIST (geom);
-- Enables: SELECT * FROM locations WHERE ST_DWithin(geom, ST_MakePoint(-73.9, 40.7), 1000);

EXPLAIN ANALYZE - stop guessing¶

The single most valuable debugging tool for slow queries is EXPLAIN ANALYZE which shows you exactly how Postgres executes a query: which indexes it uses, how many rows it examines, where the time goes. If you're optimizing queries without looking at the plan, you're operating on vibes not data

-- See the query plan without executing
EXPLAIN SELECT * FROM app.users WHERE email = 'omar@example.com';

-- Execute and measure (actually runs the query)
EXPLAIN ANALYZE SELECT * FROM app.users WHERE email = 'omar@example.com';

Output breakdown:

Seq Scan on users  (cost=0.00..35.50 rows=1 width=72) (actual time=0.021..12.342 rows=1 loops=1)
  Filter: ((email)::text = 'omar@example.com'::text)
  Rows Removed by Filter: 9999
Planning Time: 0.085 ms
Execution Time: 12.370 ms

When you see Seq Scan on a large table, that means Postgres read every single row (shown in Rows Removed by Filter). Adding an index changes this to Index Scan with dramatically lower execution time because Postgres traverses the tree instead of scanning sequentially

-- After creating the index
EXPLAIN ANALYZE SELECT * FROM app.users WHERE email = 'omar@example.com';

Index Scan using idx_users_email on users  (cost=0.28..8.29 rows=1 width=72) (actual time=0.019..0.020 rows=1 loops=1)
  Index Cond: ((email)::text = 'omar@example.com'::text)
Planning Time: 0.061 ms
Execution Time: 0.035 ms

12 milliseconds to 0.035 milliseconds - that's a 343x improvement from one index. This is why index tuning is non-negotiable for any database handling more than trivial traffic

Index-only scans¶

When all the columns a query needs exist within the index itself, Postgres can answer the query without touching the actual table at all. This is the fastest possible scan type because it reads only the compact index structure

-- Create covering index with INCLUDE (Postgres 11+)
CREATE INDEX idx_users_email_covering ON app.users (email) INCLUDE (username, role);
-- Now SELECT email, username, role FROM users WHERE email = 'x' can be index-only

-- Without INCLUDE, Postgres reads the index then fetches the table row
-- With INCLUDE, all needed data is in the index - one read , no table access

Detecting missing indexes¶

-- Find sequential scans on large tables (most-likely missing indexes)
SELECT
    schemaname,
    relname AS table_name,
    seq_scan,
    seq_tup_read,
    seq_tup_read / (seq_scan + 1) AS avg_rows_per_scan
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND seq_tup_read > 10000
ORDER BY seq_tup_read DESC;

-- Find queries waiting on locks (performance bottlenecks)
SELECT
    blocked.pid AS blocked_pid,
    blocker.pid AS blocker_pid,
    blocked.query AS blocked_query,
    blocker.query AS blocker_query
FROM pg_catalog.pg_locks blocked
JOIN pg_catalog.pg_locks blocker ON blocker.locktype = blocked.locktype
WHERE NOT blocked.granted;

Index maintenance¶

Indexes aren't free - they slow down writes (INSERT , UPDATE , DELETE) because the index structure must be updated with every data modification. On write-heavy tables with many indexes, each write operation multiplies by the number of indexes

-- Rebuild an index (reclaims space from deleted entries)
REINDEX INDEX idx_users_email;

-- Rebuild all indexes on a table
REINDEX TABLE app.users;

-- Monitor index size
SELECT
    indexname,
    indexdef,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_indexes
WHERE tablename = 'users'
ORDER BY pg_relation_size(indexrelid) DESC;

Security - index-based timing attacks¶

Indexes can be exploited for side-channel information disclosure. Attackers can use timing measurements to infer whether certain values exist in indexed columns through carefully crafted boolean-based blind queries where the response time difference between "found in index" (fast) and "not found in index" (sequential scan) reveals information about your data

-- Attacker can measure response times:
-- If this is fast -> user exists (index lookup)
-- If this is slow -> user doesn't exist (seq scan or full scan)
SELECT * FROM users WHERE email = 'admin@example.com';

Mitigation: ensure error messages don't reveal query plans , use consistent timing in authentication responses (always hash and compare, never short-circuit on "user not found"), and consider query statement timeout to cap worst-case execution time

-- Set statement timeout to prevent runaway queries
SET statement_timeout = '30s';

-- Also set at database level for safety
ALTER DATABASE mydb SET statement_timeout = '30000';

prerequisites¶

db_02_postgres_crud.md - you need a table with a few thousand rows to practice EXPLAIN ANALYZE on, otherwise the sequential scan vs index scan difference will be invisible

next → db_04_postgres_advanced.md