DevOps

All 43 notes on one page

Linux Fundamentals

1

Linux Filesystem and Navigation

beginner linux filesystem commands FHS

Every Linux system follows a standard directory layout called the Filesystem Hierarchy Standard (FHS). Once we learn where things live, navigating any Linux box feels familiar.

The FHS Tree

Linux Filesystem Hierarchy
/  ← root of everything
 ├── /home    ← user home directories (~)
 ├── /etc     ← system config files (nginx.conf, ssh, cron)
 ├── /var     ← variable data (logs, databases, mail)
 │   └── /var/log  ← system logs live here
 ├── /usr     ← user programs and libraries
 │   ├── /usr/bin   ← most user commands
 │   └── /usr/local  ← manually installed software
 ├── /bin     ← essential binaries (ls, cp, mv)
 ├── /sbin    ← system binaries (iptables, fdisk)
 ├── /tmp     ← temporary files (often cleared on reboot)
 ├── /opt     ← optional/third-party software
 ├── /dev     ← device files (disks, terminals)
 └── /proc    ← virtual filesystem (process/system info)

The key takeaway: config goes in /etc, logs go in /var/log, and our stuff lives in /home.

Essential Navigation Commands

pwd                    # print current directory
ls -la                 # list all files with details (including hidden)
cd /var/log            # change to a specific directory
cd ~                   # go home (same as just "cd")
cd -                   # go back to the previous directory
mkdir -p app/src/utils # create nested directories in one shot

Viewing and Searching Files

cat config.json        # dump entire file to screen
head -20 access.log    # first 20 lines
tail -f app.log        # follow log output in real-time (Ctrl+C to stop)
wc -l data.csv         # count lines in a file

grep is our best friend for searching inside files.

grep "error" app.log              # find lines containing "error"
grep -i "warning" app.log         # case-insensitive search
grep -r "TODO" src/               # recursive search through directories
grep -n "function" index.js       # show line numbers

find searches for files by name, type, or age.

find /var/log -name "*.log"       # find all .log files
find . -type f -mtime -1          # files modified in the last 24 hours
find . -name "*.tmp" -delete      # find and delete .tmp files

Pipes and Redirection

Pipes (|) send the output of one command into another. Think of it like an assembly line.

cat access.log | grep "404" | wc -l   # count 404 errors
ps aux | grep nginx                     # find nginx processes
history | grep "docker"                 # search command history

Redirection sends output to files instead of the screen.

echo "hello" > file.txt     # write to file (overwrites!)
echo "world" >> file.txt    # append to file (safe)
cat missing.txt 2> err.log  # redirect errors (stderr) to a file
sort data.txt > sorted.txt  # sort and save to a new file
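
To see the two output streams being split, here is a minimal sketch. The helper name emit_both and the temp directory are made up for illustration; only the redirection operators come from the notes above.

```shell
#!/bin/bash
# emit_both is a hypothetical helper: one line to stdout, one line to stderr.
emit_both() {
    echo "normal output"
    echo "something went wrong" >&2
}

tmpdir=$(mktemp -d)

# Split the two streams into separate files.
emit_both > "$tmpdir/out.log" 2> "$tmpdir/err.log"

# Send both streams to the same file. Order matters: stdout is pointed
# at the file first, then stderr is pointed at wherever stdout goes.
emit_both > "$tmpdir/all.log" 2>&1

cat "$tmpdir/out.log"   # normal output
cat "$tmpdir/err.log"   # something went wrong
rm -rf "$tmpdir"
```

Writing 2>&1 before the file redirection would leave stderr on the terminal, which is a common source of confusion.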

Wildcards

ls *.log          # all files ending in .log
ls app.???        # app followed by exactly 3 characters
rm temp_[0-9]*    # remove files starting with temp_ followed by a digit

Quick awk Intro

awk is great for pulling columns out of structured text.

# print the 1st and 3rd column from a space-separated file
awk '{print $1, $3}' access.log

# print lines where the 5th column is greater than 500
awk '$5 > 500' access.log
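
To try these without a real access.log, here is a small sketch with made-up data. The sample_log function and its five-column layout (IP, method, path, status, bytes) are assumptions for illustration, not a real log format.

```shell
#!/bin/bash
# Hypothetical sample data: client IP, method, path, status, response bytes.
sample_log() {
    cat <<'EOF'
10.0.0.1 GET /index.html 200 512
10.0.0.2 GET /missing 404 120
10.0.0.3 POST /api/login 200 987
EOF
}

# Columns 1 and 3: client IP and path.
sample_log | awk '{print $1, $3}'

# Only rows whose 5th column (bytes) is greater than 500.
sample_log | awk '$5 > 500'

# Count requests per status code (column 4) using an awk array.
sample_log | awk '{count[$4]++} END {for (s in count) print s, count[s]}' | sort
```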

In simple language, the Linux filesystem is like a well-organized filing cabinet. Once we know which drawer (/etc, /var, /home) holds what, we can find anything fast.


2

File Permissions and Ownership

beginner linux permissions chmod chown

Linux is a multi-user system. Permissions control who can read, write, or execute a file. Every single file and directory has an owner, a group, and a set of permission bits.

Reading ls -l Output

When we run ls -l, we get something like this:

-rw-r--r-- 1 manish developers 4096 Mar 15 10:30 deploy.sh

Let’s break that down: the first character is the file type (- for file, d for directory, l for symlink). The next 9 characters are the permission bits.

Permission Bits Breakdown
type   owner   group   others
-      rw-     r--     r--
r = read (4), w = write (2), x = execute (1), - = no permission (0)

So rw-r--r-- means: owner can read+write, group can read, everyone else can read.

Numeric (Octal) Permissions

Each permission has a numeric value: r=4, w=2, x=1. We add them up per group.

chmod 755 deploy.sh    # rwxr-xr-x  (owner: all, group: read+exec, others: read+exec)
chmod 644 config.yml   # rw-r--r--  (owner: read+write, everyone else: read)
chmod 600 secret.key   # rw-------  (only owner can read+write)
chmod 700 scripts/     # rwx------  (only owner has full access)

Common combos to memorize: 755 for scripts/directories, 644 for regular files, 600 for secrets.
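
The add-them-up rule can be sketched as a tiny converter. The function name to_octal is hypothetical, and it only understands plain r/w/x/- bits (no SUID, SGID, or sticky bit).

```shell
#!/bin/bash
# to_octal is a hypothetical helper: "rwxr-xr-x" -> 755.
# It only handles the nine plain permission characters.
to_octal() {
    local perms=$1 result="" i digit triplet
    for i in 0 3 6; do                 # owner, group, others
        triplet=${perms:i:3}
        digit=0
        [[ ${triplet:0:1} == r ]] && (( digit += 4 ))
        [[ ${triplet:1:1} == w ]] && (( digit += 2 ))
        [[ ${triplet:2:1} == x ]] && (( digit += 1 ))
        result+=$digit
    done
    echo "$result"
}

to_octal rwxr-xr-x   # 755
to_octal rw-r--r--   # 644
to_octal rw-------   # 600
```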

Symbolic Permissions

We can also use letters with +, -, and =.

chmod +x deploy.sh         # add execute for everyone
chmod u+w config.yml       # add write for owner (u=user/owner)
chmod g-w shared.txt       # remove write from group
chmod o= secret.key        # remove all permissions for others
chmod u=rwx,g=rx,o= dir/  # set exact permissions

The letters: u = owner, g = group, o = others, a = all.

Changing Ownership

chown manish file.txt             # change owner
chown manish:developers file.txt  # change owner AND group
chown -R manish:www-data /var/www # recursive ownership change
chgrp developers project/         # change group only

umask — Default Permissions

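When we create a new file, umask determines the default permissions. It works by masking off bits from the maximum (666 for files, 777 for directories); with common masks like 022 the result looks like plain subtraction.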

umask              # show current mask (typically 0022)
umask 0027         # set new mask

# With umask 0022:
# new files  → 666 - 022 = 644 (rw-r--r--)
# new dirs   → 777 - 022 = 755 (rwxr-xr-x)
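
Strictly speaking, the mask clears bits rather than subtracting numbers. The two agree for masks like 022 and 027 but not for every mask. Here is a minimal sketch of that arithmetic; apply_umask is a hypothetical helper, not a real command.

```shell
#!/bin/bash
# apply_umask is a hypothetical helper: (base mode) AND NOT (mask), in octal.
# The leading 0 makes bash read each argument as an octal number.
apply_umask() {
    local base=$1 mask=$2
    printf '%03o\n' $(( 0$base & ~0$mask ))
}

apply_umask 666 022   # 644
apply_umask 777 022   # 755
apply_umask 666 027   # 640
apply_umask 666 033   # 644, not 633: bits that are already off stay off
```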

Special Permission Bits

These come up in interviews but are rarely changed day-to-day.

  • SUID (4xxx) — file runs as the file’s owner, not the user running it. Example: /usr/bin/passwd runs as root so users can change their own password.
  • SGID (2xxx) — on a directory, new files inherit the directory’s group. Great for shared project folders.
  • Sticky bit (1xxx) — on a directory, a file can only be deleted or renamed by its owner (or by the directory’s owner, or root). /tmp has this so users can’t delete each other’s temp files.

chmod 4755 special_script  # set SUID
chmod 2775 shared_dir/     # set SGID
chmod 1777 /tmp            # set sticky bit (already set on /tmp)
ls -ld /tmp                # drwxrwxrwt  ← the "t" means sticky bit

In simple language, permissions are just a 3x3 grid: three groups of people (owner, group, others) each get three toggles (read, write, execute). That’s the whole system.


3

Process Management

beginner linux processes systemd signals

Every running program on Linux is a process. Understanding how to view, control, and kill processes is essential for any server work.

Processes vs Threads

A process is an independent program with its own memory space. A thread is a lightweight unit of execution inside a process that shares memory with other threads in the same process.

When we run node server.js, that’s one process. If Node spawns worker threads internally, those are threads within that process.

Process States

Linux Process Lifecycle
Created (fork/exec) → Ready (waiting for CPU) → Running (executing on CPU) → Terminated (exit / killed)
Sleeping / Waiting (waiting for I/O or event) ↕ swaps with Running when I/O completes

A zombie process is one that finished but its parent hasn’t read its exit status yet. They’re harmless in small numbers but can pile up if the parent is buggy.

Viewing Processes

ps aux                      # list ALL processes with details
ps aux | grep nginx         # find a specific process
top                         # real-time process monitor (press q to quit)
htop                        # better version of top (install: apt install htop)
pgrep -f "node server"     # get PIDs of processes whose full command line matches

In ps aux output, the columns we care about most: USER, PID, %CPU, %MEM, and COMMAND.

Killing Processes

Every process has a PID (process ID). We use kill to send signals to a process.

kill 1234              # send SIGTERM (graceful shutdown — "please stop")
kill -9 1234           # send SIGKILL (force kill — "stop NOW, no cleanup")
kill -HUP 1234         # send SIGHUP (many daemons reload their config on it)
killall nginx          # kill all processes named "nginx"
pkill -f "node app"    # kill processes matching a pattern

The key difference: SIGTERM (15) lets the process clean up (close connections, save state). SIGKILL (9) is instant death — the process can’t catch or ignore it. Always try SIGTERM first.
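
Here is a small sketch of what "lets the process clean up" can look like from the process's side. The worker function and its message are made up for illustration; trap, kill, and wait are standard bash builtins.

```shell
#!/bin/bash
# worker is a hypothetical long-running task that cleans up on SIGTERM.
worker() {
    trap 'echo "caught SIGTERM, cleaning up"; exit 0' TERM
    while true; do
        sleep 0.1      # short sleeps so the trap runs promptly
    done
}

worker &               # start it in the background
pid=$!
sleep 0.5              # give it a moment to install the trap
kill "$pid"            # default signal is SIGTERM
wait "$pid"            # collect the worker's exit status
echo "worker exited with status $?"
```

Sending kill -9 instead would skip the trap entirely, since SIGKILL can't be caught.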

Foreground and Background

./long_task.sh         # runs in foreground (blocks terminal)
./long_task.sh &       # runs in background (terminal stays usable)
jobs                   # list background jobs in current shell
fg %1                  # bring job 1 back to foreground
bg %1                  # resume a stopped job in background
# Ctrl+Z              # pause (stop) a foreground process

nohup keeps a process running even after we close the terminal.

nohup ./deploy.sh &              # runs in background, survives logout
nohup ./deploy.sh > deploy.log 2>&1 &  # with custom log file

systemd — The Service Manager

Modern Linux uses systemd to manage services (daemons). Think of it as the boss that starts, stops, and babysits our services.

systemctl start nginx      # start a service
systemctl stop nginx       # stop a service
systemctl restart nginx    # stop + start
systemctl reload nginx     # reload config without downtime
systemctl status nginx     # check if it's running + recent logs
systemctl enable nginx     # start automatically on boot
systemctl disable nginx    # don't start on boot
systemctl list-units --type=service  # list all services

Checking Logs with journalctl

systemd captures logs from all services. journalctl is how we read them.

journalctl -u nginx              # logs for a specific service
journalctl -u nginx --since "1 hour ago"  # recent logs only
journalctl -u nginx -f           # follow logs in real-time (like tail -f)
journalctl -u nginx --no-pager   # dump all without paging

In simple language, processes are just programs doing their thing. We can watch them (ps, top), talk to them (kill signals), and let systemd babysit them so they stay running.


4

Shell Scripting Essentials

intermediate bash scripting shell automation

Shell scripting lets us automate repetitive tasks. Instead of typing 10 commands every time we deploy, we write a script once and run it forever. Every DevOps engineer writes bash scripts regularly.

The Shebang and Basics

Every script starts with a shebang — it tells the system which interpreter to use.

#!/bin/bash
# This is a comment
echo "Hello from the script!"

Make it executable and run it:

chmod +x deploy.sh
./deploy.sh

Variables

No spaces around =. That’s the #1 mistake beginners make.

#!/bin/bash
name="Manish"
port=3000
echo "Starting $name on port $port"
echo "Home directory is $HOME"   # environment variables work too

# Command substitution — capture a command's output
today=$(date +%Y-%m-%d)
echo "Today is $today"

Use "$variable" (with quotes) to avoid word-splitting issues. This is especially important when variables might contain spaces.
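
A quick way to see word-splitting in action. The count_args helper and the file name are made up for illustration.

```shell
#!/bin/bash
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() {
    echo "$#"
}

file="my report.txt"

count_args $file      # 2: unquoted, the value is split at the space
count_args "$file"    # 1: quoted, the value stays a single argument
```

The same thing happens with rm, cp, or any other command, which is why an unquoted variable holding a path with spaces can hit the wrong files.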

Conditionals

Bash uses if / then / elif / else / fi. The [[ ]] syntax is the modern way to write conditions.

#!/bin/bash
file="/etc/nginx/nginx.conf"

if [[ -f "$file" ]]; then
    echo "Config exists"
elif [[ -d "$file" ]]; then
    echo "It's a directory, not a file"
else
    echo "Config not found!"
    exit 1
fi

Common test flags:

  • -f file — file exists and is a regular file
  • -d dir — directory exists
  • -z "$var" — string is empty
  • -n "$var" — string is not empty
  • $a -eq $b — numeric equality
  • "$a" == "$b" — string equality

Loops

#!/bin/bash
# For loop — iterate over a list
for server in web1 web2 web3; do
    echo "Deploying to $server..."
    # ssh "$server" "cd /app && git pull"
done

# Loop over files
for file in /var/log/*.log; do
    echo "Processing $file"
done

# While loop — read a file line by line
while IFS= read -r line; do
    echo "Line: $line"
done < servers.txt

# C-style for loop
for ((i=1; i<=5; i++)); do
    echo "Attempt $i"
done

Functions

Functions help us organize scripts and avoid repeating code.

#!/bin/bash
log() {
    echo "[$(date '+%H:%M:%S')] $1"
}

check_service() {
    local service=$1   # local keeps the variable scoped to the function
    if systemctl is-active --quiet "$service"; then
        log "$service is running"
        return 0
    else
        log "$service is DOWN!"
        return 1
    fi
}

check_service "nginx"
check_service "postgresql"

Exit Codes and Error Handling

Every command returns an exit code. 0 means success, anything else means failure. We access it with $?.

#!/bin/bash
set -e          # exit immediately if any command fails
set -o pipefail # catch errors in pipes too
set -u          # treat unset variables as errors

# Combined — the holy trinity of safe scripts:
set -euo pipefail

# Check exit codes manually
if ! docker build -t myapp .; then
    echo "Build failed!"
    exit 1
fi
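
To see what $? and pipefail actually report, here is a minimal sketch using only the true and false commands. It deliberately leaves set -e off so every line runs.

```shell
#!/bin/bash
# $? holds the exit code of the most recent command.
true;  echo "true exited with $?"    # 0
false; echo "false exited with $?"   # 1

# Without pipefail, a pipeline reports only the LAST command's exit code.
false | true
echo "without pipefail: $?"          # 0

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
false | true
echo "with pipefail: $?"             # 1
set +o pipefail
```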

Argument Parsing

Scripts can accept arguments just like regular commands.

#!/bin/bash
# Usage: ./deploy.sh staging v1.2.3

# Safety check first, before we touch $1 and $2
if [[ $# -lt 2 ]]; then
    echo "Usage: $0 <environment> <version>"
    exit 1
fi

env=$1          # first argument
version=$2      # second argument

echo "Deploying $version to $env"
echo "Total arguments: $#"
echo "All arguments: $*"

Practical Example: Health Check Script

Here’s a real-world script that combines everything.

#!/bin/bash
set -euo pipefail

SERVICES=("nginx" "postgresql" "redis")
LOG_FILE="/var/log/health-check.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

for svc in "${SERVICES[@]}"; do
    if systemctl is-active --quiet "$svc"; then
        log "OK: $svc is running"
    else
        log "ALERT: $svc is down — attempting restart"
        systemctl restart "$svc"
    fi
done

In simple language, a bash script is just a text file full of commands we’d normally type by hand. Add some ifs and fors, and we’ve got automation.


5

Package Management and System Services

beginner linux apt yum systemctl cron

Linux has built-in package managers that handle installing, updating, and removing software. Think of them like an app store, but for the command line.

apt (Debian/Ubuntu) vs yum/dnf (RHEL/CentOS)

These are the two major package manager families. They do the same thing, just different commands.

# Debian/Ubuntu (apt)
sudo apt update                  # refresh package list (always do this first!)
sudo apt install nginx           # install a package
sudo apt remove nginx            # remove a package (keeps config files)
sudo apt purge nginx             # remove package AND config files
sudo apt upgrade                 # upgrade all installed packages
sudo apt autoremove              # clean up unused dependencies
apt search redis                 # search for a package
apt show nginx                   # show package details

# RHEL/CentOS/Fedora (yum or dnf)
sudo yum update                  # update all packages
sudo yum install nginx           # install
sudo yum remove nginx            # remove
yum search redis                 # search
yum info nginx                   # show details

# dnf is the newer replacement — same syntax as yum
sudo dnf install nginx

The key difference between apt remove and apt purge: remove keeps configuration files (so if we reinstall, our config is still there). Purge deletes everything.

Managing Services with systemctl

Once we install a service like nginx, we use systemctl to manage it.

sudo systemctl start nginx       # start now
sudo systemctl stop nginx        # stop now
sudo systemctl restart nginx     # stop + start
sudo systemctl reload nginx      # reload config (no downtime)
sudo systemctl status nginx      # check status and recent logs
sudo systemctl enable nginx      # auto-start on boot
sudo systemctl disable nginx     # don't auto-start on boot
sudo systemctl is-active nginx   # just check if running (for scripts)

A common pattern after installing something: enable it first (so it survives reboots), then start it.

sudo apt install nginx
sudo systemctl enable nginx
sudo systemctl start nginx

Reading Logs with journalctl

systemd captures all service logs. We use journalctl to read them.

journalctl -u nginx                       # all logs for nginx
journalctl -u nginx --since "30 min ago"  # recent logs
journalctl -u nginx -f                    # follow in real-time
journalctl -u nginx --no-pager -n 50      # last 50 lines, no pager
journalctl --disk-usage                   # how much space logs are using
sudo journalctl --vacuum-size=500M        # delete old archived logs until they total under 500MB

Cron — Scheduling Tasks

Cron runs commands on a schedule. We edit our cron jobs with crontab -e.

The cron syntax has 5 fields:

Cron Expression Format
*  *  *  *  *
│  │  │  │  └── weekday (0-6, Sun=0)
│  │  │  └───── month (1-12)
│  │  └──────── day (1-31)
│  └─────────── hour (0-23)
└────────────── minute (0-59)
* = every  |  */5 = every 5  |  1,15 = specific values  |  1-5 = range

Common Cron Patterns

crontab -e    # edit cron jobs for current user
crontab -l    # list current cron jobs

# Examples:
* * * * *     /path/to/script.sh     # every minute
*/5 * * * *   /path/to/script.sh     # every 5 minutes
0 * * * *     /path/to/script.sh     # every hour (at minute 0)
0 2 * * *     /path/to/backup.sh     # daily at 2:00 AM
0 0 * * 0     /path/to/weekly.sh     # every Sunday at midnight
0 9 1 * *     /path/to/monthly.sh    # 1st of every month at 9 AM
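
To make the field syntax concrete, here is a rough sketch of how a single cron field could be matched against a value. The function field_matches is hypothetical and is not how cron itself is implemented. It assumes plain decimal numbers without leading zeros, treats */n as "divisible by n" (which fits fields that start at 0, like minute and hour), and does not handle names like mon or combined forms like 1-10/2.

```shell
#!/bin/bash
# field_matches is a hypothetical helper: does one cron field match a value?
# Supports *, */n, a-b, and comma-separated lists of those.
field_matches() {
    local field=$1 value=$2 part parts
    IFS=',' read -ra parts <<< "$field"      # split "1,15" into pieces
    for part in "${parts[@]}"; do
        case $part in
            '*')   return 0 ;;                                   # every value
            '*/'*) (( value % ${part#*/} == 0 )) && return 0 ;;  # every n
            *-*)   (( value >= ${part%-*} && value <= ${part#*-} )) && return 0 ;;
            *)     (( value == ${part} )) && return 0 ;;         # exact value
        esac
    done
    return 1
}

minute=10
if field_matches '*/5' "$minute"; then
    echo "*/5 matches minute $minute"
fi
```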

Always redirect cron output to a log file so we can debug failures.

0 2 * * * /opt/backup.sh >> /var/log/backup.log 2>&1

The 2>&1 part redirects errors to the same log file as normal output.

Useful Tips

  • Use apt list --installed or yum list installed to see what’s installed
  • Use which nginx or command -v nginx to check if a binary is available
  • Use crontab.guru (website) to build and test cron expressions visually
  • System-wide cron jobs go in /etc/crontab or /etc/cron.d/

In simple language, package managers are how we install stuff, systemctl is how we keep it running, and cron is how we schedule it. These three tools cover most of our day-to-day Linux admin work.


Networking Essentials

6

OSI Model and TCP/IP

beginner networking osi tcp-ip layers

Networking can feel overwhelming until we realize it’s built in layers. Each layer handles one job and passes data to the next. The two models we need to know are OSI (theoretical, 7 layers) and TCP/IP (practical, 4 layers).

OSI vs TCP/IP Side by Side

OSI (7 Layers) vs TCP/IP (4 Layers)
OSI Model
7. Application (HTTP, DNS, FTP)
6. Presentation (TLS, JPEG, JSON)
5. Session (sockets, sessions)
4. Transport (TCP, UDP)
3. Network (IP, ICMP, routing)
2. Data Link (Ethernet, MAC)
1. Physical (cables, signals)
TCP/IP Model
Application (combines OSI layers 5-7)
Transport (same as OSI layer 4)
Internet (same as OSI layer 3)
Network Access (combines OSI layers 1-2)

In practice, we use the TCP/IP model. The OSI model is mostly for interviews and understanding concepts.

The 7 OSI Layers — With Analogies

Think of sending a letter. Each layer adds something to it:

  1. Physical — The road the mail truck drives on. Cables, Wi-Fi signals, electrical pulses.
  2. Data Link — The mail truck itself. Handles delivery between directly connected devices using MAC addresses. Ethernet lives here.
  3. Network — The postal routing system. Decides which path to take across networks using IP addresses. Routers operate here.
  4. Transport — The tracking number on our package. TCP ensures reliable delivery (with confirmation). UDP is like dropping a postcard in the mailbox — faster but no guarantee.
  5. Session — Opening and closing the conversation. Keeps track of “who’s talking to whom.”
  6. Presentation — Translation and formatting. Encryption (TLS), compression, character encoding.
  7. Application — The actual letter content. HTTP, DNS, SMTP, FTP — the protocols our apps talk.

How Data Flows — Encapsulation

When we send an HTTP request, data travels down the layers. Each layer wraps the data with its own header. This is called encapsulation.

Application:  [HTTP Data]
Transport:    [TCP Header][HTTP Data]            → called a "segment"
Network:      [IP Header][TCP Header][Data]      → called a "packet"
Data Link:    [Frame Header][IP][TCP][Data][FCS] → called a "frame"
Physical:     01101001011... (bits on the wire)

On the receiving end, each layer strips off its header and passes the data up. Like opening nested envelopes.

Which Protocols Live Where

Layer | Protocols | Devices
Application | HTTP, HTTPS, DNS, FTP, SSH, SMTP | -
Transport | TCP, UDP | -
Network | IP, ICMP, ARP | Routers
Data Link | Ethernet, Wi-Fi (802.11) | Switches
Physical | Cables, fiber, radio waves | Hubs, repeaters

Why This Matters for DevOps

When debugging network issues, layers help us isolate the problem:

  • Can’t reach the server at all? Probably Layer 3 (routing, IP).
  • Connection drops after opening? Layer 4 (TCP, firewall blocking ports).
  • Getting 502 errors? Layer 7 (application, reverse proxy misconfigured).

In simple language, the OSI model is just a way to organize how computers talk to each other. Each layer has one job, and they stack on top of each other like building blocks.


7

DNS and Domain Resolution

beginner dns networking records resolution

DNS (Domain Name System) is like the phone book of the internet. We type google.com, and DNS figures out the IP address (like 142.250.80.46) so our browser knows where to connect.

Without DNS, we’d have to memorize IP addresses for every website. Nobody wants that.

The Resolution Flow

When we type a URL in the browser, here’s what happens behind the scenes:

DNS Resolution Flow
1 Browser cache → already visited this site? ✓ done
2 OS cache → checked /etc/hosts and system cache ✓ done
3 Recursive Resolver → ISP or configured DNS (8.8.8.8, 1.1.1.1) ✓ if cached
4 Root Server → "I don't know, but ask the .com TLD server"
5 TLD Server (.com) → "Ask the authoritative server for google.com"
6 Authoritative Server → "google.com is 142.250.80.46" 🎯
Result gets cached at each level so the next lookup is faster

Most lookups never reach step 4 because caching is aggressive. The resolver already knows the answer from previous queries.

DNS Record Types

These are the building blocks of DNS configuration.

Record | Purpose | Example
A | Maps domain to IPv4 address | pman47.cc → 144.24.126.230
AAAA | Maps domain to IPv6 address | pman47.cc → 2001:db8::1
CNAME | Alias pointing to another domain | www.pman47.cc → pman47.cc
MX | Mail server for the domain | pman47.cc → mail.pman47.cc (priority 10)
TXT | Arbitrary text (SPF, DKIM, verification) | v=spf1 include:_spf.google.com ~all
NS | Which nameservers are authoritative | pman47.cc → ns1.hostinger.com

A CNAME can’t coexist with other records on the same name, and the root domain always carries NS and SOA records. So we can’t have a CNAME at the root domain (pman47.cc) — only on subdomains (www.pman47.cc). Some providers offer “ALIAS” or “ANAME” records to work around this.

TTL — Time to Live

Every DNS record has a TTL (in seconds) that tells caches how long to keep the answer.

  • TTL 300 → cache for 5 minutes (good for records we might change)
  • TTL 86400 → cache for 24 hours (good for stable records)
  • Before a migration, lower the TTL to 60-300 so changes propagate fast

DNS Lookup Commands

dig and nslookup are our go-to tools for DNS debugging.

# dig — the gold standard
dig pman47.cc                    # query A record
dig pman47.cc AAAA               # query IPv6 record
dig pman47.cc MX                 # query mail records
dig pman47.cc +short             # just the IP, no extra info
dig @8.8.8.8 pman47.cc          # query using Google's DNS

# nslookup — simpler alternative
nslookup pman47.cc
nslookup -type=MX pman47.cc

# Trace the full resolution path
dig pman47.cc +trace

# Ask for all record types (many servers refuse or trim ANY queries, so query specific types when in doubt)
dig pman47.cc ANY +short

Common DNS Gotchas

  • Propagation delay — DNS changes can take minutes to hours to spread worldwide because of caching. Lower the TTL before making changes.
  • CNAME at root — Most providers don’t allow it. Use an A record for the naked domain.
  • /etc/hosts — Local override file. The OS checks this before DNS. Useful for testing: 127.0.0.1 myapp.local.
  • DNS caching — If things look wrong after a change, flush the local cache: sudo resolvectl flush-caches (Linux with systemd-resolved; older releases use sudo systemd-resolve --flush-caches) or sudo dscacheutil -flushcache (macOS).
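
The /etc/hosts override is easy to picture with a small sketch. The lookup_hosts function below is a hypothetical stand-in for what the OS does with a hosts-format file, and it works on a throwaway file so the real /etc/hosts is never touched.

```shell
#!/bin/bash
# lookup_hosts is a hypothetical helper that resolves a name from a
# hosts-format file: IP first, then one or more names, # starts a comment.
lookup_hosts() {
    local hosts_file=$1 name=$2
    awk -v name="$name" '
        { sub(/#.*/, "") }                 # strip comments
        { for (i = 2; i <= NF; i++)        # check every name on the line
              if ($i == name) { print $1; exit } }
    ' "$hosts_file"
}

# Build a throwaway hosts file with made-up entries.
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
127.0.0.1   localhost
127.0.0.1   myapp.local   # local override for testing
10.0.0.5    db.internal db
EOF

lookup_hosts "$hosts" myapp.local   # 127.0.0.1
lookup_hosts "$hosts" db            # 10.0.0.5
rm -f "$hosts"
```

On a real Linux box, getent hosts myapp.local asks the system resolver, which consults /etc/hosts along the way, so it is the quicker check when we just want to confirm an override works.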

In simple language, DNS is just a giant distributed phone book. We give it a name, it gives us back a number. The whole system is designed to be fast (caching) and reliable (multiple levels of servers).


8

HTTP, HTTPS, and TLS

intermediate http https tls certificates

HTTP is how browsers and servers talk. HTTPS is the same thing but encrypted with TLS. Almost every web request we make uses one of these protocols.

HTTP Methods

Each method tells the server what we want to do.

Method | Purpose | Idempotent? | Has Body?
GET | Read/fetch data | Yes | No
POST | Create something new | No | Yes
PUT | Replace entirely | Yes | Yes
PATCH | Update partially | No | Yes
DELETE | Remove something | Yes | No

Idempotent means calling it 10 times has the same effect as calling it once. PUT /users/5 with the same data always sets the same state. POST /users creates a new user each time.

Status Code Families

1xx — Informational  (rarely seen: 100 Continue, 101 Switching Protocols)
2xx — Success        (the happy path)
3xx — Redirection    (go look somewhere else)
4xx — Client Error   (we messed up)
5xx — Server Error   (the server messed up)

The ones that come up constantly:

Code | Meaning | When We See It
200 | OK | Everything worked
201 | Created | POST succeeded, resource created
204 | No Content | DELETE succeeded, nothing to return
301 | Moved Permanently | URL changed, update bookmarks
302 | Found (temporary redirect) | Redirect but URL might come back
304 | Not Modified | Cached version is still fresh
400 | Bad Request | Malformed request (bad JSON, missing fields)
401 | Unauthorized | Not authenticated (need to log in)
403 | Forbidden | Authenticated but not allowed
404 | Not Found | Resource doesn’t exist
429 | Too Many Requests | Rate limited
500 | Internal Server Error | Server crashed
502 | Bad Gateway | Reverse proxy can’t reach the backend
503 | Service Unavailable | Server overloaded or in maintenance
504 | Gateway Timeout | Backend took too long to respond

The difference between 401 and 403: 401 means “who are you?” (not logged in). 403 means “I know who you are, but you can’t do this.”
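
In scripts, we usually only need the family, not the exact code. Here is a minimal sketch; status_family is a hypothetical helper name.

```shell
#!/bin/bash
# status_family is a hypothetical helper: map an HTTP status code to its family.
status_family() {
    case $1 in
        1[0-9][0-9]) echo "informational" ;;
        2[0-9][0-9]) echo "success" ;;
        3[0-9][0-9]) echo "redirection" ;;
        4[0-9][0-9]) echo "client error" ;;
        5[0-9][0-9]) echo "server error" ;;
        *)           echo "unknown"; return 1 ;;
    esac
}

status_family 200   # success
status_family 404   # client error
status_family 502   # server error

# With a real endpoint, curl can supply the code (not run here):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://example.com)
# status_family "$code"
```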

Key Headers

Content-Type: application/json        # what format the body is in
Authorization: Bearer eyJhbGci...     # auth token
Cache-Control: max-age=3600           # cache for 1 hour
Accept: application/json              # what format we want back
X-Request-Id: abc-123                 # tracking ID for debugging

HTTP/1.1 vs HTTP/2 vs HTTP/3

  • HTTP/1.1 — One request at a time per connection (keep-alive lets the connection be reused for the next request). Text-based. Still widely used.
  • HTTP/2 — Multiplexing (many requests over one connection), header compression, server push. Binary protocol. Much faster for websites with lots of assets.
  • HTTP/3 — Uses QUIC (built on UDP instead of TCP). Faster connection setup, better for mobile/lossy networks. Still rolling out.

The only difference we usually care about: HTTP/2 is way faster for loading web pages because it doesn’t wait for one request to finish before starting the next.

TLS — How HTTPS Works

HTTPS = HTTP + TLS encryption. TLS (Transport Layer Security) ensures nobody can eavesdrop or tamper with the data in transit.

TLS 1.3 Handshake (Simplified)
Client → Server:  ClientHello + key share
Server → Client:  ServerHello + cert + key share
Client → Server:  Finished (encrypted!)
TLS 1.3 needs only 1 round-trip (vs 2 in TLS 1.2)

Here’s what happens in plain English:

  1. Client says “hello, here are the encryption methods I support and my key share”
  2. Server picks a method, sends its certificate (proof of identity) and its key share
  3. Both sides now have a shared secret — all further traffic is encrypted

Certificates

A TLS certificate proves “this server really is google.com.” Certificates are issued by Certificate Authorities (CAs).

  • Let’s Encrypt — free, automated certificates (90-day validity, auto-renewed)
  • Caddy — a web server that handles Let’s Encrypt certificates automatically with zero config
  • Certificates contain: domain name, public key, issuer, expiration date

# Check a site's certificate
openssl s_client -connect pman47.cc:443 -servername pman47.cc </dev/null 2>/dev/null | openssl x509 -text -noout | head -20

# Quick expiry check
echo | openssl s_client -connect pman47.cc:443 2>/dev/null | openssl x509 -noout -dates
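
For monitoring, the expiry date is more useful as "days left". Here is a rough sketch, assuming GNU date (the -d flag; macOS/BSD date works differently). The helper name days_until and the fixed dates are made up so the example gives the same answer every time.

```shell
#!/bin/bash
# days_until is a hypothetical helper: whole days from "now" until a date.
# Assumes GNU date; the -d flag parses a human-readable date string.
days_until() {
    local expiry_epoch now_epoch
    expiry_epoch=$(date -d "$1" +%s) || return 1
    now_epoch=${2:-$(date +%s)}      # optional 2nd argument: "now" in epoch seconds
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Fixed "now" so the result is reproducible: 2030-01-01 to 2030-01-11.
fixed_now=$(date -u -d '2030-01-01 00:00:00' +%s)
days_until '2030-01-11 00:00:00 UTC' "$fixed_now"   # 10

# With a live certificate (not run here), feed it the notAfter value:
# expiry=$(echo | openssl s_client -connect pman47.cc:443 2>/dev/null \
#            | openssl x509 -noout -enddate | cut -d= -f2)
# days_until "$expiry"
```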

In simple language, HTTP is the language browsers and servers speak. TLS wraps that conversation in an encrypted envelope so nobody in the middle can read it. Together, they’re HTTPS — and that’s why we see the padlock icon in the browser.


9

TCP vs UDP

beginner tcp udp networking protocols

TCP and UDP are both transport layer protocols — they move data between applications. The key difference: TCP guarantees delivery (reliable but slower). UDP doesn’t guarantee anything (fast but lossy). Choosing between them is all about the tradeoff.

Side-by-Side Comparison

TCP (Transmission Control Protocol)
✓ Connection-oriented (handshake first)
✓ Guaranteed delivery
✓ Ordered packets
✓ Error checking + retransmission
✓ Flow control (won't overwhelm receiver)
Think: registered mail with tracking

UDP (User Datagram Protocol)
✗ Connectionless (just send it)
✗ No delivery guarantee
✗ No ordering
✗ No retransmission
✓ Much less overhead → faster
Think: dropping postcards in a mailbox

TCP Three-Way Handshake

Before any data flows over TCP, the client and server do a handshake to establish the connection. Think of it as a “are you there?” check.

Client → Server:  SYN         "Hey, I want to connect"
Server → Client:  SYN-ACK     "Got it, I'm ready too"
Client → Server:  ACK         "Great, let's go"

After these 3 packets, the connection is open and data starts flowing. This is why TCP has more latency than UDP — we pay an upfront cost before any real data moves.

To close the connection, there’s a four-way teardown (FIN → ACK → FIN → ACK).

How TCP Ensures Reliability

TCP does a lot of work behind the scenes:

  • Sequence numbers — every byte gets a number so the receiver can reorder packets
  • Acknowledgments (ACK) — the receiver confirms what it got
  • Retransmission — if an ACK doesn’t come back in time, TCP resends the data
  • Flow control — the receiver tells the sender “slow down, I’m full” using a window size
  • Congestion control — TCP detects network congestion and backs off automatically

UDP — The “Fire and Forget” Protocol

UDP just wraps data in a thin header and sends it. No handshake, no acknowledgment, no ordering.

The UDP header is only 8 bytes (vs TCP’s 20+ bytes). That’s why it’s faster — less overhead per packet.

Source Port | Destination Port | Length | Checksum | Data

That’s it. No sequence numbers, no flow control, no nothing.

When to Use Which

Use Case | Protocol | Why
Web browsing (HTTP/HTTPS) | TCP | Need reliable, ordered delivery
Database connections | TCP | Can’t afford to lose queries
Email (SMTP) | TCP | Messages must arrive complete
SSH | TCP | Every keystroke must arrive
Live video streaming | UDP | A dropped frame is fine, lag is not
Online gaming | UDP | Speed > perfect delivery
DNS queries | UDP | Small, fast, one-shot queries
Voice calls (VoIP) | UDP | Real-time, slight loss is OK
IoT sensors | UDP | Lightweight, frequent updates

The rule of thumb: if losing a packet would break things (web, email, files), use TCP. If speed matters more than perfection (video, gaming, real-time), use UDP.

Ports

Both TCP and UDP use ports to route traffic to the right application. A port is just a number from 0 to 65535.

# Common well-known ports
22 SSH (TCP)
53 DNS (UDP/TCP)
80 HTTP (TCP)
443 HTTPS (TCP)
3306 MySQL (TCP)
5432 PostgreSQL (TCP)
6379 Redis (TCP)

Ports 0-1023 are well-known ports (need root to bind). Ports 1024-49151 are registered ports. Ports 49152-65535 are ephemeral (assigned dynamically to clients).
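
The three ranges can be sketched as a small classifier. The function port_class is a hypothetical helper, handy as a sanity check in scripts that take a port as an argument.

```shell
#!/bin/bash
# port_class is a hypothetical helper: which range does a port fall into?
port_class() {
    local port=$1
    if (( port < 0 || port > 65535 )); then
        echo "invalid"
        return 1
    elif (( port <= 1023 )); then
        echo "well-known"
    elif (( port <= 49151 )); then
        echo "registered"
    else
        echo "ephemeral"
    fi
}

port_class 22      # well-known
port_class 5432    # registered
port_class 50000   # ephemeral
```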

# Check what's listening on which ports
ss -tlnp        # TCP listeners with process names
ss -ulnp        # UDP listeners with process names

HTTP/3 and QUIC

Here’s a fun fact: HTTP/3 runs over QUIC, which is built on UDP (not TCP). QUIC adds its own reliability and encryption on top of UDP, getting the speed benefits of UDP while still being reliable. It’s like building a better TCP from scratch using UDP as the foundation.

In simple language, TCP is the careful, reliable friend who double-checks everything. UDP is the fast friend who sends stuff and hopes for the best. We pick based on whether we need reliability or speed.


10

Load Balancing

intermediate load-balancer networking high-availability

When one server can’t handle all the traffic, we put multiple servers behind a load balancer. It distributes incoming requests across those servers so no single one gets overwhelmed. This gives us both scalability (handle more traffic) and high availability (if one server dies, others keep serving).

L4 vs L7 Load Balancing

Load balancers operate at different layers of the network stack. The two we care about are Layer 4 (transport) and Layer 7 (application).

L4: Transport Layer

  • Routes based on IP + port
  • Can't see HTTP headers or URLs
  • Just forwards TCP/UDP connections
  • Very fast, minimal processing
  • Good for: databases, non-HTTP traffic
  • Decision: "Send this TCP connection to server B"

L7: Application Layer

  • Routes based on URL, headers, cookies
  • Can inspect HTTP request content
  • Can rewrite URLs, add headers
  • Slower, since it needs to parse HTTP
  • Good for: web apps, API routing
  • Decision: "Send /api/* to backend, /static/* to CDN"

Most web applications use L7 load balancing because we want to route based on URL paths, headers, or cookies.

Load Balancing Algorithms

How does the load balancer decide which server gets the next request?

  • Round Robin — Takes turns: server 1, server 2, server 3, repeat. Simple and fair, but ignores server load.
  • Weighted Round Robin — Same but some servers get more turns. Server A (weight 3) gets 3x the traffic of server B (weight 1). Useful when servers have different capacities.
  • Least Connections — Sends to the server with the fewest active connections. Smart choice when requests take varying amounts of time.
  • IP Hash — Hashes the client’s IP to pick a server. Same client always goes to the same server. Good for simple session persistence.
  • Random — Pick a server at random. Surprisingly effective at scale.
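Each of these algorithms fits in a few lines. The Python sketch below is a simplified model, not how Nginx or HAProxy implement them: the server names are made up, and the weighted variant simply repeats a server in the cycle, whereas real balancers usually interleave weighted picks more smoothly.

```python
import hashlib
import itertools
import random


def round_robin(pool):
    """Yield servers in order, forever."""
    return itertools.cycle(pool)


def weighted_round_robin(weights):
    """Naive weighting: a server with weight 3 appears 3 times per cycle."""
    expanded = [s for s, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


def ip_hash(client_ip, pool):
    """Hash the client IP so the same client maps to the same server."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


def random_pick(pool):
    return random.choice(pool)


servers = ["server1", "server2", "server3"]  # hypothetical backends
rr = round_robin(servers)
print([next(rr) for _ in range(4)])
wrr = weighted_round_robin({"A": 3, "B": 1})
print([next(wrr) for _ in range(8)])
print(least_connections({"server1": 12, "server2": 3, "server3": 7}))
print(ip_hash("203.0.113.50", servers))
```

Note that `ip_hash` only keeps a client on the same server while the pool stays the same size; adding or removing a server reshuffles most clients.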

Nginx Load Balancer Config

Here’s what a basic L7 load balancer looks like with Nginx.

# nginx.conf
upstream backend {
    # Least connections algorithm
    least_conn;

    server 10.0.0.1:3000 weight=3;   # gets 3x traffic
    server 10.0.0.2:3000;             # weight=1 (default)
    server 10.0.0.3:3000 backup;      # only used if others are down
}

server {
    listen 80;
    server_name myapp.com;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Health Checks

A load balancer needs to know if a server is alive. It periodically sends requests to each server and removes unhealthy ones from the pool.

  • Active health checks: The LB pings each server (e.g., GET /health every 10s). If 3 checks fail, the server is marked down.
  • Passive health checks: The LB watches real traffic. If a server starts returning 5xx errors, it gets pulled out.

# A simple health endpoint in any app
# GET /health → 200 OK means "I'm alive"
# Returns: {"status": "ok", "uptime": 12345}
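The "3 failed checks and the server is out" rule is just a counter per server. Here is a rough Python sketch of that bookkeeping; the `HealthTracker` class, the threshold of 3, and the server names are illustrative assumptions, and the actual HTTP probe is left out.

```python
class HealthTracker:
    """Track consecutive failed checks per server (illustrative sketch)."""

    def __init__(self, servers, fail_threshold=3):
        self.fail_threshold = fail_threshold
        self.failures = {s: 0 for s in servers}

    def record(self, server, ok):
        # A successful check resets the counter; a failure increments it
        self.failures[server] = 0 if ok else self.failures[server] + 1

    def healthy(self):
        return [s for s, n in self.failures.items() if n < self.fail_threshold]


tracker = HealthTracker(["web1", "web2"])
for _ in range(3):
    tracker.record("web2", ok=False)  # three failed GET /health checks
print(tracker.healthy())
tracker.record("web2", ok=True)       # server recovers
print(tracker.healthy())
```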

Sticky Sessions

Sometimes we need the same user to always reach the same server (e.g., if session data is stored in server memory). This is called session affinity or sticky sessions.

Methods:

  • Cookie-based — The LB sets a cookie (like SERVERID=web2) and uses it for routing
  • IP-based — Route based on client IP (breaks with shared IPs / proxies)

The better solution is usually to avoid sticky sessions altogether by using a shared session store like Redis.

Common Load Balancing Tools

Tool | Type | Notes
Nginx | L4/L7 | Most popular for web, great L7 support
HAProxy | L4/L7 | Battle-tested, amazing performance
Caddy | L7 | Auto HTTPS, simple config
AWS ALB | L7 | Managed, integrates with AWS services
AWS NLB | L4 | Managed, ultra-low latency
GCP Load Balancer | L4/L7 | Global, anycast-based
Traefik | L7 | Auto-discovers containers, great for Docker/K8s

In simple language, a load balancer is like a traffic cop standing in front of our servers. It sends each car (request) down a different road (server) so no single road gets jammed. If a road is closed (server down), it redirects traffic to the open ones.


11

Networking Tools and Troubleshooting

intermediate curl netstat tcpdump debugging

Knowing the right debugging tools is half the battle when something goes wrong in production. Let’s go through the most important networking tools and how to use them when things break.

curl — The Swiss Army Knife

curl is the most versatile HTTP tool. We use it to test APIs, check headers, debug requests, and more.

# Basic GET request
curl https://api.example.com/users

# Common flags
curl -v https://example.com           # verbose — shows headers, TLS handshake, everything
curl -I https://example.com           # HEAD request — just headers, no body
curl -o file.zip https://example.com/file.zip  # save output to file
curl -L https://example.com           # follow redirects (3xx)
curl -s https://example.com           # silent mode (no progress bar)

# POST with JSON body
curl -X POST https://api.example.com/users \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token123" \
  -d '{"name": "Manish", "role": "admin"}'

# Check response time
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://example.com

Pro tip: -v (verbose) is our best friend when debugging. It shows the full request/response including TLS negotiation.

ping — Basic Connectivity Check

ping sends ICMP packets to check if a host is reachable and measure latency.

ping google.com              # continuous ping (Ctrl+C to stop)
ping -c 4 google.com         # send exactly 4 pings
ping -c 4 -W 2 10.0.0.5    # 2-second timeout per ping

If ping fails, it means either the host is down, there’s a network issue, or ICMP is blocked (many servers block ping for security).

traceroute — Map the Route

traceroute shows every hop (router) between us and the destination. Great for finding where packets get stuck.

traceroute google.com        # show all hops
traceroute -n google.com     # skip DNS lookups (faster)
# On some systems, use tracepath instead
tracepath google.com

If we see * * * at a specific hop, that router isn’t responding to our probes. If everything after a certain hop is * * *, that’s likely where the problem is.

ss / netstat — Check Open Ports and Connections

ss (socket statistics) replaced netstat on modern Linux. Same idea, faster output.

# List all listening TCP ports with process names
ss -tlnp

# List all listening UDP ports
ss -ulnp

# Show all established connections
ss -tnp

# Find what's using port 3000
ss -tlnp | grep 3000

# The older netstat equivalent (still works)
netstat -tlnp
netstat -anp | grep :80

The flags: -t = TCP, -u = UDP, -l = listening, -n = numeric (don’t resolve names), -p = show process.
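When we want the same "is anything listening on this port?" answer from a script, a TCP connect attempt does the job. This is a minimal sketch using Python's standard `socket` module; `is_port_open` is a hypothetical helper, it only handles IPv4, and port 3000 is just the example port from above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


print(is_port_open("127.0.0.1", 3000))
```

Unlike ss, this tells us nothing about which process owns the port, but it works from any machine that can reach the host.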

nslookup / dig — DNS Debugging

When DNS seems wrong, these tools help us verify what’s actually resolving.

# Quick DNS lookup
nslookup pman47.cc
dig pman47.cc +short

# Check specific record types
dig pman47.cc MX +short
dig pman47.cc TXT +short

# Use a specific DNS server
dig @8.8.8.8 pman47.cc       # Google DNS
dig @1.1.1.1 pman47.cc       # Cloudflare DNS

# Full trace to see the resolution chain
dig pman47.cc +trace

tcpdump — Packet Capture

tcpdump captures raw network packets. It’s the nuclear option for debugging — we can see exactly what’s going over the wire.

# Capture all traffic on eth0
sudo tcpdump -i eth0

# Capture only HTTP traffic (port 80)
sudo tcpdump -i any port 80

# Capture traffic to/from a specific host
sudo tcpdump -i any host 10.0.0.5

# Save capture to a file (open in Wireshark later)
sudo tcpdump -i any -w capture.pcap

# Print packet contents as ASCII with full timestamps
# (port 80 here: traffic on 443 is TLS-encrypted, so -A would only show ciphertext)
sudo tcpdump -i any port 80 -A -tttt

tcpdump is powerful but noisy. Always filter by port or host to keep the output manageable.

iptables — Firewall Basics

iptables controls the Linux kernel firewall. Many servers still use it directly or through a front end like ufw, while newer distributions are moving to its successor, nftables.

# List current rules
sudo iptables -L -n -v

# Allow incoming SSH
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT

# Allow incoming HTTP and HTTPS
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# Block a specific IP
sudo iptables -A INPUT -s 203.0.113.50 -j DROP

# UFW is the friendlier alternative (Ubuntu)
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw enable
sudo ufw status

Debugging Workflow: “The Website Is Down”

When someone says the site is down, here’s a systematic approach:

  1. Can we reach it at all?

    ping myapp.com
  2. Is DNS resolving correctly?

    dig myapp.com +short
  3. Is the port open and the service listening?

    ss -tlnp | grep :80
    curl -v http://myapp.com
  4. Is it a specific path or the whole site?

    curl -I https://myapp.com/
    curl -I https://myapp.com/api/health
  5. Check the service logs

    journalctl -u nginx --since "10 min ago"
    docker logs myapp --tail 50
  6. Check system resources

    top                  # CPU and memory
    df -h                # disk space (full disk = silent death)
    free -h              # memory
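The network half of this checklist (DNS, then TCP, then HTTP) can be scripted so we get the same answers every time. Below is a rough Python sketch using only the standard library; `check_site` and the shape of its result dict are assumptions for illustration, and it speaks plain HTTP only.

```python
import http.client
import socket


def check_site(host, port=80, path="/", timeout=3.0):
    """Run DNS, TCP, and HTTP checks in order; stop at the first failure."""
    result = {"dns": None, "tcp": False, "http_status": None}
    try:
        # Does the name resolve?
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns"] = infos[0][4][0]
        # Is something accepting TCP connections on the port?
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp"] = True
        # Does the HTTP endpoint answer, and with which status?
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("HEAD", path)
            result["http_status"] = conn.getresponse().status
        finally:
            conn.close()
    except (OSError, http.client.HTTPException) as exc:
        result["error"] = str(exc)
    return result
```

Calling `check_site("myapp.com")` returns a dict showing how far we got: a missing `dns` value points at name resolution, `tcp` being False points at the port or a firewall, and an unexpected `http_status` points at the application itself.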

In simple language, debugging network issues is like following a trail. We start at the beginning (can we even reach the server?) and follow the path until we find where things break. These tools let us check every step along the way.


Docker & Containers

12

Containers vs Virtual Machines

beginner containers vms namespaces cgroups

Before containers existed, the only way to isolate applications was with Virtual Machines. They work, but they’re heavy. Containers changed the game by being ridiculously lightweight. Let’s break down how each one works and when to pick which.

How Virtual Machines work

A VM runs a full operating system on top of a piece of software called a hypervisor. The hypervisor sits between the physical hardware and the VMs, slicing up CPU, memory, and storage for each one.

In simple language, a VM is like renting an entire apartment. We get our own kitchen, bathroom, walls — everything. Nobody shares anything with us, but it takes a lot of resources to maintain.

Each VM has its own kernel, its own system libraries, and its own binaries. That’s why a typical VM image is gigabytes in size and takes minutes to boot.

How Containers work

Containers share the host’s kernel. They don’t need their own OS. Instead, they use two Linux kernel features to create isolation:

  • Namespaces — give each container its own view of the system (its own process tree, network stack, mount points, user IDs). The container thinks it’s alone on the machine.
  • cgroups (control groups) — limit how much CPU, memory, and I/O a container can use. This stops one container from eating all the resources.

In simple language, a container is like renting a room in a co-living space. We get our own room (namespace), there’s a rule about how much fridge space we can use (cgroups), but we all share the same kitchen and building infrastructure (kernel).
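On Linux we can actually see which namespaces a process belongs to, because the kernel exposes them under /proc. The sketch below reads them with Python; `namespace_ids` is a hypothetical helper, and on systems without /proc it simply returns an empty dict.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_name: identifier} for a process, read from /proc.

    Linux-only: returns an empty dict where /proc/<pid>/ns is unavailable.
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        for name in sorted(os.listdir(ns_dir)):
            # Each entry is a symlink such as "pid:[4026531836]"
            ids[name] = os.readlink(os.path.join(ns_dir, name))
    except OSError:
        pass  # not Linux, or not permitted to inspect this process
    return ids


for name, ident in namespace_ids().items():
    print(f"{name:10} {ident}")
```

Run on the host and again inside a container, the identifiers for namespaces like pid, net, and mnt will typically differ, which is exactly the isolation described above.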

Virtual Machine (top to bottom): App A, App B → one Guest OS per app → Hypervisor → Hardware
Container (top to bottom): App A, App B → Container Runtime (Docker) → Host OS Kernel → Hardware

Why containers are fast

Since containers don’t boot an entire OS, they start in milliseconds. A typical container image is megabytes, not gigabytes. And because they share the host kernel, we can run dozens of containers on a machine that would struggle with 5 VMs.

# Start a container — takes less than a second
docker run -d --name my-app nginx:alpine

# Starting is nearly instant; stopping allows time for a graceful shutdown
docker stop my-app   # sends SIGTERM, then SIGKILL after a 10-second grace period
docker start my-app  # ~1 second

When to use each

Use containers when: we want fast deploys, consistent environments, microservices, CI/CD pipelines, or running many isolated apps on one host.

Use VMs when: we need full OS-level isolation (different kernels), running Windows on Linux, strict security boundaries (think multi-tenant cloud), or legacy apps that need a specific OS.

In practice, most modern workloads run in containers. VMs are still used underneath: cloud providers typically run our containers inside VMs for that extra security layer, so we get the speed of containers and the isolation of VMs.


13

Docker Images & Layers

beginner docker images layers registry

A Docker image is a read-only template that contains everything needed to run an application — the code, runtime, libraries, environment variables, and config files. When we run an image, it becomes a container (a running instance of that image).

Think of it like a class vs an object. The image is the class. The container is the object we create from it.

Image layers

Every Docker image is made up of stacked layers. Each instruction in a Dockerfile (like RUN, COPY, ADD) creates a new layer. These layers are read-only and stacked on top of each other.

When we run a container, Docker adds a thin writable layer on top. Any changes we make inside the container (editing files, writing logs) happen in this writable layer. The original image layers stay untouched.

Docker Image Layers
Writable Layer (container only)
Layer 4: CMD ["node", "app.js"]
Layer 3: COPY . /app
Layer 2: RUN npm install
Layer 1: FROM node:20-alpine (base image)
Each layer is read-only and identified by a SHA256 hash

Layer caching and sharing

This is where layers get really clever. Docker caches each layer. If we rebuild an image and a layer hasn’t changed, Docker reuses the cached version instead of rebuilding it. This makes builds fast.

Even better, different images can share layers. If we have 5 Node.js apps all using node:20-alpine as the base, that base layer is stored only once on disk. All 5 images point to the same layer.
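Layer sharing falls out naturally when layers are stored by the hash of their content. Here is a toy Python model of that idea; the `LayerStore` class and the byte strings standing in for layer contents are made up for illustration and are not Docker's actual storage format.

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: identical layers are kept only once."""

    def __init__(self):
        self.blobs = {}   # digest -> layer content
        self.images = {}  # image name -> ordered list of layer digests

    def add_image(self, name, layers):
        digests = []
        for content in layers:
            digest = "sha256:" + hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(digest, content)  # no-op if already stored
            digests.append(digest)
        self.images[name] = digests


store = LayerStore()
base = b"node:20-alpine filesystem"  # stand-in for the base layer
store.add_image("app-a", [base, b"app A code"])
store.add_image("app-b", [base, b"app B code"])
print(len(store.blobs))  # unique layers on disk
print(store.images["app-a"][0] == store.images["app-b"][0])
```

Two images with two layers each end up as three stored blobs, because both point at the same base layer digest.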

# See the layers of an image
docker history nginx:alpine

# See image size (shared layers don't count twice)
docker system df

# Inspect detailed layer info
docker inspect nginx:alpine

Pulling from registries

A registry is a storage service for Docker images. When we do docker pull nginx, Docker downloads the image from a registry.

Common registries:

  • Docker Hub (docker.io): the default, largest public registry
  • GitHub Container Registry (ghcr.io): tied to our GitHub repos
  • AWS ECR, Google Artifact Registry (formerly GCR), Azure ACR: cloud-specific registries

# Pull from Docker Hub (default)
docker pull nginx:alpine

# Pull from GitHub Container Registry
docker pull ghcr.io/pman47/gyaan:latest

# Push our own image to a registry
docker tag my-app:latest ghcr.io/username/my-app:latest
docker push ghcr.io/username/my-app:latest

Tags vs Digests

We reference images in two ways:

  • Tags are human-readable labels like nginx:alpine or node:20-slim. Tags are mutable: the owner can push a new image under the same tag, so node:20-alpine today might be different from node:20-alpine next month.
  • Digests are the SHA256 hash of the image manifest. They’re immutable and always point to the exact same image.

# Pull by tag (mutable, so the image can change)
docker pull nginx:1.27

# Pull by digest (immutable — always the exact same image)
docker pull nginx@sha256:abc123def456...

# See the digest of a pulled image
docker images --digests

For production deployments, using digests (or at least specific version tags like node:20.11.1-alpine) is safer than using generic tags like latest or node:20. We don’t want a surprise update breaking our app.


14

Dockerfile Best Practices

intermediate docker dockerfile multi-stage optimization

A Dockerfile is just a text file with instructions to build an image. Instructions that change the filesystem (RUN, COPY, ADD) each create a layer, while the rest only add metadata. The order we write these instructions matters a LOT for build speed and image size.

The basic instructions

Here’s what the most common instructions do:

  • FROM: sets the base image (every Dockerfile starts here)
  • WORKDIR: sets the working directory inside the container
  • COPY: copies files from our machine into the image
  • RUN: executes a command during build (install packages, compile code)
  • EXPOSE: documents which port the app listens on (doesn’t actually publish it)
  • ENV: sets environment variables
  • CMD: the default command when the container starts
  • ENTRYPOINT: like CMD, but harder to override

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]

CMD vs ENTRYPOINT

This trips people up. Both define what runs when the container starts, but they behave differently.

CMD — provides a default command that can be easily overridden by passing arguments to docker run.

ENTRYPOINT — sets the main executable. Arguments passed to docker run get appended to it.

In simple language, think of ENTRYPOINT as the verb and CMD as the default arguments. We usually combine them when we want a fixed command but flexible arguments.

# CMD only — can be fully overridden
CMD ["node", "server.js"]
# docker run my-app                  → node server.js
# docker run my-app node other.js    → node other.js (overridden)

# ENTRYPOINT + CMD — fixed command, default args
ENTRYPOINT ["node"]
CMD ["server.js"]
# docker run my-app                  → node server.js
# docker run my-app other.js         → node other.js (args appended)
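The rule behind those four cases is short enough to write down as a function. This Python sketch models only the exec form (the JSON-array syntax used above); `effective_command` is a hypothetical helper, not a Docker API.

```python
def effective_command(entrypoint, cmd, run_args):
    """Return the argv a container would start with (exec form only).

    entrypoint / cmd: lists from the Dockerfile (empty list if not set)
    run_args: extra arguments given after the image name in `docker run`
    """
    # Arguments to `docker run` replace CMD; ENTRYPOINT stays in front
    return list(entrypoint) + list(run_args if run_args else cmd)


# CMD only
print(effective_command([], ["node", "server.js"], []))
print(effective_command([], ["node", "server.js"], ["node", "other.js"]))
# ENTRYPOINT + CMD
print(effective_command(["node"], ["server.js"], []))
print(effective_command(["node"], ["server.js"], ["other.js"]))
```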

Layer caching — order matters

Docker caches layers from top to bottom. The moment a layer changes, everything below it is rebuilt. So we want to put things that change frequently at the bottom.

This is why we copy package.json first and run npm install before copying our source code. Dependencies don’t change often, but our code does. This way, npm install is cached on most builds.

# Good — dependencies cached separately from source code
COPY package*.json ./
RUN npm ci
COPY . .

# Bad — any code change invalidates npm install cache
COPY . .
RUN npm ci
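To see why the order matters, we can model the cache as a chain of keys where each step's key depends on the step before it. This is a simplified Python sketch, not Docker's real cache logic: `plan_build`, the fingerprint strings, and the two helper functions are assumptions for illustration.

```python
import hashlib


def plan_build(steps, previous_keys):
    """Return (actions, keys): per-step "cached"/"rebuilt" and new cache keys.

    steps: list of (instruction, input_fingerprint) tuples, top to bottom
    previous_keys: cache keys from the last build, in the same order
    """
    actions, keys = [], []
    parent = ""
    for i, (instruction, fingerprint) in enumerate(steps):
        # A step's key depends on its parent, so one miss invalidates the rest
        raw = f"{parent}|{instruction}|{fingerprint}".encode()
        key = hashlib.sha256(raw).hexdigest()
        hit = i < len(previous_keys) and previous_keys[i] == key
        actions.append("cached" if hit else "rebuilt")
        keys.append(key)
        parent = key
    return actions, keys


def good(pkg, src):
    return [("COPY package*.json ./", pkg), ("RUN npm ci", ""), ("COPY . .", src)]


def bad(pkg, src):
    return [("COPY . .", pkg + src), ("RUN npm ci", "")]


_, keys = plan_build(good("deps-v1", "code-v1"), [])
print(plan_build(good("deps-v1", "code-v2"), keys)[0])  # only source changed
_, keys = plan_build(bad("deps-v1", "code-v1"), [])
print(plan_build(bad("deps-v1", "code-v2"), keys)[0])
```

With the good ordering, a source-only change leaves the dependency install cached; with the bad ordering, the same change forces every step to rebuild.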

Multi-stage builds

Multi-stage builds let us use one image for building and a different (smaller) one for running. The build artifacts are copied between stages, but build tools don’t end up in the final image.

# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Production (much smaller image)
FROM nginx:1.27-alpine
COPY --from=builder /app/dist /usr/share/nginx/html
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

The final image only has nginx and our built files. Node.js, npm, and all dev dependencies are left behind in the builder stage.

Reducing image size

Every megabyte matters. Smaller images mean faster pulls, faster deploys, and less attack surface.

# Use alpine base images (bare alpine is ~5MB; node:20-alpine is a fraction of the Debian-based node:20)
FROM node:20-alpine

# Install packages in one RUN; --no-cache keeps the apk index out of the layer,
# so no separate cleanup step is needed
RUN apk add --no-cache git curl

# Use .dockerignore to exclude junk from COPY
# .dockerignore file:
# node_modules
# .git
# *.md
# .env

Security basics

Running as root inside a container is a bad idea. If an attacker breaks out, they’re root on the host too.

FROM node:20-alpine
WORKDIR /app
COPY --chown=node:node . .

# Switch to non-root user
USER node

CMD ["node", "server.js"]

Other security rules:

  • Never put secrets in the Dockerfile: no ENV API_KEY=abc123. Use runtime env vars or secret managers instead.
  • Use specific image tags: node:20.11.1-alpine, not node:latest.
  • Scan images: docker scout cves my-image to find vulnerabilities.

15

Docker Networking

intermediate docker networking bridge overlay

When we run multiple containers, they need to talk to each other. A web app needs to reach its database. A backend needs to hit Redis. Docker networking is how we connect them.

Network types

These are the four network drivers we’ll run into most often (Docker also ships others, such as macvlan and ipvlan):

  • bridge — the default. Containers get their own IP on an internal network. Most common for single-host setups.
  • host — removes network isolation. The container shares the host’s network directly. Fast, but no port isolation.
  • none — no networking at all. The container is completely isolated.
  • overlay — connects containers across multiple Docker hosts. Used in Docker Swarm and orchestration setups.

Default bridge vs custom bridge

When we run a container without specifying a network, it goes on the default bridge. This works, but it has a big limitation — containers can only reach each other by IP address, not by name.

Custom bridge networks solve this. They give us automatic DNS resolution — containers can find each other by name. This is what we almost always want.

Custom Bridge Network: "my-network"
  web-app   172.18.0.2   port 3000
  postgres  172.18.0.3   port 5432
  redis     172.18.0.4   port 6379
Containers resolve each other through Docker's DNS, so web-app can reach postgres by name: "postgres:5432"
Port mapping -p 8080:3000 exposes web-app on the Host Machine (localhost:8080)

# Create a custom bridge network
docker network create my-network

# Run containers on that network
docker run -d --name postgres --network my-network postgres:16-alpine
docker run -d --name web-app --network my-network -p 8080:3000 my-app

# web-app can now connect to postgres using the hostname "postgres"
# e.g., connection string: postgresql://user:pass@postgres:5432/mydb

Port mapping

Containers have their own network namespace. To make a container reachable from the host (or the internet), we map a host port to a container port using -p.

# Map host port 8080 → container port 3000
docker run -d -p 8080:3000 my-app

# Map to a specific interface (only localhost, not external)
docker run -d -p 127.0.0.1:8080:3000 my-app

# Map a random host port
docker run -d -p 3000 my-app
docker port <container-id>   # see which port was assigned
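The three -p forms above differ only in how many colon-separated parts they have. The sketch below parses them in Python; `parse_publish` is a hypothetical helper that ignores protocol suffixes such as /udp, port ranges, and IPv6 addresses, and it uses "0.0.0.0" as shorthand for "all interfaces".

```python
def parse_publish(spec):
    """Split a -p value into (host_ip, host_port, container_port).

    host_port is None when Docker should pick a random host port.
    """
    parts = spec.split(":")
    if len(parts) == 1:  # "3000"
        return ("0.0.0.0", None, int(parts[0]))
    if len(parts) == 2:  # "8080:3000"
        return ("0.0.0.0", int(parts[0]), int(parts[1]))
    if len(parts) == 3:  # "127.0.0.1:8080:3000"
        return (parts[0], int(parts[1]), int(parts[2]))
    raise ValueError(f"unsupported -p value: {spec}")


for spec in ("8080:3000", "127.0.0.1:8080:3000", "3000"):
    print(spec, parse_publish(spec))
```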

Container-to-container communication

On the same custom network, containers can talk to each other using their container name as the hostname. No port mapping needed — they communicate over the internal network directly.

On different networks, containers are completely isolated from each other. We can connect a container to multiple networks if it needs to talk to containers on different networks.

# Connect a running container to another network
docker network connect backend-network my-app

# Disconnect from a network
docker network disconnect frontend-network my-app

# List all networks
docker network ls

# Inspect a network (see which containers are on it)
docker network inspect my-network

When to use each type

  • Custom bridge — 90% of the time. Multi-container apps on a single host. Use this as the default.
  • Host — when we need maximum network performance and don’t care about port isolation. Common for monitoring tools.
  • None — for containers that should have zero network access (batch jobs processing local files).
  • Overlay — multi-host setups with Docker Swarm or when containers on different machines need to talk.

16

Docker Volumes & Storage

intermediate docker volumes storage persistence

Here’s the thing about containers — they’re ephemeral. When a container is removed, all data inside it is gone. That writable layer we talked about in the images doc? It lives and dies with the container.

So if we’re running a database in a container and the container crashes or gets replaced during a deploy, all our data vanishes. That’s where volumes come in.

Three types of storage

Docker gives us three ways to mount storage into a container:

  • Volumes: managed by Docker and stored in /var/lib/docker/volumes/. Best for production data.
  • Bind mounts: map any path on the host machine into the container. Best for development.
  • tmpfs: in-memory only and never written to disk. Best for sensitive temp data.

Volumes are completely managed by Docker. We don’t need to care about the exact path on the host. Docker handles it. They also work on both Linux and macOS/Windows.

# Create a named volume
docker volume create my-data

# Run a container with a volume mounted
docker run -d --name db \
  -v my-data:/var/lib/postgresql/data \
  postgres:16-alpine

# The data survives container removal
docker rm -f db
# my-data still exists — start a new container with it
docker run -d --name db-new \
  -v my-data:/var/lib/postgresql/data \
  postgres:16-alpine

# List all volumes
docker volume ls

# Inspect a volume
docker volume inspect my-data

Bind mounts — great for development

Bind mounts map a specific directory on our host machine into the container. This is perfect for development because we can edit code on our machine and see changes instantly inside the container.

# Mount current directory into /app and run the dev server from there
docker run -d --name dev-server \
  -v "$(pwd)":/app \
  -w /app \
  -p 3000:3000 \
  node:20-alpine npm run dev

# Changes to files on our machine show up immediately in the container

The only difference from volumes is that we specify a full path on the host instead of a volume name. If the path starts with / or ./, Docker treats it as a bind mount. If it’s just a name, it’s a volume.
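That naming rule can be written as a tiny classifier. This Python sketch follows the rule exactly as stated above; `parse_mount` is a hypothetical helper that assumes the `source:target[:options]` form and does not handle Windows drive-letter paths or anonymous volumes.

```python
def parse_mount(spec):
    """Classify a -v value as ("bind" | "volume", source, target)."""
    source, target = spec.split(":")[:2]  # a third part such as ":ro" is options
    kind = "bind" if source.startswith(("/", "./")) else "volume"
    return kind, source, target


print(parse_mount("my-data:/var/lib/postgresql/data"))
print(parse_mount("/home/me/app:/app:ro"))
print(parse_mount("./src:/app/src"))
```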

tmpfs — in-memory storage

tmpfs mounts store data in memory. Nothing is written to disk, and the data disappears when the container stops. Good for sensitive data like secrets or session tokens that we don’t want lingering on disk.

docker run -d --name secure-app \
  --tmpfs /app/tmp:rw,size=64m \
  my-app

Volumes in docker-compose

This is where we’ll use volumes the most. In a compose file, we define volumes at the top level and reference them in services.

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pg-data:/var/lib/postgresql/data   # named volume

  app:
    build: .
    volumes:
      - ./src:/app/src                      # bind mount for dev
      - node_modules:/app/node_modules      # named volume for deps

volumes:
  pg-data:         # Docker manages this
  node_modules:    # Keeps node_modules inside volume, not on host

Common patterns

Database persistence — always use a named volume for database data directories. PostgreSQL uses /var/lib/postgresql/data, MySQL uses /var/lib/mysql, MongoDB uses /data/db.

Preserving node_modules — a common trick in Node.js projects. We bind-mount our source code but keep node_modules in a named volume so it doesn’t conflict with the host’s node_modules.

Backup a volume — volumes don’t have a built-in backup command, but we can use a temporary container to tar the data.

# Backup a volume to a tar file in the current directory
docker run --rm \
  -v pg-data:/data \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/pg-data-backup.tar.gz -C /data .

17

Docker Compose

intermediate docker compose multi-container yaml

Running one container is easy. But real apps have multiple services — a web server, a database, a cache, maybe a worker process. Running each one manually with long docker run commands gets painful fast.

Docker Compose lets us define all our services in a single docker-compose.yml file and manage them together. One command to start everything. One command to stop everything.

The anatomy of a compose file

services:
  web:                              # service name (also the DNS hostname)
    build: .                        # build from Dockerfile in current dir
    ports:
      - "3000:3000"                 # host:container port mapping
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/myapp
    depends_on:
      - db
      - redis
    restart: unless-stopped

  db:
    image: postgres:16-alpine       # use a pre-built image
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: myapp
    volumes:
      - pg-data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  pg-data:                          # named volume for DB persistence

Notice how the web service connects to the database using db as the hostname. Compose automatically creates a custom bridge network and sets up DNS for all services. The service name is the hostname.

Key configuration options

build — tells Compose to build an image from a Dockerfile instead of pulling one.

services:
  app:
    build:
      context: .                    # build context directory
      dockerfile: Dockerfile.prod   # non-default Dockerfile name

depends_on — controls startup order. But it only waits for the container to start, not for the service inside to be ready. A database container starting doesn’t mean PostgreSQL is accepting connections yet.
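One common workaround is to have the app wait until the dependency's port actually accepts connections. Here is a minimal Python sketch of that idea; `wait_for_port` and its default timings are assumptions for illustration. Compose can also do this declaratively with a healthcheck plus `condition: service_healthy` under depends_on.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until host:port accepts TCP connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the web container we could call `wait_for_port("db", 5432)` before opening the database connection. An open port is still only a hint: a database can accept TCP connections slightly before it is ready for queries, so retrying the first query is a sensible extra safeguard.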

environment — two syntax options that do the same thing.

# List syntax
environment:
  - NODE_ENV=production
  - PORT=3000

# Map syntax
environment:
  NODE_ENV: production
  PORT: "3000"

env_file — loads environment variables from a file. Keeps secrets out of the compose file.

services:
  app:
    env_file:
      - .env

Essential commands

# Start all services (detached mode)
docker compose up -d

# Start and rebuild images
docker compose up -d --build

# Stop all services
docker compose down

# Stop and remove volumes too (careful — deletes data!)
docker compose down -v

# View logs
docker compose logs           # all services
docker compose logs -f web    # follow logs for one service

# Run a command in a running service
docker compose exec db psql -U user -d myapp

# Check status
docker compose ps

# Restart a single service
docker compose restart web

Profiles — optional services

Sometimes we want services that only run in certain situations. Profiles let us group services and start them selectively.

services:
  app:
    build: .
    ports:
      - "3000:3000"

  db:
    image: postgres:16-alpine
    volumes:
      - pg-data:/var/lib/postgresql/data

  adminer:
    image: adminer
    ports:
      - "8080:8080"
    profiles:
      - debug                       # only starts with --profile debug

  mailhog:
    image: mailhog/mailhog
    ports:
      - "8025:8025"
    profiles:
      - debug

# Normal start: adminer and mailhog won't run
docker compose up -d

# Start with debug services included
docker compose --profile debug up -d

A practical example

Here’s a compose file for a typical full-stack app — a Node.js API, PostgreSQL, and Redis. This is close to what real production setups look like.

services:
  api:
    build: .
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://app:secret@db:5432/myapp
      REDIS_URL: redis://cache:6379
      NODE_ENV: production
    depends_on:
      - db
      - cache
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    volumes:
      - pg-data:/var/lib/postgresql/data
    restart: unless-stopped

  cache:
    image: redis:7-alpine
    command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
    restart: unless-stopped

volumes:
  pg-data:

The api service uses db and cache as hostnames — Compose’s built-in DNS handles the rest. The database data is persisted in a named volume so it survives restarts and deploys.


18

Container Debugging & Commands

intermediate docker debugging cli troubleshooting

Things break in containers. A service won’t start, a container keeps restarting, the app throws weird errors. The good news is Docker gives us great tools to figure out what’s going on.

The essentials

These are the commands we’ll use every single day.

# List running containers
docker ps

# List ALL containers (including stopped/crashed ones)
docker ps -a

# View container logs
docker logs my-app              # all logs
docker logs -f my-app           # follow (like tail -f)
docker logs --tail 50 my-app    # last 50 lines
docker logs --since 5m my-app   # last 5 minutes

# Get a shell inside a running container
docker exec -it my-app sh       # alpine/minimal images
docker exec -it my-app bash     # debian/ubuntu images

# Run a one-off command in a container
docker exec my-app cat /etc/hosts

Inspecting a container

docker inspect is the Swiss Army knife. It dumps everything Docker knows about a container — network settings, mounts, environment variables, restart policy, health status, and more.

# Full JSON dump (very verbose)
docker inspect my-app

# Get specific fields using Go templates
docker inspect --format='{{.State.Status}}' my-app
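docker inspect --format='{{.NetworkSettings.IPAddress}}' my-app   # default bridge network only
docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' my-app   # any network, including Compose ones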
docker inspect --format='{{.State.ExitCode}}' my-app
docker inspect --format='{{json .Mounts}}' my-app | jq

Debugging a crashed container

When a container exits immediately or keeps restarting, here’s our debugging flow:

Container Crash Debugging Flow
1. docker ps -a → find the stopped container, check STATUS column
2. docker logs my-app → read error messages from stdout/stderr
3. docker inspect my-app → check ExitCode, OOMKilled, Error fields
4. Start interactively to debug → override the entrypoint with a shell

# Check why a container exited
docker inspect --format='{{.State.ExitCode}}' my-app
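# Exit code 0    = normal exit
# Exit code 1    = application error
# Exit code 137  = SIGKILL (OOM kill, docker kill, or docker stop after the grace period)
# Exit code 139  = segfault (SIGSEGV)
# Exit code 143  = SIGTERM (graceful docker stop)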

# Check if it was killed due to out-of-memory
docker inspect --format='{{.State.OOMKilled}}' my-app

# Start a crashed container's image with a shell to poke around
docker run -it --entrypoint sh my-app:latest
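
Exit codes above 128 usually mean the process was ended by a signal: the convention is 128 plus the signal number, so 137 is signal 9 (SIGKILL) and 139 is signal 11 (SIGSEGV). Here’s a minimal sketch of that rule. The name explain_exit_code is a hypothetical helper of ours, not a Docker command.

```shell
# explain_exit_code is a hypothetical helper, not part of the Docker CLI.
# It applies the "128 + signal number" convention to a container exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "exited with error code $code"
  fi
}

explain_exit_code 137   # terminated by signal 9  (SIGKILL)
explain_exit_code 139   # terminated by signal 11 (SIGSEGV)
explain_exit_code 1     # exited with error code 1
```

On a real host we could feed it the value from docker inspect --format='{{.State.ExitCode}}' my-app.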

Resource monitoring

# Live resource usage (CPU, memory, network I/O)
docker stats

# Resource usage for specific containers
docker stats my-app db redis

# See running processes inside a container
docker top my-app

# Copy files between host and container
docker cp my-app:/app/logs/error.log ./error.log
docker cp ./config.json my-app:/app/config.json

Common issues and fixes

Port conflict — “Bind for 0.0.0.0:3000 failed: port is already allocated”

# Find what's using that port
lsof -i :3000
# Either stop that process or use a different host port
docker run -p 3001:3000 my-app

Permission denied — often happens with volumes when the container runs as a non-root user but the volume files are owned by root.

# Fix: run the container with the host user's UID and GID so file ownership matches
docker run -u "$(id -u):$(id -g)" -v "$(pwd)":/app my-app

OOM Killed — container used more memory than its limit. Exit code 137.

# Check memory limit
docker inspect --format='{{.HostConfig.Memory}}' my-app

# Run with a higher memory limit
docker run -m 512m my-app
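
The HostConfig.Memory field comes back in bytes, and 0 means no limit was set, which is easy to misread. A small sketch for making the number readable follows. The name to_mib is a hypothetical helper, and the sample values are just illustrations.

```shell
# to_mib is a hypothetical helper: HostConfig.Memory is reported in bytes,
# and 0 means the container has no memory limit.
to_mib() {
  bytes=$1
  if [ "$bytes" -eq 0 ]; then
    echo "unlimited"
  else
    echo "$((bytes / 1024 / 1024))MiB"
  fi
}

to_mib 536870912   # 512MiB, the value "docker run -m 512m" produces
to_mib 0           # unlimited
```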

Cleanup commands

Docker doesn’t automatically clean up old stuff. Over time, unused images, stopped containers, and dangling volumes eat up disk space.

# Remove stopped containers
docker container prune

# Remove dangling images (untagged leftovers); add -a to remove every image no container uses
docker image prune

# Remove unused volumes (careful: might delete data!)
# On Docker 23 and later this only prunes anonymous volumes unless we pass -a
docker volume prune

# Nuclear option — remove everything unused
docker system prune -a --volumes

# Check disk usage
docker system df

A good habit: run docker system prune every few weeks on development machines. On production, be more careful — we probably don’t want to accidentally remove volumes with data.


Kubernetes

19

Kubernetes Architecture

intermediate kubernetes control-plane etcd architecture

Kubernetes (K8s) is a container orchestration platform. In simple language, when we have dozens or hundreds of containers to run, we can’t manually SSH into machines and start them. Kubernetes handles that for us — it decides where containers run, restarts them when they crash, scales them up or down, and manages networking between them.

Why We Need It

Running one container with docker run is easy. Running 50 containers across 10 machines? That’s where things get messy. We’d need to handle:

  • Which machine has enough CPU/memory?
  • What happens when a container crashes at 3 AM?
  • How do containers on different machines talk to each other?
  • How do we roll out updates without downtime?

Kubernetes solves all of this. We tell it what we want (desired state), and it figures out how to make it happen.

The Big Picture

A Kubernetes cluster has two parts: the control plane (the brain) and worker nodes (the muscles).

Kubernetes Cluster
Control Plane (Brain)
📡 API Server — front door for everything
💾 etcd — key-value store (cluster state)
📋 Scheduler — picks which node runs what
🔄 Controller Manager — watches & fixes drift
Worker Node 1
⚙️ kubelet — agent on each node
🌐 kube-proxy — networking rules
📦 Container Runtime (containerd)
Worker Node 2
⚙️ kubelet
🌐 kube-proxy
📦 Container Runtime (containerd)

Control Plane Components

API Server (kube-apiserver) — the front door to the cluster. Every command we run with kubectl, every internal component talking to the cluster — it all goes through the API server. It validates requests and writes state to etcd.

etcd is a distributed key-value store that holds the entire cluster state. Think of it like the cluster’s database. If etcd dies and we have no backup, we lose the cluster. That’s why production setups run etcd as a replicated cluster (typically 3 or 5 members, so a majority survives a node failure) and take regular snapshots.

Scheduler (kube-scheduler) — when we create a Pod, it doesn’t have a node yet. The scheduler looks at resource requirements, constraints, and available capacity, then picks the best node.

Controller Manager (kube-controller-manager) — runs a bunch of control loops that watch the cluster state and make corrections. If we say “I want 3 replicas” and one dies, the controller notices and creates a new one.

Worker Node Components

kubelet — an agent running on every worker node. It talks to the API server, receives Pod assignments, and tells the container runtime to start/stop containers. It also reports node health back to the control plane.

kube-proxy — handles networking rules on each node. When a Service gets created, kube-proxy sets up the rules so traffic can reach the right Pods. It uses iptables or IPVS under the hood.

Container Runtime — the software that actually runs containers. Kubernetes doesn’t run containers itself. It delegates to a runtime like containerd or CRI-O via the Container Runtime Interface (CRI). Docker was removed as a direct runtime in K8s 1.24.

How They All Talk

The flow goes like this: we run kubectl apply → API server validates and stores in etcd → scheduler assigns a node → kubelet on that node picks it up → container runtime starts the container. Every component watches the API server for changes rather than components calling each other directly. This “watch and react” pattern is what makes Kubernetes so resilient.
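
The “watch and react” idea boils down to a loop that compares desired state with observed state and prints or performs the correction. Here’s a toy sketch of one pass of such a loop. The name reconcile is a hypothetical stand-in, not how the real controllers are implemented.

```shell
# reconcile is a hypothetical stand-in for one pass of a control loop.
# It compares the desired replica count with the observed count
# and prints the correction a controller would make.
reconcile() {
  desired=$1
  actual=$2
  if [ "$actual" -lt "$desired" ]; then
    echo "create $((desired - actual))"
  elif [ "$actual" -gt "$desired" ]; then
    echo "delete $((actual - desired))"
  else
    echo "in sync"
  fi
}

reconcile 3 2   # create 1  (one replica died)
reconcile 3 5   # delete 2
reconcile 3 3   # in sync
```

The real controllers run this kind of comparison continuously, which is why a deleted Pod comes back without anyone asking.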

In simple language, the control plane is the brain that makes decisions, and worker nodes are the hands that do the actual work. We tell the brain what we want, and it coordinates the hands to make it happen.


20

Pods and Workloads

intermediate kubernetes pods deployments statefulsets

A Pod is the smallest thing we can deploy in Kubernetes. It wraps one or more containers that share the same network and storage. Most of the time, it’s just one container per Pod. Think of a Pod as a thin wrapper around our container that gives Kubernetes something to manage.

Pod Lifecycle

Every Pod goes through these phases:

  • Pending — accepted by the cluster, but containers aren’t running yet (maybe pulling images or waiting for scheduling)
  • Running — at least one container is running
  • Succeeded — all containers finished successfully (exit code 0)
  • Failed — all containers terminated, at least one failed
  • Unknown — can’t get the Pod’s status (usually a node communication issue)

# A basic Pod definition (we rarely write these directly)
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: nginx:1.27
      ports:
        - containerPort: 80

Multi-Container Patterns

Sometimes we need more than one container in a Pod. The containers share localhost and can share volumes.

Sidecar — a helper container that runs alongside our main app. Common examples: log collector, service mesh proxy (Envoy), or a config reloader.

Init Container — runs before the main container starts. Useful for setup tasks like waiting for a database to be ready or running migrations.

spec:
  initContainers:
    - name: wait-for-db
      image: busybox
      command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 2; done']
  containers:
    - name: app
      image: my-app:latest

Deployments — The Go-To Workload

We almost never create Pods directly. Instead, we use a Deployment, which manages Pods for us. It gives us:

  • Desired state: “I want 3 replicas running at all times”
  • Rolling updates: replaces Pods gradually (by default at most 25% extra and 25% unavailable at a time), which gives zero downtime as long as readiness probes are configured
  • Rollback: if something goes wrong, roll back to the previous version

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                    # keep 3 Pods running
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2
          ports:
            - containerPort: 8080

A Deployment creates a ReplicaSet under the hood, which is the thing that actually ensures the right number of Pods are running. We don’t manage ReplicaSets directly — the Deployment handles it.

# Common Deployment commands
kubectl rollout status deployment/my-app    # watch the rollout
kubectl rollout history deployment/my-app   # see revision history
kubectl rollout undo deployment/my-app      # rollback to previous version

StatefulSets — When Identity Matters

Deployments treat all Pods as interchangeable. But some workloads need stable identity — databases, message queues, etc. StatefulSets give each Pod:

  • A stable hostname (my-db-0, my-db-1, my-db-2)
  • Persistent storage that sticks with the Pod even if it restarts
  • Ordered startup/shutdown (Pod 0 starts first, then Pod 1, etc.)

We use these for things like PostgreSQL clusters, Kafka brokers, or Redis Sentinel.
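
The stable hostnames follow a simple rule: the StatefulSet name plus an ordinal starting at 0. Here’s a sketch of that naming scheme. The name statefulset_pod_names is a hypothetical helper, and my-db is the example name from the list above.

```shell
# statefulset_pod_names is a hypothetical helper that prints the Pod names
# a StatefulSet called $1 with $2 replicas would use: <name>-<ordinal>.
statefulset_pod_names() {
  name=$1
  replicas=$2
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "$name-$i"
    i=$((i + 1))
  done
}

statefulset_pod_names my-db 3
# my-db-0
# my-db-1
# my-db-2
```

If my-db-1 is deleted, its replacement is called my-db-1 again, which is what lets replicas find each other by name.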

DaemonSets — One Pod Per Node

A DaemonSet ensures that every node (or a subset) runs exactly one copy of a Pod. When a new node joins the cluster, the DaemonSet automatically schedules a Pod on it. Perfect for:

  • Log collectors (Fluentd, Filebeat)
  • Monitoring agents (Prometheus Node Exporter)
  • Network plugins (Calico, Cilium)

Jobs and CronJobs

Job — runs a Pod to completion and then stops. Great for batch tasks like data processing or database migrations.

CronJob — a Job on a schedule, just like a Linux cron. Runs at specified times.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"        # every day at 02:00 in the controller manager's time zone
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: backup-tool:latest
          restartPolicy: OnFailure
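
The schedule string uses the standard five cron fields. As a reading aid, here’s a sketch that labels each field. The name describe_cron is a hypothetical helper and does no validation.

```shell
# describe_cron is a hypothetical helper that labels the five cron fields.
describe_cron() {
  set -f          # stop the shell from expanding "*" into filenames
  set -- $1       # split the schedule on whitespace into $1..$5
  set +f
  echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
}

describe_cron "0 2 * * *"
# minute=0 hour=2 day-of-month=* month=* day-of-week=*
```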

In simple language, Pods are the atomic unit, but we almost always manage them through higher-level workloads. Deployments for stateless apps, StatefulSets for databases, DaemonSets for per-node agents, and Jobs for one-off tasks.


21

Services and Networking

intermediate kubernetes services ingress networking

Pods are ephemeral. They get created, destroyed, and rescheduled all the time. Every time a Pod is replaced, the new Pod gets a new IP address. So how do other Pods find and talk to our app? That’s what Services solve: they give us a stable endpoint that doesn’t change even when the Pods behind it do.

Service Types

Service Types — From Internal to External
ClusterIP
Internal only. Other Pods can reach it, but nothing outside the cluster can. Default type.
NodePort
Opens a port (30000–32767) on every node. External traffic hits <NodeIP>:<Port>. Good for dev/testing.
LoadBalancer
Provisions a cloud load balancer (AWS ELB, GCP LB). The real way to expose services in production on cloud.
ExternalName
Maps a Service to an external DNS name (like an RDS endpoint). No proxying, just a CNAME alias.

ClusterIP — The Default

Most Services are ClusterIP. They give us a virtual IP inside the cluster that load-balances traffic across matching Pods.

apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  type: ClusterIP              # default, can omit this line
  selector:
    app: my-app                # routes to Pods with this label
  ports:
    - port: 80                 # port the Service listens on
      targetPort: 8080         # port on the Pod

The selector is the glue. The Service watches for all Pods with the app: my-app label and routes traffic to the ones that are ready.

NodePort

Builds on ClusterIP but also opens a static port on every node in the cluster.

spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080          # optional, K8s picks one if omitted

Now we can hit http://<any-node-ip>:30080 from outside. Not great for production (ugly ports, no SSL), but handy for quick access.
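
If we pick nodePort ourselves, it has to fall inside the allowed range or the API server rejects the Service. Here’s a quick sketch of that check, assuming the default range of 30000 to 32767. The name valid_node_port is a hypothetical helper.

```shell
# valid_node_port is a hypothetical check against the default NodePort range.
# Clusters can change the range with the API server flag --service-node-port-range.
valid_node_port() {
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

valid_node_port 30080 && echo "30080 ok"
valid_node_port 8080 || echo "8080 is outside 30000-32767"
```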

LoadBalancer

The production choice on cloud providers. It creates a ClusterIP + NodePort + a cloud load balancer that routes external traffic in.

spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080

The cloud provider assigns an external IP. One downside: each LoadBalancer Service gets its own cloud LB, which can get expensive. That’s where Ingress comes in.

Service Discovery via DNS

Kubernetes runs a DNS server (CoreDNS) inside the cluster. Every Service gets a DNS entry automatically:

  • my-app-service — works within the same namespace
  • my-app-service.default.svc.cluster.local — fully qualified (namespace = default)

So our app code just uses the Service name as the hostname. No hardcoded IPs.

# From inside any Pod in the same namespace
curl http://my-app-service:80
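
The short name only resolves inside the same namespace. From another namespace we need at least service.namespace, and the full name follows a fixed pattern. Here’s a sketch that builds it, assuming the default cluster domain cluster.local. The name service_fqdn is a hypothetical helper.

```shell
# service_fqdn is a hypothetical helper. It assumes the default cluster
# domain, cluster.local, which a cluster administrator can change.
service_fqdn() {
  service=$1
  namespace=${2:-default}
  echo "$service.$namespace.svc.cluster.local"
}

service_fqdn my-app-service            # my-app-service.default.svc.cluster.local
service_fqdn api-service production    # api-service.production.svc.cluster.local
```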

Ingress — Smart HTTP Routing

Instead of giving every Service its own LoadBalancer, we can use a single Ingress controller to route traffic based on hostnames or URL paths.

An Ingress Controller (like nginx-ingress or Traefik) is the actual reverse proxy. An Ingress resource is the routing config.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80

This routes api.example.com to one Service and app.example.com to another — all through a single load balancer. Way cheaper and cleaner.

In simple language, Services give Pods a stable address, and Ingress lets us route external HTTP traffic to the right Service based on domain names or paths. For internal communication we use ClusterIP, for external access we use Ingress or LoadBalancer.


22

ConfigMaps and Secrets

intermediate kubernetes configmaps secrets configuration

Hardcoding config values inside a container image is a bad idea. Every time a database URL or feature flag changes, we’d need to rebuild the image. Kubernetes solves this with ConfigMaps (for non-sensitive config) and Secrets (for sensitive stuff like passwords and API keys).

ConfigMaps

A ConfigMap holds key-value pairs of configuration data. We can create them in a few ways.

# From literal values
kubectl create configmap app-config \
  --from-literal=APP_ENV=production \
  --from-literal=LOG_LEVEL=info

# From a file
kubectl create configmap nginx-config --from-file=nginx.conf

# From an env file
kubectl create configmap app-config --from-env-file=.env

Or define it in YAML:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  DATABASE_HOST: "db-service"

Using ConfigMaps in Pods

We have two options: environment variables or mounted files.

As environment variables:

spec:
  containers:
    - name: app
      image: my-app:latest
      envFrom:
        - configMapRef:
            name: app-config        # injects ALL keys as env vars
      env:
        - name: SPECIFIC_VAR       # or pick specific keys
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: LOG_LEVEL

As mounted files:

spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config    # each key becomes a file
  volumes:
    - name: config-volume
      configMap:
        name: app-config

With volume mounts, each key in the ConfigMap becomes a file in the mount path. So LOG_LEVEL becomes /etc/config/LOG_LEVEL with content info. The bonus? Volume-mounted ConfigMaps auto-update when the ConfigMap changes (with a small delay), except when a file is mounted with subPath. Environment variables don’t update; they need a Pod restart.
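
To get a feel for the file-per-key layout without a cluster, here’s a local simulation. A temp directory stands in for the /etc/config mount, and read_config is a hypothetical helper that an entrypoint script might use.

```shell
# A local simulation, not a real cluster: a temp directory stands in for
# the /etc/config mount, with one file per ConfigMap key.
config_dir=$(mktemp -d)
trap 'rm -rf "$config_dir"' EXIT
printf '%s' "production" > "$config_dir/APP_ENV"
printf '%s' "info"       > "$config_dir/LOG_LEVEL"

# read_config is a hypothetical helper: the app reads a key by reading a file.
read_config() {
  cat "$config_dir/$1"
}

echo "log level: $(read_config LOG_LEVEL)"   # log level: info
```

Because the app re-reads a file rather than an environment variable, it can pick up a changed value without restarting.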

Secrets

Secrets work almost identically to ConfigMaps but are meant for sensitive data. The values are base64-encoded (not encrypted!).

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  DB_PASSWORD: cGFzc3dvcmQxMjM=    # base64 of "password123"
  DB_USER: YWRtaW4=                 # base64 of "admin"

# Easier: create from the command line (handles encoding for us)
kubectl create secret generic db-credentials \
  --from-literal=DB_PASSWORD=password123 \
  --from-literal=DB_USER=admin

Using Secrets in Pods is the same pattern — secretRef instead of configMapRef:

envFrom:
  - secretRef:
      name: db-credentials

Important: Secrets Are NOT Encrypted by Default

This is a common interview gotcha. Kubernetes Secrets are only base64-encoded, which is not encryption. Anyone with kubectl get secret -o yaml access can decode them. To actually secure Secrets:

  • Enable encryption at rest in etcd (EncryptionConfiguration)
  • Use an external secret manager like AWS Secrets Manager or HashiCorp Vault, or encrypt manifests before committing them with Sealed Secrets
  • Limit access with RBAC (don’t let every developer read Secrets)
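
It’s worth seeing how little base64 protects. This round trip uses the same sample value as the manifest above and needs no key at all.

```shell
# Encoding what goes into the data: field of a Secret manifest.
# printf avoids the trailing newline that echo would add to the value.
encoded=$(printf '%s' 'password123' | base64)
echo "$encoded"                       # cGFzc3dvcmQxMjM=

# Decoding needs no key, which is why this is encoding and not encryption.
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"                       # password123
```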

When to Use Which

  • ConfigMap — non-sensitive config like log levels, feature flags, endpoint URLs, config files
  • Secret — passwords, tokens, TLS certificates, API keys

Immutable ConfigMaps and Secrets

Since Kubernetes 1.21, we can mark a ConfigMap or Secret as immutable: true. Once set, it can’t be changed — only deleted and recreated. This is useful because:

  • It protects against accidental updates
  • It improves cluster performance (the kubelet stops watching immutable objects, which takes load off the API server)

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2
data:
  APP_ENV: "production"
immutable: true

In simple language, ConfigMaps and Secrets let us keep config outside our container images. ConfigMaps for regular config, Secrets for sensitive data. Just remember that Secrets need extra steps to be truly secure.


23

Persistent Volumes and Storage

intermediate kubernetes pv pvc storage-classes

Container storage is ephemeral by default. When a Pod dies, everything inside the container’s filesystem dies with it. That’s fine for stateless apps, but terrible for databases. If our PostgreSQL Pod restarts, we don’t want to lose all our data. That’s where Persistent Volumes come in.

The Three Pieces

Kubernetes storage has three main objects, and they work together like a request system.

Storage Flow
StorageClass
the "how" — defines provisioner
PVC (Claim)
the "request" — I need 10Gi
PV (Volume)
the actual disk/storage
Pod
mounts and uses it

PersistentVolume (PV) — the actual storage resource. Think of it like a physical hard drive in the cluster. It can be an AWS EBS volume, a GCP Persistent Disk, an NFS share, or local storage on a node.

PersistentVolumeClaim (PVC) — a request for storage. Our Pod says “I need 10Gi of fast storage” and Kubernetes finds (or creates) a matching PV.

StorageClass — defines how storage gets provisioned. Instead of manually creating PVs, the StorageClass tells Kubernetes to dynamically create them when a PVC asks.

Static Provisioning

We manually create a PV, then claim it with a PVC.

# The actual storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
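spec:
  storageClassName: manual      # same class on PV and PVC so they bind to each other
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:                     # local storage (dev only!)
    path: /data/my-volume
---
# The request
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  storageClassName: manual      # without this, a default StorageClass may provision a new volume instead
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi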

Dynamic Provisioning — The Better Way

In production, we don’t want to manually create PVs. We define a StorageClass and let Kubernetes provision storage on-demand.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver
parameters:
  type: gp3                     # SSD storage type
reclaimPolicy: Delete           # delete PV when PVC is deleted
volumeBindingMode: WaitForFirstConsumer

Now our PVC just references the StorageClass:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-storage
spec:
  storageClassName: fast-ssd    # use our StorageClass
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

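Kubernetes creates a 20Gi EBS volume automatically, with no manual PV needed. Because the StorageClass uses volumeBindingMode: WaitForFirstConsumer, the volume is provisioned when the first Pod that uses this PVC is scheduled, so the disk lands in the same availability zone as the Pod.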

Using a PVC in a Pod

spec:
  containers:
    - name: postgres
      image: postgres:16
      volumeMounts:
        - name: db-data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: db-data
      persistentVolumeClaim:
        claimName: db-storage

Access Modes

  • ReadWriteOnce (RWO) — one node can mount it as read-write. Most common for databases.
  • ReadOnlyMany (ROX) — many nodes can mount it as read-only. Good for shared config or static assets.
  • ReadWriteMany (RWX) — many nodes can mount it as read-write. Needs special storage backends like NFS or EFS.

Most cloud block storage (EBS, Persistent Disk) only supports RWO. For RWX, we need network file systems.

Reclaim Policies

What happens to the PV when the PVC is deleted?

  • Delete — the PV and underlying storage are deleted. Default for dynamic provisioning.
  • Retain — the PV sticks around with its data. We have to manually clean it up. Safer for critical data.

Common Pattern: Database Storage

For databases, we typically use a StatefulSet with a volumeClaimTemplate. Each Pod gets its own PVC that persists across restarts.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
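spec:
  serviceName: postgres
  replicas: 1
  selector:                      # required, must match the template labels
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec: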
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # each replica gets its own PVC
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi

In simple language, PVCs are like ordering storage from a menu (StorageClass). We say what we need, and Kubernetes provisions the real storage (PV) behind the scenes. Our Pod just mounts the PVC and uses it like a normal directory.


24

Resource Management and Scaling

advanced kubernetes hpa resources autoscaling

If we don’t tell Kubernetes how much CPU and memory our Pods need, it’s flying blind. Pods could hog all the resources on a node, starve other workloads, or get killed randomly. Resource management is how we keep things predictable.

Requests vs Limits

Every container can have two resource settings:

  • Requests — the minimum guaranteed resources. The scheduler uses this to decide which node has enough room.
  • Limits — the maximum a container can use. It’s the ceiling.
spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        requests:
          cpu: "250m"           # 250 millicores = 0.25 CPU
          memory: "128Mi"       # 128 mebibytes
        limits:
          cpu: "500m"           # can burst up to 0.5 CPU
          memory: "256Mi"       # hard ceiling

A quick note on units: 1 CPU = 1 vCPU/core. 250m = 0.25 cores. Memory uses Mi (mebibytes) or Gi (gibibytes).
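
To make the units concrete, here’s a sketch that converts the quantities used above into plain numbers. The names to_millicores and to_bytes are hypothetical helpers, and they only handle the suffixes shown in this note.

```shell
# Hypothetical helpers for reading resource quantities.
# to_millicores handles "250m" and whole-number CPUs such as "2".
to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

# to_bytes handles the binary suffixes Mi and Gi only.
to_bytes() {
  case $1 in
    *Mi) echo "$(( ${1%Mi} * 1024 * 1024 ))" ;;
    *Gi) echo "$(( ${1%Gi} * 1024 * 1024 * 1024 ))" ;;
    *)   echo "$1" ;;
  esac
}

to_millicores 250m   # 250
to_millicores 2      # 2000
to_bytes 128Mi       # 134217728
to_bytes 1Gi         # 1073741824
```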

What Happens When Limits Are Exceeded

This is a common interview question, and the answer is different for CPU vs memory:

  • CPU limit exceeded: the container gets throttled. It won’t crash, but it’ll run slower. The kernel simply won’t give it more CPU time.
  • Memory limit exceeded: the container gets OOMKilled (Out Of Memory Killed). The kernel terminates it immediately, and the kubelet restarts it according to the Pod’s restartPolicy. This is harsh but necessary to protect the node.

CPU vs Memory: Over Limit Behavior
CPU Over Limit
Throttled (slowed down)
Pod stays alive, just slower
Memory Over Limit
OOMKilled (terminated)
Container is killed and restarted

QoS Classes

Kubernetes assigns a Quality of Service class to every Pod based on its resource settings. When a node runs out of memory, K8s uses QoS to decide which Pods to evict first.

  • Guaranteed — requests equal limits for all containers. Last to be evicted. Set this for critical workloads.
  • Burstable — requests are set but are lower than limits (or limits aren’t set). Evicted after BestEffort.
  • BestEffort — no requests or limits set at all. First to be evicted. Avoid this in production.
# Guaranteed — requests == limits
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

# Burstable — requests < limits
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
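
The QoS rules can be written down as a small decision function. This sketch covers a single-container Pod only and compares quantities as plain strings, so "500m" and "0.5" would not count as equal. The name qos_class is a hypothetical helper.

```shell
# qos_class is a hypothetical helper for a single-container Pod.
# Arguments: cpu_request memory_request cpu_limit memory_limit ("" = unset).
# Quantities are compared as strings, so write them the same way on both sides.
qos_class() {
  cpu_req=$1; mem_req=$2; cpu_lim=$3; mem_lim=$4
  # Nothing set at all: BestEffort.
  if [ -z "$cpu_req$mem_req$cpu_lim$mem_lim" ]; then
    echo "BestEffort"
    return
  fi
  # A missing request defaults to the limit, as Kubernetes does.
  cpu_req=${cpu_req:-$cpu_lim}
  mem_req=${mem_req:-$mem_lim}
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
     [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}

qos_class 500m 256Mi 500m 256Mi   # Guaranteed
qos_class 250m 128Mi 500m 256Mi   # Burstable
qos_class "" "" "" ""             # BestEffort
```

On a real cluster the assigned class shows up in the Pod’s status.qosClass field.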

LimitRanges and ResourceQuotas

Cluster admins use these to enforce guardrails.

LimitRange — sets default and max/min resources per container in a namespace. If a developer forgets to set requests, the LimitRange fills in defaults.

ResourceQuota — sets total resource caps per namespace. For example, “the dev namespace can’t use more than 20 CPU cores and 64Gi memory total.”

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "64Gi"
    limits.cpu: "40"
    limits.memory: "128Gi"
    pods: "50"                  # max 50 Pods in this namespace

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of Pod replicas based on metrics like CPU or memory usage. This is the most common autoscaling approach.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # target: average CPU at 70% of the Pods' requests

# Quick way to create an HPA
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70

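HPA checks metrics every 15 seconds by default. It scales up quickly but scales down slowly (5-minute stabilization window) to avoid flapping. It needs a metrics source such as metrics-server, and because utilization is measured against requests, the target Pods must set CPU requests.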
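
The Kubernetes docs give the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). Here’s a sketch of that formula with the min/max clamp added. The name hpa_desired_replicas is a hypothetical helper, and it leaves out the tolerance band and stabilization window of the real controller.

```shell
# hpa_desired_replicas is a hypothetical helper that applies the documented
# HPA formula: ceil(current_replicas * current_metric / target_metric),
# then clamps the result to minReplicas..maxReplicas.
# The real controller's tolerance band and stabilization window are left out.
hpa_desired_replicas() {
  current=$1; metric=$2; target=$3; min=$4; max=$5
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current * metric + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired=$min
  [ "$desired" -gt "$max" ] && desired=$max
  echo "$desired"
}

hpa_desired_replicas 4 90 70 2 10    # 6   (4 * 90 / 70 = 5.14, rounded up)
hpa_desired_replicas 4 35 70 2 10    # 2
hpa_desired_replicas 8 140 70 2 10   # 10  (16, capped at maxReplicas)
```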

Vertical Pod Autoscaler (VPA)

Instead of adding more Pods, VPA adjusts the CPU and memory requests of existing Pods. Useful when we don’t know the right resource values upfront — VPA watches actual usage and recommends (or automatically applies) better values.

The catch: VPA has to restart Pods to apply new resource values, so it’s often used in “recommend-only” mode where it suggests values and we apply them ourselves.

Cluster Autoscaler

Operates at the infrastructure level. When Pods can’t be scheduled because there aren’t enough nodes, the Cluster Autoscaler adds more nodes to the cluster. When nodes are underutilized, it removes them.

  • HPA scales Pods (horizontal)
  • VPA right-sizes Pods (vertical)
  • Cluster Autoscaler scales nodes (infrastructure)

They work together: HPA creates more Pods → Pods become unschedulable → Cluster Autoscaler adds nodes.

In simple language, requests tell the scheduler what we need, limits protect the node from greedy containers, and autoscalers keep everything right-sized based on actual traffic. Always set requests and limits in production — a Pod without them is a ticking time bomb.


25

RBAC and Security

advanced kubernetes rbac security network-policies

By default, Kubernetes is pretty open: Pods can talk to any other Pod, and clusters often start out with everyone sharing admin-level credentials. In production, that’s a security nightmare. We need to lock things down: who can do what (RBAC), how Pods are allowed to run (Pod Security Standards), and which Pods can talk to which (NetworkPolicies).

RBAC — Role-Based Access Control

RBAC answers the question: “Can this user/service do this action on this resource?” It has four key objects.

RBAC Model
Namespace-scoped
Role — what actions are allowed
RoleBinding — who gets the Role
Cluster-wide
ClusterRole — what actions are allowed (all namespaces)
ClusterRoleBinding — who gets the ClusterRole

The only difference between Role and ClusterRole is scope. A Role works within a single namespace. A ClusterRole works across the entire cluster (and can also cover non-namespaced resources like nodes).

Creating a Role and RoleBinding

Let’s say we want to give a developer read-only access to Pods in the dev namespace.

# Step 1: Define what's allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]             # core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Step 2: Bind it to a user or ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-pod-reader
  namespace: dev
subjects:
  - kind: User
    name: alice
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Now alice can read Pods in the dev namespace but nothing else. That’s the principle of least privilege — give only the minimum permissions needed.
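
RBAC is a pure allow-list: a request passes only if some rule matches both the resource and the verb. Here’s a toy model of the pod-reader Role to make that concrete. The name pod_reader_allows is a hypothetical helper, not how the API server evaluates rules internally.

```shell
# pod_reader_allows is a hypothetical model of the pod-reader Role:
# a request is allowed only if the resource AND the verb are both listed.
pod_reader_allows() {
  verb=$1; resource=$2
  case $resource in
    pods|pods/log) ;;          # listed resources, keep checking
    *) return 1 ;;             # anything else is denied
  esac
  case $verb in
    get|list|watch) return 0 ;;
    *) return 1 ;;
  esac
}

pod_reader_allows list pods    && echo "list pods: yes"
pod_reader_allows delete pods  || echo "delete pods: no"
pod_reader_allows get secrets  || echo "get secrets: no"
```

On a real cluster the equivalent check is kubectl auth can-i delete pods --as=alice -n dev.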

ServiceAccounts

Humans use user accounts. Pods use ServiceAccounts. Every namespace has a default ServiceAccount, but it has limited permissions. For Pods that need to talk to the Kubernetes API (like monitoring tools or CI runners), we create a dedicated ServiceAccount and bind a Role to it.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-access
subjects:
  - kind: ServiceAccount
    name: monitoring-sa
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: view                    # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io

# Check what a ServiceAccount can do
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:monitoring-sa

Pod Security Standards

We don’t want Pods running as root, mounting the host filesystem, or running in privileged mode. Kubernetes defines three security profiles:

  • Privileged: no restrictions at all (only for system-level Pods like CNI plugins)
  • Baseline: blocks the most dangerous settings (no privileged containers, no hostNetwork)
  • Restricted: the strictest. Must run as non-root, disallow privilege escalation, drop all capabilities, and use a seccomp profile

We enforce these at the namespace level using labels:

# Enforce restricted profile on the production namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted

SecurityContext

We can also harden individual Pods with a SecurityContext. This is where we set the fine-grained security settings.

spec:
  securityContext:
    runAsNonRoot: true            # Pod must not run as root
    runAsUser: 1000               # run as UID 1000
    fsGroup: 2000                 # group ownership for mounted volumes
  containers:
    - name: app
      image: my-app:latest
      securityContext:
        readOnlyRootFilesystem: true    # can't write to container FS
        allowPrivilegeEscalation: false # prevent gaining more privileges
        capabilities:
          drop: ["ALL"]           # drop all Linux capabilities

These settings should be our defaults for production Pods. If our app needs to write files, we mount a writable volume instead of making the entire root filesystem writable.

NetworkPolicies — Pod-Level Firewall

By default, every Pod can talk to every other Pod in the cluster. NetworkPolicies let us restrict that. Think of them like firewall rules.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api                   # applies to Pods with this label
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # only frontend Pods can reach API
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database      # API can only talk to database
      ports:
        - port: 5432

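This policy says: the api Pods can only receive traffic from frontend Pods on port 8080, and can only send traffic to database Pods on port 5432. Everything else is blocked, including DNS lookups, so in practice we usually add an egress rule that allows port 53 to the cluster DNS Pods.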
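To make the allow rules concrete, here is a toy Python model of how the policy above treats a connection. It only illustrates the semantics and is not how a CNI plugin enforces anything; the function names and label dictionaries are made up for this sketch.

```python
# Toy model of api-network-policy (illustrative only, not real enforcement).

def ingress_allowed(src_labels: dict, dst_port: int) -> bool:
    """Traffic INTO api Pods: only from app=frontend on port 8080."""
    return src_labels.get("app") == "frontend" and dst_port == 8080


def egress_allowed(dst_labels: dict, dst_port: int) -> bool:
    """Traffic OUT of api Pods: only to app=database on port 5432."""
    return dst_labels.get("app") == "database" and dst_port == 5432


print(ingress_allowed({"app": "frontend"}, 8080))    # True
print(ingress_allowed({"app": "batch"}, 8080))       # False
print(egress_allowed({"app": "database"}, 5432))     # True
print(egress_allowed({"k8s-app": "kube-dns"}, 53))   # False: DNS is not on the allow list
```

The last line shows why an egress policy this strict usually needs an extra rule for cluster DNS.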

One important thing: NetworkPolicies need a CNI plugin that supports them (Calico, Cilium, Weave). The default kubenet plugin doesn’t enforce them — the policies will be created but silently ignored.

A common starting pattern is a default deny rule that blocks all traffic, then we add specific allow rules:

# Default deny all ingress traffic in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}               # matches ALL Pods in namespace
  policyTypes:
    - Ingress                   # no ingress rules = deny all

In simple language, RBAC controls who can do what in the cluster, PodSecurity controls how containers are allowed to run, and NetworkPolicies control which Pods can talk to each other. Layer all three for defense in depth.


CI/CD & GitOps

26

CI/CD Fundamentals

beginner ci-cd continuous-integration continuous-delivery

Before CI/CD, developers would work on features for weeks, then spend days merging everyone’s code together and praying nothing broke. CI/CD fixes that by automating the boring stuff — building, testing, and deploying — so we can ship fast without losing sleep.

What Is CI (Continuous Integration)?

Every time someone pushes code, the system automatically builds and tests it. That’s CI.

The goal is simple: catch bugs early, when they’re cheap to fix. If we push broken code at 10am, we know by 10:02am instead of finding out three weeks later during a manual QA pass.

Key ideas:

  • Everyone merges to the main branch frequently (at least daily)
  • Every push triggers an automated build + test run
  • If tests fail, fixing the build is the top priority

What Is CD (Continuous Delivery vs Continuous Deployment)?

Here’s where people get confused — CD means two different things.

Continuous Delivery — code is always in a deployable state. After passing all tests, it’s ready to go to production, but a human clicks the “deploy” button.

Continuous Deployment — takes it one step further. Every change that passes the pipeline goes straight to production. No human in the loop.

In simple language, delivery means “we can deploy anytime,” and deployment means “we do deploy every time.”

The Pipeline Mental Model

Think of a CI/CD pipeline like an assembly line in a factory. Code moves through stages, and if it fails any stage, it stops.

CI/CD Pipeline Flow
📝 Code Push → 🔨 Build → 🧪 Test → 🔍 Scan → 🚀 Deploy
CI covers Build + Test  |  CD covers everything after

Why It Matters

  • Catch bugs early — a failing test at push time is way cheaper than a bug in production
  • Faster releases — no more “release weekends” with the whole team on standby
  • Confidence — if the pipeline is green, we know the code passed every check we automated
  • Smaller changes — frequent deploys mean smaller diffs, which are easier to review and debug

A Minimal Pipeline Example

# .github/workflows/ci.yml — simplest possible CI pipeline
name: CI
on: [push, pull_request]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4       # grab the code
      - uses: actions/setup-node@v4     # install Node.js
        with:
          node-version: 20
      - run: npm ci                     # install deps (clean)
      - run: npm run build              # build the project
      - run: npm test                   # run tests

This runs on every push and every PR. If any step fails, the whole job fails and we get notified. That’s CI in its simplest form — push code, get feedback fast.


27

Pipeline Design

intermediate ci-cd pipelines github-actions stages

A CI/CD pipeline is only useful if it’s fast, reliable, and catches real problems. A badly designed pipeline either takes 40 minutes (so everyone ignores it) or passes when it shouldn’t. Let’s build one that actually works.

Pipeline Stages

Most production pipelines follow this order:

Pipeline Stage Order
1. Lint        ← catch style issues in seconds (cheapest check)
2. Build       ← compile/transpile the code
3. Unit Test   ← fast isolated tests
4. Integration  ← tests that hit real databases/APIs
5. Security    ← dependency audit, SAST scan
6. Deploy      ← push to staging or production
Fast + cheap checks first → Slow + expensive checks last

The rule of thumb: put the fastest, cheapest checks first. If linting fails in 5 seconds, there’s no point waiting 10 minutes for integration tests.
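The fail-fast idea can be sketched in a few lines of Python. This is a minimal illustration, assuming each stage is a function that returns True on success; the stage names and the hard-coded results are hypothetical.

```python
from typing import Callable


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that actually ran.
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            print(f"{name} failed, skipping remaining stages")
            break
    return ran


# Hypothetical results: lint fails, so the slower stages never start.
stages = [
    ("lint", lambda: False),
    ("build", lambda: True),
    ("unit-test", lambda: True),
    ("integration", lambda: True),
]
executed = run_pipeline(stages)
print(executed)  # ['lint']
```

Because the cheapest check sits first, a style error costs seconds instead of a full integration run.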

Key Concepts

Artifacts — output from one stage that the next stage needs. For example, the build stage produces a compiled binary, and the deploy stage uses it. We don’t want to rebuild in every stage.

Caching — storing node_modules or .m2 directories between runs so we don’t re-download every dependency each time. This alone can cut pipeline time in half.

Parallel jobs — lint, unit tests, and security scans don’t depend on each other. Run them at the same time.
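Caching works because the cache is keyed on something that changes only when the dependencies change, typically the lockfile. Here is a small sketch of that idea; the key format and the sample lockfile contents are assumptions for illustration, not what any particular CI provider uses.

```python
import hashlib


def cache_key(lockfile_bytes: bytes, prefix: str = "npm") -> str:
    """Derive a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"


old = cache_key(b'{"lodash": "4.17.21"}')
same = cache_key(b'{"lodash": "4.17.21"}')
bumped = cache_key(b'{"lodash": "4.17.22"}')

print(old == same)    # True: unchanged lockfile, the cache is reused
print(old == bumped)  # False: a dependency bump invalidates the cache
```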

A Real GitHub Actions Workflow

name: CI/CD
on:
  push:
    branches: [main]
  pull_request:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'                    # cache npm's download cache, keyed on the lockfile
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint                           # only run if lint passes
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test

  deploy:
    runs-on: ubuntu-latest
    needs: test                           # only run if tests pass
    if: github.ref == 'refs/heads/main'   # only deploy from main
    steps:
      - uses: actions/checkout@v4
      - run: echo "Deploying to production..."
      # real deploy step goes here

Notice how needs creates the dependency chain: lint → test → deploy. The cache: 'npm' line saves us from re-downloading packages every run.

GitLab CI Equivalent (Quick Look)

# .gitlab-ci.yml
stages:
  - lint
  - test
  - deploy

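lint:
  stage: lint
  script:
    - npm ci --cache .npm --prefer-offline   # install deps before linting
    - npm run lint
  cache:
    paths:
      - .npm/                                # npm ci wipes node_modules, so cache npm's download cache

test:
  stage: test
  script:
    - npm ci --cache .npm --prefer-offline
    - npm test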

deploy:
  stage: deploy
  script: ./deploy.sh
  only:
    - main

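The main difference is syntax. GitLab uses stages and stage: keywords, while GitHub uses jobs and needs:. GitLab runs stages in the listed order by default, whereas GitHub jobs run in parallel unless needs: says otherwise. The mental model is the same.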

Pipeline Design Tips

  • Keep it under 10 minutes — if it’s longer, people stop waiting and merge without checking
  • Fail fast — put linting and type-checking first
  • Cache aggressively — dependencies, Docker layers, build artifacts
  • Use matrix builds to test across Node 18/20/22 or Python 3.10/3.12 in parallel
  • Don’t deploy on PR — only deploy when code lands on main
  • Store secrets in the CI provider (GitHub Secrets, GitLab Variables) — never in the repo

28

Deployment Strategies

intermediate deployment blue-green canary rolling

Deploying code is the scariest part of software development. We’ve all seen “we’re deploying, fingers crossed” in Slack. Good deployment strategies take the fear out of shipping by giving us safe ways to roll out changes — and safe ways to roll them back.

Rolling Update

The most common strategy. We replace instances one at a time (or a few at a time). At any point during the deploy, some instances run the old version and some run the new one.

Think of it like replacing tires on a moving bus — one at a time, so the bus never stops.

  • Zero downtime
  • Simple to set up (default in Kubernetes)
  • Rollback by continuing to roll — but with the old version

The catch: for a brief period, both versions are live. If v2 changes the API or database schema, v1 instances might break.
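A short simulation makes the mixed-version window visible. This is a sketch only: the fleet is a plain list of version strings and the batch size is an assumed parameter, whereas a real orchestrator also waits for health checks between batches.

```python
def rolling_update(instances: list[str], new_version: str, batch_size: int = 1):
    """Yield the fleet state after each batch is replaced with new_version."""
    fleet = list(instances)
    for start in range(0, len(fleet), batch_size):
        for i in range(start, min(start + batch_size, len(fleet))):
            fleet[i] = new_version
        yield list(fleet)


steps = list(rolling_update(["v1", "v1", "v1", "v1"], "v2", batch_size=2))
for state in steps:
    print(state)
# ['v2', 'v2', 'v1', 'v1']   <- both versions are serving traffic here
# ['v2', 'v2', 'v2', 'v2']
```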

Blue-Green Deployment

We run two identical environments — Blue (current) and Green (new). All traffic goes to Blue. We deploy to Green, test it, then flip the load balancer to point at Green.

Blue-Green Deployment
Before switch:
  Users → Load Balancer → Blue (v1) ✅ LIVE
                          Green (v2) 🧪 testing
After switch:
  Users → Load Balancer → Green (v2) ✅ LIVE
                          Blue (v1) standby
Rollback = flip the switch back to Blue. Instant.

The big win: rollback is instant — just switch traffic back. The downside: we need double the infrastructure (and cost).

Canary Deployment

We route a small percentage of traffic (say 5%) to the new version and watch the metrics. If error rates stay normal, we gradually increase to 100%.

Think of it like the “canary in a coal mine” — if the canary (small traffic slice) dies, we pull back before it affects everyone.

# Conceptual traffic split — usually handled by a service mesh or ingress
# Step 1: 5% to v2
# Step 2: watch error rates, latency, logs for 15 minutes
# Step 3: if healthy, bump to 25%, then 50%, then 100%
# Step 4: if something's wrong, route 100% back to v1

This is the safest strategy for large-scale systems. But it needs good observability (metrics, logs, alerts) to detect problems in that 5%.
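The promotion logic in the steps above can be sketched as a small decision function. The 1% error-rate threshold and the function name are assumptions for illustration; real rollouts usually look at several metrics, not just one.

```python
STEPS = [5, 25, 50, 100]  # percent of traffic sent to v2, matching the steps above


def next_weight(current: int, error_rate: float, max_error_rate: float = 0.01) -> int:
    """Return the next v2 traffic percentage, or 0 to send everything back to v1."""
    if error_rate > max_error_rate:
        return 0  # unhealthy: roll back
    later = [step for step in STEPS if step > current]
    return later[0] if later else current  # stay at 100% once fully promoted


print(next_weight(5, error_rate=0.002))   # 25
print(next_weight(25, error_rate=0.05))   # 0
print(next_weight(100, error_rate=0.0))   # 100
```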

Comparison At a Glance

Strategy Trade-offs
Rolling      — Simple, zero downtime, slow rollback
Blue-Green  — Instant rollback, double the cost
Canary      — Safest, needs good monitoring
Recreate    — Kill old, start new. Has downtime. Only for dev/staging.

Feature Flags: The Missing Piece

Feature flags let us deploy code without activating it. We ship the feature behind a flag, enable it for 1% of users, and watch. If it’s broken, we toggle the flag off — no redeploy needed.

# Conceptual — most teams use LaunchDarkly, Unleash, or a simple env var
import os

if os.environ.get("FEATURE_NEW_CHECKOUT") == "true":
    show_new_checkout()   # placeholder: the new code path
else:
    show_old_checkout()   # placeholder: the existing code path
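Enabling a flag for 1% of users needs a stable way to decide who is in that 1%. One common approach, sketched here under the assumption that users have a string ID, is to hash the flag name and user ID into a bucket from 0 to 99. The flag name and user IDs are hypothetical.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a 0-99 bucket and compare it to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


users = [f"user-{n}" for n in range(10_000)]
enabled = sum(flag_enabled("new-checkout", user, 1) for user in users)
print(f"{enabled} of {len(users)} users see the new checkout")
```

Because the bucket depends only on the flag and the user, the same user gets the same answer on every request, and raising the percentage only ever adds users.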

In simple language, deployment strategies are about one thing: how much risk are we comfortable with? Rolling is the default. Blue-green is for when we need instant rollback. Canary is for when we need proof the new version works before committing.


29

GitOps and ArgoCD

intermediate gitops argocd declarative sync

Traditional CI/CD works like this: pipeline builds the code, then pushes it to the cluster. GitOps flips that model — the cluster pulls the desired state from Git. It’s a subtle difference, but it changes everything.

GitOps Principles

There are four core ideas:

  1. Git is the single source of truth — everything (app manifests, infra config, even policies) lives in Git
  2. Declarative desired state — we describe what we want, not how to get there
  3. Automated reconciliation — an agent constantly compares the actual state with Git and fixes any drift
  4. Changes through PRs only — no one runs kubectl apply from their laptop. Every change goes through a pull request

In simple language, GitOps means: if it’s not in Git, it doesn’t exist. And if Git says “3 replicas,” the cluster will have 3 replicas — always.

Push vs Pull Model

Push vs Pull Deployment
Push (Traditional CI/CD):
  CI Pipeline → builds image → kubectl apply → Cluster
  Pipeline needs cluster credentials. Security risk.
Pull (GitOps):
  CI Pipeline → builds image → updates Git manifest
  ArgoCD (in cluster) → watches Git → syncs to Cluster
  Only ArgoCD needs cluster access. Pipeline never touches the cluster.

The pull model is more secure because our CI pipeline never needs cluster credentials. The agent inside the cluster does all the work.

ArgoCD Basics

ArgoCD is the most popular GitOps tool for Kubernetes. It runs inside our cluster and watches a Git repo for changes.

The core concept is the Application CRD — a custom resource that tells ArgoCD: “watch this Git repo, this path, and sync it to this namespace.”

# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: apps/my-app              # folder containing K8s manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:                     # auto-sync when Git changes
      prune: true                  # delete resources removed from Git
      selfHeal: true               # fix manual changes (drift)

Key sync policies:

  • Auto-sync — ArgoCD applies changes automatically when it detects a new commit
  • Prune — if we delete a manifest from Git, ArgoCD deletes the resource from the cluster
  • Self-heal — if someone manually edits a resource in the cluster, ArgoCD reverts it to match Git

Drift Detection

This is where GitOps really shines. If someone runs kubectl scale deployment my-app --replicas=1 directly against the cluster (maybe during a panic), ArgoCD detects that the cluster state doesn’t match Git and, with selfHeal enabled, automatically reverts it.
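At its core, reconciliation is a diff between desired and actual state. The sketch below models both as plain dictionaries; it is a simplification for illustration (ArgoCD's real diffing is far more involved), and the resource names are hypothetical.

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """Return the changes needed to make `actual` match `desired` (the Git state)."""
    changes = {}
    for name, spec in desired.items():
        if actual.get(name) != spec:
            changes[name] = spec          # create the resource or correct drift
    for name in actual.keys() - desired.keys():
        changes[name] = None              # prune: not in Git, so delete it
    return changes


git_state = {"my-app": {"replicas": 3}}
cluster_state = {"my-app": {"replicas": 1}, "debug-pod": {"replicas": 1}}

changes = reconcile(git_state, cluster_state)
print(changes)  # {'my-app': {'replicas': 3}, 'debug-pod': None}
```

Running the same function again after applying the changes returns an empty dict, which is what "in sync" means.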

No more “who changed this in production?” mysteries. Git history is the complete audit log.

Why GitOps Matters for Auditing

Every change is a Git commit. That means:

  • We know who changed what (commit author)
  • We know when it changed (commit timestamp)
  • We know why it changed (PR description)
  • We can revert any change by reverting the commit

For regulated industries (finance, healthcare), this audit trail is gold. No more “we think someone ran this command last Tuesday.”

The Typical GitOps Workflow

# 1. Developer changes app code, pushes to app repo
# 2. CI pipeline builds image, tags it (e.g., v1.2.3)
# 3. CI updates the image tag in the manifests repo
# 4. ArgoCD detects the change in the manifests repo
# 5. ArgoCD syncs the new image to the cluster
# 6. If something breaks, revert the manifest commit

In simple language, GitOps is version control for our entire infrastructure. We already trust Git with our code — GitOps extends that trust to deployments, config, and everything else running in production.


30

Artifact Management and Registries

intermediate registries artifacts versioning scanning

When our CI pipeline builds something — a Docker image, a JAR file, an npm package — that output is called an artifact. We need somewhere to store these artifacts, version them, and make sure they’re safe. That’s what registries and artifact management are about.

What Are Artifacts?

An artifact is the built output of our code. It’s the thing that actually runs in production.

  • Docker images — the most common artifact in modern DevOps
  • JARs / WARs — Java applications
  • npm packages — JavaScript libraries
  • Helm charts — Kubernetes package definitions
  • Binary executables — Go, Rust compiled binaries

The key distinction: source code is what we write, artifacts are what we deploy.

Container Registries

A container registry is like npm or PyPI, but for Docker images. We push images to it after building, and pull from it when deploying.

Popular Container Registries
Docker Hub     — the default, free for public images
GHCR         — GitHub Container Registry, great with GitHub Actions
ECR          — AWS Elastic Container Registry
GAR          — Google Artifact Registry (successor to the older Container Registry, GCR)
ACR          — Azure Container Registry
Harbor        — self-hosted, open source

# build and push to GHCR
docker build -t ghcr.io/myorg/myapp:v1.2.3 .
docker push ghcr.io/myorg/myapp:v1.2.3

# pull on the server
docker pull ghcr.io/myorg/myapp:v1.2.3

Image Tagging Strategies

How we tag images matters more than people think. A bad tagging strategy leads to “what’s actually running in production?” confusion.

Semantic versioning — v1.2.3. Clear, human-readable, great for releases.

Git SHA — abc123f. Ties the image directly to a commit. No ambiguity.

latest — always points to the most recent build. Convenient but dangerous — we never know exactly which version latest is.

# good practice: tag with both semver and git SHA
docker build \
  -t ghcr.io/myorg/myapp:v1.2.3 \
  -t ghcr.io/myorg/myapp:$(git rev-parse --short HEAD) \
  .

The rule: never use latest in production manifests. Always pin to a specific version or SHA. Using latest means our deployments aren’t reproducible.
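A rule like this is easy to check automatically. The helper below is a rough sketch of such a check, assuming image references in the usual registry/name:tag form; the function name is made up, and a production linter would parse references more carefully.

```python
def is_pinned(image: str) -> bool:
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    name = image.rsplit("/", 1)[-1]   # drop the registry part, which may contain host:port
    if ":" not in name:
        return False                  # no tag means an implicit 'latest'
    tag = name.rsplit(":", 1)[1]
    return tag != "latest"


for ref in [
    "ghcr.io/myorg/myapp:v1.2.3",
    "ghcr.io/myorg/myapp:latest",
    "ghcr.io/myorg/myapp",
    "localhost:5000/myapp",
]:
    print(ref, is_pinned(ref))
```

Only the first reference passes; the last one shows why the registry port must not be mistaken for a tag.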

Vulnerability Scanning

Every Docker image contains an OS layer, libraries, and our code. Any of those can have known vulnerabilities. Scanning tools check our images against CVE databases.

# scan with Trivy (free, open source)
trivy image ghcr.io/myorg/myapp:v1.2.3

# scan and fail if critical vulnerabilities found (great for CI)
trivy image --exit-code 1 --severity CRITICAL ghcr.io/myorg/myapp:v1.2.3

Many registries offer built-in scanning too: ECR can scan on push when that option is enabled, and Harbor ships with Trivy integration. For registries without it, we run the scanner as a pipeline step.

Image Signing

Signing proves that an image came from us and hasn’t been tampered with. Tools like Cosign (from Sigstore) make this easy.

# sign an image after pushing (key pair created with: cosign generate-key-pair)
cosign sign --key cosign.key ghcr.io/myorg/myapp:v1.2.3

# verify before deploying
cosign verify --key cosign.pub ghcr.io/myorg/myapp:v1.2.3

In simple language, artifact management is the supply chain of our software. We build it, store it safely, check it for problems, sign it to prove it’s legit, and then deploy it. Skipping any of these steps is how supply chain attacks happen.


Cloud & Infrastructure

31

Cloud Computing Models

beginner cloud iaas paas saas serverless

Cloud computing is just using someone else’s computers. But the amount of control (and responsibility) we get varies a lot depending on which model we pick. Let’s break down the four main models.

The Four Models

IaaS (Infrastructure as a Service) — the cloud gives us virtual machines, networking, and storage. We manage everything from the OS up. Think of it like renting an empty apartment — we bring our own furniture.

Examples: AWS EC2, Google Compute Engine, Azure VMs, DigitalOcean Droplets.

PaaS (Platform as a Service) — the cloud manages the OS, runtime, and scaling. We just deploy our app. Think of it like a co-working space — desk, wifi, and coffee are provided. We just bring our laptop.

Examples: Heroku, Google App Engine, AWS Elastic Beanstalk, Railway.

SaaS (Software as a Service) — we use the software. That’s it. No deploying, no managing, no thinking about servers.

Examples: Gmail, Slack, Notion, GitHub.

Serverless — we write functions. The cloud runs them when triggered, scales them automatically, and charges per invocation. We don’t think about servers at all.

Examples: AWS Lambda, Google Cloud Functions, Azure Functions, Cloudflare Workers.

What We Manage at Each Level

Responsibility Stack
Layer         IaaS       PaaS       Serverless   SaaS
Application   Us         Us         Us           Provider
Runtime       Us         Provider   Provider     Provider
OS            Us         Provider   Provider     Provider
Hardware      Provider   Provider   Provider     Provider
Us = we manage it  |  Provider = the cloud provider manages it

The higher up the stack we go, the less we manage — but the less control we have.

Shared Responsibility Model

This is a concept every cloud provider pushes hard. The cloud provider secures the cloud itself (physical servers, networking hardware, data centers). We secure what’s in the cloud (our code, our data, our IAM policies, our configurations).

If AWS’s data center catches fire — that’s their problem. If our S3 bucket is publicly readable — that’s ours.

Multi-Cloud vs Hybrid Cloud

Multi-cloud — using multiple cloud providers (e.g., AWS for compute, GCP for ML). Why? Avoid vendor lock-in, use best-of-breed services, or satisfy compliance requirements.

Hybrid cloud — mixing on-premises servers with cloud resources. Common in enterprises that can’t fully migrate due to regulation or legacy systems.

How to Pick

# Decision flow:
# Need full OS control or custom networking?    → IaaS
# Just want to deploy a web app fast?           → PaaS
# Running event-driven, short-lived tasks?      → Serverless
# Don't want to manage anything?                → SaaS

In simple language, cloud models are a spectrum from “we manage everything” (IaaS) to “we manage nothing” (SaaS). Most real-world architectures mix and match — an EC2 instance (IaaS) running our app, with RDS (managed PaaS-ish) for the database, Lambda (serverless) for background jobs, and Slack (SaaS) for alerts.


32

VPC and Network Architecture

intermediate cloud vpc subnets security-groups

When we spin up resources in the cloud, they don’t just float around in the open internet. They live inside a VPC (Virtual Private Cloud) — our own isolated network. Think of it like having our own private building inside a massive office complex. We control who gets in and who doesn’t.

What’s a VPC?

A VPC is a logically isolated section of the cloud where we launch resources. We define the IP range, create subnets, set up route tables, and control traffic flow.

Every cloud provider has this concept:

  • AWS — VPC
  • GCP — VPC Network
  • Azure — Virtual Network (VNet)

Subnets: Public vs Private

A subnet is a slice of our VPC’s IP range. We split our VPC into subnets to separate resources.

VPC Architecture
VPC (10.0.0.0/16)
 ├── Public Subnet (10.0.1.0/24)
 │   ├── Load Balancers
 │   ├── NAT Gateway
 │   └── Bastion Host
 └── Private Subnet (10.0.2.0/24)
     ├── App Servers
     ├── Databases
     └── Internal Services
Internet ↔ Internet Gateway ↔ Public Subnet ↔ Private Subnet

Public subnet — has a route to the internet via an Internet Gateway. Resources here can have public IPs. We put load balancers and bastion hosts here.

Private subnet — no direct internet access. Databases and app servers live here. They can reach the internet outbound through a NAT Gateway (sitting in the public subnet), but nobody from the internet can reach them directly.

Key Networking Components

Internet Gateway (IGW) — the door between our VPC and the public internet. Attach it to the VPC, add a route, and public subnets can talk to the world.

NAT Gateway — lets private subnet resources access the internet (for updates, API calls) without exposing them. Traffic goes out but can’t come in.

Route Table — a set of rules that determine where network traffic goes. Each subnet is associated with a route table.

# Public subnet route table (conceptual)
Destination     Target
10.0.0.0/16     local          # traffic within VPC stays local
0.0.0.0/0       igw-123abc     # everything else goes to internet

# Private subnet route table
Destination     Target
10.0.0.0/16     local          # traffic within VPC stays local
0.0.0.0/0       nat-456def     # outbound internet goes through NAT
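Route tables are evaluated by picking the most specific matching route. Here is a minimal sketch of that lookup using Python's standard ipaddress module, with the two tables above copied in as dictionaries; the function name is an assumption for illustration.

```python
import ipaddress

PUBLIC_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-123abc"}
PRIVATE_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-456def"}


def next_hop(routes: dict, destination: str) -> str:
    """Pick the most specific (longest-prefix) route that contains the destination."""
    ip = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(cidr) for cidr in routes]
    matches = [net for net in networks if ip in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[str(best)]


print(next_hop(PUBLIC_ROUTES, "10.0.2.15"))   # local
print(next_hop(PUBLIC_ROUTES, "8.8.8.8"))     # igw-123abc
print(next_hop(PRIVATE_ROUTES, "8.8.8.8"))    # nat-456def
```

An address inside the VPC matches both routes, but the /16 wins over the /0 catch-all, which is why internal traffic never leaves through the gateway.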

Security Groups vs NACLs

These are our two layers of firewall.

Security Groups — act at the instance level (attached to an EC2, RDS, etc.). They’re stateful — if we allow traffic in, the response is automatically allowed out.

# Security group for a web server
Inbound:
  Port 80 (HTTP)    from 0.0.0.0/0     # allow web traffic
  Port 443 (HTTPS)  from 0.0.0.0/0     # allow secure web traffic
  Port 22 (SSH)     from 10.0.1.0/24   # SSH only from public subnet

Outbound:
  All traffic       to 0.0.0.0/0       # allow all outbound

NACLs (Network Access Control Lists) — act at the subnet level. They’re stateless — we must explicitly allow both inbound and outbound traffic. Think of NACLs as the building’s front gate, and security groups as each apartment’s door lock.

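The main differences between them: security groups are stateful (they remember connections) and only contain allow rules, while NACLs are stateless (they check every packet independently) and support both allow and deny rules evaluated in rule-number order.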
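The stateful versus stateless distinction is easier to see in a toy model. The two classes below are invented for this sketch and only track ports; real security groups and NACLs also match on protocol and CIDR ranges.

```python
class SecurityGroup:
    """Stateful: remembers allowed inbound connections, so replies pass automatically."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.connections = set()

    def allow_inbound(self, client_ip, client_port, server_port):
        if server_port not in self.inbound_ports:
            return False
        self.connections.add((client_ip, client_port, server_port))
        return True

    def allow_reply(self, client_ip, client_port, server_port):
        return (client_ip, client_port, server_port) in self.connections


class Nacl:
    """Stateless: inbound and outbound rules are checked separately for every packet."""

    def __init__(self, inbound_ports, outbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.outbound_ports = set(outbound_ports)

    def allow_inbound(self, client_ip, client_port, server_port):
        return server_port in self.inbound_ports

    def allow_reply(self, client_ip, client_port, server_port):
        return client_port in self.outbound_ports  # the reply targets the client's port


conn = ("203.0.113.7", 51000, 443)  # client IP, client (ephemeral) port, server port
sg = SecurityGroup(inbound_ports=[443])
nacl = Nacl(inbound_ports=[443], outbound_ports=[])

print(sg.allow_inbound(*conn), sg.allow_reply(*conn))      # True True
print(nacl.allow_inbound(*conn), nacl.allow_reply(*conn))  # True False

nacl_fixed = Nacl(inbound_ports=[443], outbound_ports=range(1024, 65536))
print(nacl_fixed.allow_reply(*conn))                       # True
```

The NACL only lets the reply out once the ephemeral port range is allowed outbound, which is the rule people most often forget.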

VPC Peering

Sometimes we need two VPCs to talk to each other — maybe our staging VPC needs to reach a shared database VPC. VPC peering creates a private network connection between them. Traffic stays on the cloud provider’s backbone and never touches the public internet.

In simple language, a VPC is our private network in the cloud. Public subnets face the internet, private subnets hide behind NAT. Security groups and NACLs are our bouncers. Once we understand this, every cloud architecture diagram starts making sense.


33

IAM and Access Management

intermediate cloud iam roles policies security

IAM (Identity and Access Management) controls who can do what on which resources. It’s the lock and key system of the cloud. Get it wrong, and we’re one misconfigured policy away from a data breach. Get it right, and even a compromised credential can’t do much damage.

Core Concepts

IAM Building Blocks
User       — a person (developer, admin). Has a username + credentials.
Group      — a collection of users. Attach policies to the group, not each user.
Role       — an identity that a trusted user or service can assume temporarily.
Policy     — a JSON document that defines permissions (allow/deny actions).
Principal  — whoever is making the request (user, role, or service).

In simple language: Users are people, Roles are hats anyone can wear temporarily, Policies are the rulebook, and Groups are teams that share the same rulebook.

Principle of Least Privilege

This is the golden rule of IAM: give the minimum permissions needed to do the job, and nothing more.

A developer who deploys containers doesn’t need permission to delete the VPC. A Lambda function that reads from S3 doesn’t need permission to write to DynamoDB. Every extra permission is an extra attack surface.

Policy Structure

A policy is just a JSON document with three main fields:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    }
  ]
}

  • Effect — Allow or Deny (an explicit Deny always wins if there’s a conflict)
  • Action — what operations are permitted (s3:GetObject, ec2:StartInstances)
  • Resource — which specific resources the policy applies to (ARNs)

The tighter we scope the Action and Resource, the safer we are.
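The evaluation rules (deny by default, explicit Deny wins) can be sketched in a few lines. This is a deliberately simplified model that only handles Effect, Action, and Resource with * wildcards; real IAM evaluation also considers conditions, principals, and several other policy types. The function name is an assumption.

```python
from fnmatch import fnmatchcase


def is_allowed(statements: list[dict], action: str, resource: str) -> bool:
    """Default deny; any matching Deny statement beats every Allow."""
    allowed = False
    for stmt in statements:
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        matches = (any(fnmatchcase(action, a) for a in actions)
                   and any(fnmatchcase(resource, r) for r in resources))
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


policy = [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::my-app-bucket", "arn:aws:s3:::my-app-bucket/*"],
}]

print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::my-app-bucket/logo.png"))  # True
print(is_allowed(policy, "s3:PutObject", "arn:aws:s3:::my-app-bucket/logo.png"))  # False
```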

The Assume Role Pattern

Instead of giving applications long-lived access keys, we use roles. An EC2 instance or Lambda function “assumes” a role and gets temporary credentials that expire automatically.

# instead of this (bad — long-lived keys in env vars):
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=wJalr...

# we do this (good — EC2 instance profile with a role):
# 1. Create a role with the needed permissions
# 2. Attach the role to the EC2 instance
# 3. The AWS SDK automatically uses the instance's role credentials
# No keys in code or environment. Credentials rotate automatically.

This is how it works across providers:

  • AWS — Instance Profiles, Lambda execution roles
  • GCP — Service Accounts
  • Azure — Managed Identities

MFA (Multi-Factor Authentication)

Every human user should have MFA enabled. Period. Especially the root/admin account. With MFA on, a stolen password alone is not enough to log in. A stolen password without MFA is a catastrophe.

# enforce MFA: put this condition on a Deny statement so requests without MFA are blocked
"Condition": {
  "BoolIfExists": {
    "aws:MultiFactorAuthPresent": "false"
  }
}

Common Mistakes

Overly permissive policies — using "Action": "*" and "Resource": "*" gives full admin access. Never do this for application roles.

Long-lived credentials — access keys that never rotate. Use roles with temporary credentials instead.

Not using groups — attaching policies directly to users means managing permissions one by one. Use groups: put all developers in a “Developers” group, attach the policy to the group.

Ignoring the root account — the root account can do anything and can’t be restricted. Lock it down with MFA, use it only for billing, and create admin IAM users for daily work.

No audit logging — without CloudTrail (AWS) or Cloud Audit Logs (GCP), we have no idea who did what. Always enable it.

In simple language, IAM is about answering three questions for every request: Who are we? What are we trying to do? Are we allowed? Getting this right is the single most impactful security practice in the cloud.


34

Cloud Storage and Databases

intermediate cloud s3 rds storage databases

The cloud offers a dozen different ways to store data, and picking the wrong one is a fast way to burn money or build something painfully slow. Let’s break down the main categories and when to use each.

Object Storage

Think of object storage as a giant key-value store for files. We upload a file, get a URL, and that’s it. No folders (the “folders” in S3 are just prefixes in the key name), no file system, no mounting.

Use for: static assets (images, videos), backups, log archives, data lake files, static website hosting.

Examples: AWS S3, Google Cloud Storage, Azure Blob Storage.

# upload a file to S3
aws s3 cp backup.tar.gz s3://my-bucket/backups/backup-2024-03-15.tar.gz

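# make a file publicly readable (careful!)
# this fails when the bucket has ACLs disabled or Block Public Access on, the default for new buckets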
aws s3api put-object-acl --bucket my-bucket \
  --key public/logo.png --acl public-read

# sync a directory (like rsync for S3)
aws s3 sync ./dist s3://my-website-bucket --delete

Object storage is insanely cheap for large amounts of data. S3 Standard costs about $0.023 per GB/month. For rarely accessed data, S3 Glacier drops to $0.004/GB.

Block Storage

Block storage is a virtual hard drive that we attach to a VM. It shows up as a disk device and we can format it with any filesystem.

Use for: OS disks, database storage, anything that needs a traditional filesystem.

Examples: AWS EBS, Google Persistent Disk, Azure Managed Disks.

The key difference from object storage: block storage is attached to one instance and provides low-latency, high-IOPS access. Object storage is accessed over HTTP and is for bulk data.

File Storage (Shared)

Sometimes multiple servers need to read/write the same files. That’s where network file storage comes in — it’s like NFS in the cloud.

Use for: shared configuration, CMS media directories, legacy apps that expect a filesystem.

Examples: AWS EFS, Google Filestore, Azure Files.

Storage Comparison

Picking the Right Storage
Object (S3)     — files via HTTP, unlimited scale, cheapest
Block (EBS)     — virtual disk, one VM, fastest IOPS
File (EFS)      — shared NFS, multiple VMs, more expensive
Database       — structured data, queries, transactions

Managed Databases

Instead of installing PostgreSQL on an EC2 instance and managing backups, patches, and replication ourselves, we can use a managed database service. The provider handles the infra; we handle the data.

Relational (SQL):

  • AWS RDS — MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
  • Google Cloud SQL — MySQL, PostgreSQL, SQL Server
  • Azure Database — MySQL, PostgreSQL, SQL Server

NoSQL:

  • DynamoDB (AWS) — key-value / document, single-digit ms latency
  • Firestore (GCP) — document database for apps
  • Cosmos DB (Azure) — multi-model, globally distributed

Why managed? Automatic backups, point-in-time recovery, read replicas, automatic failover, and patching. All the ops stuff that keeps DBAs up at night — handled for us.

Caching

For data that’s read way more than written, sticking a cache in front of the database can cut response times from 50ms to 1ms.

# common pattern:
# 1. Check Redis cache
# 2. If hit → return cached data (fast!)
# 3. If miss → query database → store in cache → return

  • AWS ElastiCache — managed Redis or Memcached
  • Google Memorystore — managed Redis
  • Azure Cache for Redis — managed Redis
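
The check-then-fill pattern from the comments above looks roughly like this in code. This is a minimal sketch, assuming a plain dict stands in for Redis and a hypothetical `query_database` function stands in for the real database call; `get_user`, `STATS`, and the 60-second TTL are made up for the example.

```python
import time

CACHE = {}                 # stand-in for Redis: key -> (expires_at, value)
TTL_SECONDS = 60           # assumed expiry; tune per workload
STATS = {"db_calls": 0}    # counts how often we fall through to the database


def query_database(user_id):
    """Hypothetical slow database lookup."""
    STATS["db_calls"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id):
    key = f"user:{user_id}"
    entry = CACHE.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                       # cache hit: skip the database
    value = query_database(user_id)           # cache miss: go to the database
    CACHE[key] = (time.monotonic() + TTL_SECONDS, value)
    return value
```

Calling `get_user(42)` twice hits the database only once. The TTL matters: without it, cached entries would never pick up later writes.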

Choosing the Right Storage

# Quick decision guide:
# Storing user uploads, images, backups?    → Object storage (S3)
# Need a disk for a VM or database?         → Block storage (EBS)
# Multiple servers need shared files?       → File storage (EFS)
# Structured data with queries?             → Managed database (RDS)
# Need sub-millisecond reads?               → Cache (Redis)

In simple language, cloud storage is about matching the tool to the job. S3 for files, EBS for disks, RDS for structured data, Redis for speed. Using the wrong type often works, but it usually costs more and performs worse.


35

Serverless and Managed Services

intermediate serverless lambda api-gateway event-driven

Serverless doesn’t mean “no servers.” The servers are still there — we just don’t think about them. We write a function, upload it, and the cloud runs it whenever something triggers it. No provisioning, no patching, no scaling decisions. We pay per invocation, not per hour.

How It Works

We write a function that handles one event — an HTTP request, a file upload, a message from a queue. The cloud provider:

  1. Receives the event
  2. Spins up our function in a container (or reuses a warm one)
  3. Runs our code
  4. Shuts down when done
  5. Charges us for the milliseconds of execution

// AWS Lambda example: a simple Node.js function
// handler.js
exports.handler = async (event) => {
  const name = event.queryStringParameters?.name || "World";
  return {
    statusCode: 200,
    body: JSON.stringify({ message: `Hello, ${name}!` }),
  };
};

That’s it. No Express server, no port configuration, no process manager. Just a function.

The Event-Driven Pattern

Serverless shines when events drive the workflow. Instead of a monolith doing everything, we chain small functions together.

Event-Driven Architecture
Example: Image Processing Pipeline
  User uploads image → S3 Bucket
    ↓ triggers
  Lambda: resize → saves thumbnail to S3
    ↓ triggers
  Lambda: metadata → writes record to DynamoDB
    ↓ triggers
  SNS notification → emails the user "upload complete"

No server running 24/7 waiting for uploads. Each piece runs only when triggered, scales independently, and costs nothing when idle.
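
To make the first hop of that pipeline concrete, here is a rough Python sketch of a handler that reads the bucket and key out of an S3 upload notification. The resize step itself is left out, and `make_thumbnail_key` with its `thumbnails/` prefix is a hypothetical naming scheme, not something the pipeline above prescribes.

```python
from urllib.parse import unquote_plus


def make_thumbnail_key(key):
    """Hypothetical naming scheme: store thumbnails under a separate prefix."""
    return f"thumbnails/{key}"


def handler(event, context):
    """Entry point Lambda would call for each S3 upload notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # A real function would download, resize, and upload here.
        results.append({
            "bucket": bucket,
            "source": key,
            "thumbnail": make_thumbnail_key(key),
        })
    return results
```

Writing thumbnails to a separate prefix (or bucket), and limiting the trigger to the upload prefix, keeps the function from re-triggering itself on its own output.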

Key Serverless Services

Compute:

  • AWS Lambda / Google Cloud Functions / Azure Functions — run code on events
  • Cloudflare Workers — runs at the edge (CDN nodes), ultra-low latency

API Layer:

  • API Gateway (AWS, GCP) — routes HTTP requests to Lambda functions, handles auth, rate limiting, CORS

Messaging:

  • SQS (Simple Queue Service) — message queue with at-least-once delivery, decouples services
  • SNS (Simple Notification Service) — pub/sub, fan out one event to many subscribers
  • EventBridge — event bus for routing events between AWS services

# AWS SAM template — Lambda behind API Gateway
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  HelloFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.handler
      Runtime: nodejs20.x
      Timeout: 10
      MemorySize: 128
      Events:
        Api:
          Type: Api
          Properties:
            Path: /hello
            Method: get

When Serverless Makes Sense

  • Sporadic workloads — a function that runs 100 times a day costs almost nothing
  • Event processing — file uploads, webhooks, queue consumers
  • APIs with variable traffic — scales to zero when nobody’s calling
  • Cron jobs — scheduled Lambda beats keeping a server running 24/7 for a 5-minute task

When It Doesn’t

Cold starts — if a function hasn’t been invoked recently, the first invocation takes 100ms-2s to spin up. For latency-sensitive APIs, this can hurt.

Execution limits — Lambda maxes out at 15 minutes. Long-running tasks (video processing, ML training) need a different approach.

Vendor lock-in — Lambda code is easy to port, but once we’re deep into SQS + SNS + Step Functions + DynamoDB, switching providers is a massive effort.

Complex local development — replicating the Lambda + API Gateway + DynamoDB stack locally requires tools like SAM or Serverless Framework. It’s doable but more friction than running a local Express server.

High-throughput, steady traffic — if we’re running 10 million invocations per hour consistently, a container on ECS or a Kubernetes pod would be cheaper.

Cost Model

# Lambda pricing (approximate):
# $0.20 per 1 million requests
# $0.0000166667 per GB-second of compute
#
# Example: 1 million requests, each using 128MB for 200ms
# Compute: 1M × 0.128GB × 0.2s × $0.0000166667 = $0.43
# Requests: 1M × $0.0000002 = $0.20
# Total: ~$0.63 for 1 million invocations
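
The same arithmetic as a small function, handy for trying other memory sizes and durations. It is only a sketch: the prices are the approximate figures quoted above, the free tier is ignored, and it follows the text in treating 128 MB as 0.128 GB, so a real bill may differ slightly. The name `lambda_cost` is made up for the example.

```python
PRICE_PER_REQUEST = 0.20 / 1_000_000     # USD, from the figures above
PRICE_PER_GB_SECOND = 0.0000166667       # USD, from the figures above


def lambda_cost(requests, memory_mb, duration_ms):
    """Rough cost estimate in USD, ignoring the free tier and other charges."""
    gb_seconds = requests * (memory_mb / 1000) * (duration_ms / 1000)
    compute = gb_seconds * PRICE_PER_GB_SECOND
    request_fee = requests * PRICE_PER_REQUEST
    return round(compute + request_fee, 2)


print(lambda_cost(1_000_000, 128, 200))
```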

In simple language, serverless is about writing code and letting the cloud handle everything else. It’s perfect for bursty, event-driven workloads. But for always-on, high-throughput services, running our own containers is usually cheaper and gives us more control.


Infrastructure as Code

36

Infrastructure as Code — Concepts and Benefits

beginner iac idempotency declarative automation

Imagine we need to set up 10 servers, each with the same software, same firewall rules, same users. Doing it by clicking through a cloud console? That’s a recipe for missed steps and “it works on my server” problems. Infrastructure as Code (IaC) means we write code that defines our infrastructure, and tools execute that code to create the actual resources.

In simple language, instead of clicking “Create Instance” in the AWS console, we write a file that says “I want an instance with these specs” and a tool makes it happen.

Why IaC Matters

  • Reproducibility — run the same code, get the same infrastructure. Every time.
  • Version control — our infra lives in Git. We can review changes in PRs, roll back mistakes, and see who changed what.
  • Disaster recovery — server died? Run the code again. Done.
  • Consistency — no more “this staging server is slightly different from prod” situations.
  • Speed — spinning up a new environment goes from hours of clicking to minutes of automation.

Declarative vs Imperative

These are the two approaches to IaC, and the difference is like ordering food at a restaurant.

Declarative (WHAT)
"I want 3 nginx servers with 4GB RAM"
The tool figures out HOW to get there
Run it again? No changes (already in desired state)
Tools: Terraform, CloudFormation, Pulumi
Imperative (HOW)
"Create a server, then install nginx, then open port 80..."
We write every step ourselves
Run it again? Might break (duplicate resources)
Tools: Bash scripts, AWS CLI, SDKs

Most modern IaC tools are declarative. We describe the end state, and the tool calculates the steps needed to reach it.

Idempotency

This is a fancy word for a simple concept: running the same thing twice gives the same result. If we declare “I want 3 servers” and we already have 3 servers, a good IaC tool does nothing. It doesn’t create 3 more. That’s idempotency.

This is huge for safety. We can re-run our IaC code without fear of creating duplicate resources or breaking things.
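
A toy version of the idea, assuming a plain list stands in for the cloud account and `reconcile` is a hypothetical helper: it compares the desired count with what exists and only acts on the difference.

```python
def reconcile(desired_count, servers):
    """Bring `servers` to `desired_count`; return the actions taken."""
    actions = []
    while len(servers) < desired_count:
        name = f"server-{len(servers) + 1}"
        servers.append(name)
        actions.append(f"create {name}")
    while len(servers) > desired_count:
        actions.append(f"destroy {servers.pop()}")
    return actions


fleet = []
print(reconcile(3, fleet))   # first run creates three servers
print(reconcile(3, fleet))   # second run has nothing to do
```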

Popular IaC Tools

  • Terraform — the most popular, works with every cloud provider. Declarative, uses HCL language.
  • Pulumi — like Terraform but we write real code (Python, TypeScript, Go) instead of a custom language.
  • AWS CloudFormation — AWS-only, uses YAML/JSON. Tightly integrated with AWS services.
  • Ansible — technically a configuration management tool, but can also provision infrastructure. Uses YAML playbooks.

Think of it like this: Terraform/CloudFormation create the servers, and Ansible configures what’s inside them. They’re often used together.

In simple language, IaC is treating our servers and cloud resources the same way we treat application code — written in files, stored in Git, reviewed in PRs, and reproducible on demand.


37

Terraform Fundamentals

intermediate terraform hcl providers resources

Terraform is a declarative IaC tool by HashiCorp. We describe what infrastructure we want in files, and Terraform figures out how to create it, update it, or tear it down. It works with AWS, GCP, Azure, Cloudflare, and hundreds of other providers.

It uses its own language called HCL (HashiCorp Configuration Language). It looks a lot like JSON but is more human-friendly.

Core Concepts

Providers — plugins that let Terraform talk to a specific platform. The AWS provider knows how to create EC2 instances, S3 buckets, etc. We always declare which providers we need.

Resources — the actual infrastructure we define. An S3 bucket, a DNS record, a Kubernetes cluster — each is a resource.

Variables — inputs to our configuration. Think of them like function parameters.

Outputs — values we want to expose after Terraform runs (like an IP address or a URL).

Data sources — let us read information about existing infrastructure that Terraform doesn’t manage.

A Simple Example

Here’s a Terraform config that creates an S3 bucket:

# Configure the AWS provider (terraform init downloads the plugin)
provider "aws" {
  region = "ap-south-1"
}

# Define a variable so we can reuse this config
variable "bucket_name" {
  description = "Name of the S3 bucket"
  type        = string
  default     = "my-app-assets-2024"  # S3 bucket names must be globally unique
}

# Create the actual S3 bucket
resource "aws_s3_bucket" "assets" {
  bucket = var.bucket_name

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Output the bucket's ARN so other configs can reference it
output "bucket_arn" {
  value = aws_s3_bucket.assets.arn
}

Every resource has a type (aws_s3_bucket) and a local name (assets). We reference it elsewhere as aws_s3_bucket.assets.

The Workflow

Terraform has a simple four-step workflow that we’ll use constantly:

# 1. Initialize — downloads provider plugins
terraform init

# 2. Plan — shows what will change (dry run, nothing is created yet)
terraform plan

# 3. Apply — actually creates/updates the infrastructure
terraform apply

# 4. Destroy — tears everything down (careful with this one!)
terraform destroy

terraform plan is our safety net. It shows exactly what will be created, changed, or destroyed before we commit. Always read the plan output before running apply.

Variable Types

Terraform supports several variable types:

variable "instance_count" {
  type    = number
  default = 2
}

variable "enable_monitoring" {
  type    = bool
  default = true
}

variable "allowed_ips" {
  type    = list(string)
  default = ["10.0.0.0/8", "172.16.0.0/12"]
}

variable "tags" {
  type = map(string)
  default = {
    team = "platform"
  }
}

We can pass variable values via CLI flags (-var), .tfvars files, or environment variables (TF_VAR_bucket_name).

Data Sources

Sometimes we need info about resources that already exist — maybe a VPC that was created manually or by another team:

# Look up an existing VPC by tag
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}

# Use it in a resource
resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

The key difference between resource and data is that resource blocks create and manage things, while data blocks only read them.

In simple language, Terraform lets us describe our cloud infrastructure in text files, preview changes before they happen, and apply them with confidence. It’s like having a blueprint for our entire cloud setup.


38

Terraform State and Modules

advanced terraform state modules remote-backend

When Terraform creates infrastructure, it needs to remember what it created. Otherwise, how would it know what to update or destroy next time? That memory is called state.

What Is State?

State is a JSON file (terraform.tfstate) that maps our Terraform configuration to real-world resources. When we write resource "aws_s3_bucket" "assets", Terraform stores the actual bucket ID, ARN, and all its properties in state.

Without state, Terraform would have no idea what already exists. It would try to create everything from scratch every time.

Local vs Remote State

By default, state lives in a local file. That’s fine for solo experimentation, but terrible for teams. If two people run terraform apply at the same time with different local state files, things get destroyed or duplicated.

Remote state solves this. We store state in a shared location, commonly an S3 bucket with a DynamoDB table for locking (recent Terraform releases can also lock natively in S3 via use_lockfile).

# Store state remotely in S3
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-locks"  # prevents concurrent writes
    encrypt        = true
  }
}

Now everyone on the team reads and writes to the same state. No conflicts.

State Locking

When someone runs terraform apply, the state gets locked. This prevents another person from making changes at the same time. Think of it like a database transaction lock — only one writer at a time.

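The DynamoDB table in the example above handles this. If we try to run apply while someone else is already running it, Terraform fails with an “Error acquiring the state lock” message instead of proceeding. Passing -lock-timeout=5m makes it keep retrying for that long before giving up.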
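
The locking idea boils down to a "put if absent" write. The sketch below uses an in-memory dict as a stand-in for the lock table; `LockTable` is a hypothetical class for illustration, and a real backend needs the check-and-write to be a single atomic operation.

```python
class LockTable:
    """In-memory stand-in for a lock store that supports put-if-absent."""

    def __init__(self):
        self._locks = {}

    def acquire(self, state_key, owner):
        # Only succeeds when nobody holds the lock for this state file.
        if state_key in self._locks:
            return False
        self._locks[state_key] = owner
        return True

    def release(self, state_key, owner):
        # Only the current holder may release.
        if self._locks.get(state_key) == owner:
            del self._locks[state_key]
            return True
        return False
```

With this, a second `acquire` on the same state key fails until the first owner calls `release`, which is the "one writer at a time" behaviour described above.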

Why State Matters

  • Drift detection — Terraform compares state with what actually exists. If someone manually changed a resource in the console, terraform plan catches the difference.
  • Dependency tracking — state records relationships between resources. Terraform knows it needs to create the VPC before the subnet.
  • Performance — state caches each resource’s attributes. On large setups we can plan from that cache (-refresh=false) instead of querying every resource from the cloud API.
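
Drift detection is conceptually a diff between what state recorded and what the provider reports now. A minimal sketch, assuming both sides are available as plain dicts (`detect_drift` and the attribute names are made up for illustration):

```python
def detect_drift(recorded, live):
    """Return {attribute: (recorded_value, live_value)} for every mismatch."""
    drift = {}
    for attr in sorted(set(recorded) | set(live)):
        if recorded.get(attr) != live.get(attr):
            drift[attr] = (recorded.get(attr), live.get(attr))
    return drift


recorded = {"bucket": "my-app-assets", "versioning": True}
live = {"bucket": "my-app-assets", "versioning": False}   # changed by hand
print(detect_drift(recorded, live))
```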

Useful State Commands

# List all resources Terraform is tracking
terraform state list

# Show details of a specific resource
terraform state show aws_s3_bucket.assets

# Remove a resource from state (Terraform forgets it, doesn't delete it)
terraform state rm aws_s3_bucket.legacy

# Move a resource (renamed something in code?)
terraform state mv aws_s3_bucket.old aws_s3_bucket.new

Workspaces

Workspaces let us manage multiple environments (dev, staging, prod) with the same code but separate state files:

terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform workspace list

Each workspace gets its own state. We can use terraform.workspace in our config to change behavior per environment.

Modules

As our infrastructure grows, copy-pasting resource blocks is painful. Modules are reusable packages of Terraform configuration. Think of them like functions — we define them once and call them with different inputs.

A module is just a directory with .tf files:

modules/
  s3-bucket/
    main.tf         # resource definitions
    variables.tf    # input parameters
    outputs.tf      # exposed values

# modules/s3-bucket/variables.tf
variable "bucket_name" {
  type = string
}

variable "environment" {
  type    = string
  default = "dev"
}

# modules/s3-bucket/main.tf
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# modules/s3-bucket/outputs.tf
output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}

Now we call it from our root config:

# Use our module for different buckets
module "app_assets" {
  source      = "./modules/s3-bucket"
  bucket_name = "my-app-assets"
  environment = "production"
}

module "logs" {
  source      = "./modules/s3-bucket"
  bucket_name = "my-app-logs"
  environment = "production"
}

We can also use modules from the Terraform Registry — thousands of community-maintained modules for common patterns.

In simple language, state is Terraform’s memory of what it built, remote backends make that memory shared and safe, and modules let us stop repeating ourselves. These three concepts are what separate a messy Terraform project from a clean one.


39

Ansible Basics

intermediate ansible playbooks roles configuration-management

Terraform creates infrastructure (servers, networks, databases). But once those servers exist, who installs software, copies config files, and starts services? That’s where Ansible comes in.

Ansible is a configuration management tool. It connects to our servers over SSH, runs tasks we define, and makes sure everything is set up the way we want. The best part: it’s agentless. There’s no daemon to install on the target machines, just SSH access and (for most modules) a Python interpreter.

Ansible vs Terraform

These two are not competitors — they’re teammates.

Terraform
Creates infrastructure
Provisions VMs, networks, databases
Tracks state
Declarative (HCL)
Ansible
Configures infrastructure
Installs packages, copies files, starts services
Stateless (no state file)
Procedural (YAML playbooks)

Think of it like building a house: Terraform lays the foundation and builds the walls. Ansible paints, installs furniture, and sets up the Wi-Fi.

Inventory

The inventory is a file that lists the servers Ansible should manage:

# inventory.yml
all:
  children:
    webservers:
      hosts:
        web1:
          ansible_host: 10.0.1.10
        web2:
          ansible_host: 10.0.1.11
    databases:
      hosts:
        db1:
          ansible_host: 10.0.2.10

We group hosts so we can target them separately — “install nginx on webservers” or “update postgres on databases.”

Playbooks

A playbook is a YAML file that describes a series of tasks to run on specific hosts:

# setup-web.yml
- name: Setup web servers
  hosts: webservers
  become: true  # run as root (sudo)

  tasks:
    - name: Update apt cache
      apt:
        update_cache: true

    - name: Install nginx
      apt:
        name: nginx
        state: present  # make sure it's installed

    - name: Copy our nginx config
      copy:
        src: ./files/nginx.conf
        dest: /etc/nginx/nginx.conf
      notify: restart nginx  # trigger handler if file changed

    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
        enabled: true  # start on boot

  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted

We run it with:

ansible-playbook -i inventory.yml setup-web.yml

Modules

Each task uses a module — a built-in action that Ansible knows how to perform. We used apt, copy, and service above. There are thousands of modules:

  • apt / yum — install packages
  • copy — copy files to remote hosts
  • template — copy files with variable substitution (Jinja2)
  • service — start/stop/restart services
  • user — create/manage system users
  • docker_container — manage Docker containers
  • git — clone repositories

Idempotency in Practice

Ansible modules are designed to be idempotent. If we say state: present for nginx and nginx is already installed, Ansible does nothing. If we say state: started for a service and it’s already running, nothing happens.

This means we can safely re-run a playbook multiple times. The output even shows us: ok (already good), changed (made a change), or failed.
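
The check-before-change behaviour can be imitated in a few lines of Python. This is only an illustration of the idea, not how Ansible is implemented; `ensure_line` is a hypothetical helper that reports "ok" or "changed" the way a task would.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is in the file; report 'ok' or 'changed'."""
    existing = []
    if os.path.exists(path):
        with open(path) as fh:
            existing = fh.read().splitlines()
    if line in existing:
        return "ok"            # already in the desired state, do nothing
    with open(path, "a") as fh:
        fh.write(line + "\n")
    return "changed"


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "app.conf")
    print(ensure_line(demo, "worker_processes 4;"))   # first run changes the file
    print(ensure_line(demo, "worker_processes 4;"))   # second run finds nothing to do
```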

Roles

As our playbooks grow, dumping everything into one file gets messy. Roles organize tasks, files, templates, and variables into a standard structure:

roles/
  nginx/
    tasks/main.yml      # the task list
    handlers/main.yml   # handlers (restart, reload)
    templates/           # Jinja2 template files
    files/               # static files to copy
    defaults/main.yml   # default variable values

Then our playbook becomes clean:

- name: Setup web servers
  hosts: webservers
  become: true
  roles:
    - nginx
    - certbot
    - app-deploy

Each role is self-contained and reusable. We can share roles via Ansible Galaxy, which is like npm but for Ansible roles.

In simple language, Ansible is the tool that SSHs into our servers and makes sure they’re configured exactly the way we described. No agents to install, no state to manage — just YAML files and SSH.


Observability & Reliability

40

Monitoring and Alerting

intermediate monitoring prometheus grafana alerting

Our app is deployed. Users are hitting it. How do we know if it’s healthy? If response times are climbing? If the disk is filling up? We can’t just SSH in and check every 5 minutes. We need monitoring — systems that watch our infrastructure and applications, collect data, and alert us when something goes wrong.

The goal is simple: know about problems before our users do.

Metric Types

Before diving into tools, let’s understand the four types of metrics we’ll work with:

  • Counter — a value that only goes up. Total requests served, total errors, total bytes sent. We usually look at the rate (requests per second).
  • Gauge — a value that goes up and down. Current CPU usage, memory in use, active connections.
  • Histogram — tracks the distribution of values. Great for response times — we can see the median, 95th percentile, 99th percentile.
  • Summary — similar to a histogram, but quantiles are calculated on the client side. Cheaper to query, but the quantiles are fixed up front and can’t be aggregated across instances.

Prometheus

Prometheus is the most popular open-source monitoring system. It uses a pull model — Prometheus scrapes metrics from our applications at regular intervals, rather than our apps pushing metrics to it.

Prometheus Architecture
  App A (/metrics), App B (/metrics)
    ← scrapes ← Prometheus (TSDB + PromQL) → queries → Grafana (Dashboards)
  Alertmanager → Slack / PagerDuty / Email

Our apps expose a /metrics endpoint. Prometheus scrapes it every 15-30 seconds and stores the data in its time-series database. We then query it with PromQL.

# Some basic PromQL examples

# Per-second HTTP request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Current memory usage
process_resident_memory_bytes
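
To get a feel for what these queries compute, here is a simplified Python sketch. It is not Prometheus's implementation: the real rate() also handles counter resets and extrapolates to the window edges, and real histograms end with a +Inf bucket. The function names and the sample data are made up, and `bucket_quantile` assumes a non-empty histogram with finite bounds.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Uses linear interpolation inside the matching bucket, the same basic
    idea histogram_quantile() relies on.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]
```

The interpolation step is why histogram percentiles are estimates: accuracy depends on how well the bucket boundaries match the real latency distribution.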

Grafana

Grafana is a visualization tool that connects to Prometheus (and many other data sources) and lets us build beautiful dashboards. We write PromQL queries, and Grafana draws the graphs.

Most teams have dashboards for: system resources (CPU, memory, disk), application metrics (request rate, error rate, latency), and business metrics (signups, orders).

The USE Method (for Infrastructure)

When debugging infrastructure problems, check these three things for every resource (CPU, memory, disk, network):

  • Utilization — how busy is it? (CPU at 90%)
  • Saturation — is work queuing up? (disk I/O queue length)
  • Errors — are there failures? (disk read errors)

The RED Method (for Services)

For application-level monitoring, track:

  • Rate — requests per second
  • Errors — failed requests per second
  • Duration — how long requests take (latency)

If we nail USE for infra and RED for services, we’ll catch most problems.

Alerting Best Practices

Setting up alerts is easy. Setting up good alerts is hard. The biggest mistake is alert fatigue — too many alerts that don’t need human action, so people start ignoring them.

  • Alert on symptoms, not causes. Alert on “API error rate > 5%” not “CPU > 80%”. High CPU might be fine during a batch job.
  • Every alert should be actionable. If there’s nothing someone can do about it, it’s not an alert — it’s a dashboard metric.
  • Set severity levels. Not everything needs to page someone at 3 AM. Use critical (page), warning (Slack), and info (dashboard only).
  • Include runbook links. Every alert should link to a doc explaining what to check and how to fix it.

In simple language, monitoring is our early warning system. Prometheus collects the data, Grafana shows us the pretty graphs, and Alertmanager wakes us up when something’s actually broken. USE for infra, RED for services — that covers 90% of what we need.


41

Logging and Log Aggregation

intermediate logging elk structured-logging aggregation

When our app was a single server, we’d SSH in and tail -f /var/log/app.log. Easy. But now we’ve got 20 containers across 5 machines, some scaling up and down automatically. Where do the logs go? How do we find the one error message that explains why a user’s request failed?

We need centralized logging — all logs from all services flowing into one searchable place.

Structured Logging

The first step is making our logs machine-readable. Unstructured logs look like this:

[2024-03-15 14:23:01] ERROR - Failed to process order #4521 for user john@example.com

Good for humans, terrible for searching and filtering. Structured logs use JSON with consistent fields:

{
  "timestamp": "2024-03-15T14:23:01Z",
  "level": "error",
  "message": "Failed to process order",
  "service": "order-service",
  "order_id": 4521,
  "user_email": "john@example.com",
  "error": "payment_declined",
  "correlation_id": "abc-123-def"
}

Now we can filter by level: error, search by order_id, or group by service. Every log entry follows the same shape, which makes aggregation tools much more useful.
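
One way to produce logs in this shape with Python's standard logging module is a custom formatter. This is a sketch under a few assumptions: `JsonFormatter` is our own class, and passing per-event data via `extra={"fields": {...}}` is a convention invented for this example, not something logging defines.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Per-event fields passed via extra={"fields": {...}} (our own convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.error("Failed to process order",
             extra={"fields": {"order_id": 4521, "error": "payment_declined"}})
```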

Log Levels

Use them consistently across all services:

  • DEBUG — verbose detail for development. Never in production unless we’re actively debugging.
  • INFO — normal operations. “Server started”, “Order created”, “User logged in.”
  • WARN — something unexpected but the system handled it. “Retry succeeded after 2 attempts.”
  • ERROR — something failed and needs attention. “Database connection lost”, “Payment API returned 500.”
  • FATAL — the process is crashing. “Out of memory”, “Cannot bind to port.”

A good rule: production should run at INFO level, and we should be able to flip to DEBUG without restarting (via config or env var).

The ELK Stack

ELK is the most popular log aggregation setup. It stands for:

ELK Stack Flow
  Services (stdout / files)
    → Logstash (collect, parse, transform)
    → Elasticsearch (store & index)
    → Kibana (search & visualize)

  • Elasticsearch — a search engine that stores and indexes logs. We can query logs by any field in milliseconds.
  • Logstash — collects logs from various sources, parses them, transforms fields, and ships them to Elasticsearch.
  • Kibana — the web UI. We search logs, build dashboards, and set up alerts.

A popular alternative is EFK — replacing Logstash with Fluentd. Fluentd is lighter, more cloud-native, and widely used in Kubernetes environments. Same idea, different collector.

Correlation IDs

In a microservices world, a single user request might hit 5 different services. When something fails, how do we trace the request through all of them?

We generate a unique correlation ID (also called trace ID) at the entry point and pass it through every service call. Every log line includes this ID.

{"correlation_id": "abc-123", "service": "api-gateway", "message": "Received order request"}
{"correlation_id": "abc-123", "service": "order-service", "message": "Creating order"}
{"correlation_id": "abc-123", "service": "payment-service", "message": "Charging card"}
{"correlation_id": "abc-123", "service": "payment-service", "level": "error", "message": "Card declined"}

Now we search for correlation_id: abc-123 in Kibana and see the entire journey. This is the foundation of distributed tracing — tools like Jaeger and Zipkin take this further with visual timeline views.
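
Propagating the ID usually comes down to two small helpers: reuse the ID if the caller sent one, otherwise mint it at the entry point, then attach it to every outgoing call. The sketch below assumes the ID travels in an `X-Correlation-ID` header, which is a common convention rather than a standard; the function names are hypothetical.

```python
import uuid

HEADER = "X-Correlation-ID"   # assumed header name; a common convention


def ensure_correlation_id(headers):
    """Reuse the caller's ID if present, otherwise mint one at the entry point."""
    return headers.get(HEADER) or str(uuid.uuid4())


def outgoing_headers(correlation_id, extra=None):
    """Headers to attach to every downstream call for this request."""
    headers = dict(extra or {})
    headers[HEADER] = correlation_id
    return headers
```

Real HTTP headers are case-insensitive, so a production version would normalize header names before the lookup.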

Log Retention and Rotation

Logs grow fast. We need a plan:

  • Rotation — tools like logrotate on Linux automatically compress and archive old log files. Docker also supports log rotation via its logging drivers.
  • Retention — keep hot logs (last 7-30 days) in Elasticsearch for fast searching. Move older logs to cold storage (S3, Glacier) for compliance. Delete after the retention period.
  • Index lifecycle management — Elasticsearch has built-in ILM policies that automatically move, shrink, and delete indices based on age.

# Docker log rotation — set in daemon.json or per-container
docker run -d \
  --log-opt max-size=50m \
  --log-opt max-file=3 \
  my-app:latest

In simple language, centralized logging is about getting all our logs into one searchable place. Structure them as JSON, slap a correlation ID on every request, pipe everything through ELK (or EFK), and we’ll be able to debug anything across any number of services.


42

Secrets Management and TLS

advanced secrets vault tls certificates encryption

Here’s a nightmare scenario: someone pushes an .env file with the production database password to a public GitHub repo. A bot scrapes it within minutes. The database is compromised. This happens more often than we’d like to admit.

Secrets management is about making sure passwords, API keys, tokens, and certificates never end up in code, Git history, or plain-text config files.

The Problem with Environment Variables

The most common approach is environment variables. They’re simple, but they have real limitations:

  • Env vars can be read by any process running as the same user, or by root (/proc/<pid>/environ on Linux), and child processes inherit them.
  • They get logged accidentally — a debug log that prints all env vars leaks every secret.
  • No access control — anyone who can SSH into the machine as that user or root can see all of them.
  • No audit trail — who accessed which secret? No idea.
  • No rotation — changing a secret means redeploying every service that uses it.

Env vars are fine for development. For production, we need something better.

Approaches to Secrets Management

1. Mounted files / volumes — secrets are written to files that containers mount. Better than env vars because we can restrict file permissions, but still static.

2. Cloud-native secrets managers — AWS Secrets Manager, GCP Secret Manager, Azure Key Vault. They store secrets encrypted, provide access control, audit logs, and automatic rotation.

3. HashiCorp Vault — the most popular open-source option. Cloud-agnostic, works everywhere.

HashiCorp Vault Basics

Vault is a tool for securely storing and accessing secrets. It encrypts everything at rest, provides fine-grained access policies, keeps an audit log of every access, and can even generate short-lived credentials on the fly.

# Store a secret
vault kv put secret/myapp/db \
  username="admin" \
  password="s3cur3-p@ss"

# Read it back
vault kv get secret/myapp/db

# Read just one field
vault kv get -field=password secret/myapp/db

Vault also supports dynamic secrets — instead of storing a static database password, Vault creates a new temporary database user on demand. When the lease expires, Vault deletes the user. If credentials leak, they’re already expired. This is a game changer for security.
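
The lease idea can be pictured with a toy model. This is not Vault's API; in reality Vault creates and revokes the database user itself. `issue_lease` and `is_active` are hypothetical helpers that only show why an expired credential is harmless.

```python
import secrets
import time


def issue_lease(role, ttl_seconds, now=None):
    """Mint a throwaway username/password pair that expires after the TTL."""
    now = time.time() if now is None else now
    return {
        "username": f"{role}-{secrets.token_hex(4)}",
        "password": secrets.token_urlsafe(24),
        "expires_at": now + ttl_seconds,
    }


def is_active(lease, now=None):
    """A leaked lease is only useful until `expires_at`."""
    now = time.time() if now is None else now
    return now < lease["expires_at"]
```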

Kubernetes Secrets

Kubernetes has a built-in Secret resource. It’s better than hardcoding, but has a catch — Secrets are only base64-encoded, not encrypted (by default):

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  username: YWRtaW4=      # base64 of "admin"
  password: czNjdXIzLXBAc3M=  # base64 of "s3cur3-p@ss"

Anyone with cluster access can decode these. For real security, we pair K8s Secrets with Sealed Secrets (encrypts secrets in Git, only the cluster can decrypt) or integrate with Vault using the Vault Agent Injector.
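
To see how little protection base64 offers, the values from the manifest above can be decoded with nothing but the standard library:

```python
import base64

secret_data = {
    "username": "YWRtaW4=",
    "password": "czNjdXIzLXBAc3M=",
}

# base64 is an encoding, not encryption: no key is needed to reverse it.
decoded = {name: base64.b64decode(value).decode("utf-8")
           for name, value in secret_data.items()}
print(decoded)
```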

TLS Refresher

TLS (Transport Layer Security) encrypts data in transit. Every HTTPS connection uses TLS. Here’s the quick breakdown:

  • Certificate — a file that proves “I am who I say I am.” Contains a public key, signed by a trusted authority.
  • Certificate Authority (CA) — the trusted party that signs certificates. Browsers and operating systems ship with a list of trusted CAs (Let’s Encrypt, DigiCert, etc.).
  • Private key — stays on our server, never shared. Used during the handshake to prove we own the certificate’s public key.
  • The handshake — client connects, server presents certificate, client verifies it’s signed by a trusted CA, they agree on encryption keys, data flows encrypted.
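
On the client side, most of that verification is a default we should be careful not to switch off. A small sketch using Python's ssl module (no network access needed) shows what a default client context enforces; the minimum-version line is an optional hardening step added for this example.

```python
import ssl

# Default client settings: trust the system CA store, verify the
# certificate chain, and check that the hostname matches the certificate.
client_ctx = ssl.create_default_context()
client_ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older protocol versions

print(client_ctx.verify_mode == ssl.CERT_REQUIRED)
print(client_ctx.check_hostname)
```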

cert-manager for Auto-Renewal

Managing TLS certificates manually is painful and error-prone. cert-manager automates this in Kubernetes. It requests certificates from Let’s Encrypt (or other CAs), installs them, and renews them before they expire — all automatically.

# cert-manager issuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx

Outside Kubernetes, tools like Caddy (which this very site uses!) handle TLS certificates automatically too.

mTLS — Mutual TLS

Normal TLS is one-way: the client verifies the server. mTLS (mutual TLS) goes both ways — the server also verifies the client. Both sides present certificates.

This is huge for service-to-service communication. Instead of relying on network firewalls alone, each service proves its identity cryptographically. Service meshes like Istio and Linkerd set up mTLS automatically between all services in a Kubernetes cluster.
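Service meshes hide the details, but the server-side switch for mTLS is small. The sketch below uses Python's standard `ssl` module; the `make_mtls_server_context` helper and its file-path parameters are hypothetical names for illustration, and a mesh like Istio or Linkerd configures the equivalent inside its sidecar proxies rather than in our code.

```python
import ssl
from typing import Optional


def make_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that also demands a client certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # This setting is what turns plain TLS into mutual TLS on the server side.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity (paths are supplied by the caller).
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # Only clients whose certificates chain to this CA will be accepted.
        context.load_verify_locations(cafile=client_ca_file)
    return context


context = make_mtls_server_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
```

With `verify_mode` left at its server-side default, the same context would accept any client, which is ordinary one-way TLS.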

Secrets Rotation

Even with a secrets manager, secrets should be rotated regularly. If a credential leaks, the blast radius depends on how long it stays valid:

  • Automated rotation — Vault and cloud secrets managers can rotate passwords on a schedule.
  • Short-lived credentials — better than rotation. If a token expires in 1 hour, a leak is only dangerous for 1 hour.
  • Rotation without downtime — support multiple active versions of a secret during rotation so services can gracefully switch over.
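The "multiple active versions" idea can be sketched in a few lines of Python. The `RotatingSecret` class and the `key-v1` style values are made up for illustration; real secrets managers track versions for us, but the acceptance logic on the verifying side looks roughly like this.

```python
import hmac
from typing import List


class RotatingSecret:
    """Keeps the newest secret plus a limited number of older ones valid (illustrative only)."""

    def __init__(self, initial: str, keep_versions: int = 2) -> None:
        self._versions: List[str] = [initial]  # newest first
        self._keep = keep_versions

    def rotate(self, new_secret: str) -> None:
        # The new secret becomes current; the oldest falls off once we exceed the limit.
        self._versions.insert(0, new_secret)
        del self._versions[self._keep:]

    @property
    def current(self) -> str:
        return self._versions[0]

    def is_valid(self, candidate: str) -> bool:
        # compare_digest avoids leaking information through timing differences.
        return any(
            hmac.compare_digest(candidate.encode(), version.encode())
            for version in self._versions
        )


api_key = RotatingSecret("key-v1")
api_key.rotate("key-v2")
print(api_key.is_valid("key-v1"), api_key.is_valid("key-v2"))  # both accepted during the overlap
api_key.rotate("key-v3")
print(api_key.is_valid("key-v1"))  # the oldest version has been retired
```

Services that still hold the previous key keep working during the overlap, and they have until the next rotation to pick up the new one.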

In simple language, secrets management boils down to three rules: never put secrets in code, always encrypt them at rest and in transit, and rotate them regularly. Tools like Vault make this practical, and TLS/mTLS makes sure data is encrypted as it moves between services.


43

High Availability and Disaster Recovery

advanced ha disaster-recovery rpo rto failover

Everything fails eventually. Servers crash. Disks die. Entire data centers go offline (it happened to AWS us-east-1 multiple times). High Availability (HA) is about designing systems that keep running even when individual components fail.

In simple language, HA means our users don’t notice when something breaks because something else picks up the slack immediately.

Single Points of Failure

The first step to HA is finding our single points of failure (SPOF) — components where, if they go down, the entire system goes down. Common ones:

  • One database server with no replica
  • One load balancer
  • One data center / availability zone
  • One DNS provider
  • One person who knows how the system works (the “bus factor”)

For every SPOF we find, the fix is the same: redundancy.

Redundancy Patterns

Active-Active — multiple instances all serve traffic simultaneously. If one dies, the others keep going. This is what we do with multiple web servers behind a load balancer. More capacity, more resilient, but we need to handle shared state carefully.

Active-Passive — one instance handles all traffic. A standby instance waits and takes over if the primary fails. Simpler to reason about, but the standby is sitting idle. Common with databases (primary-replica setup).
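A rough way to see why redundancy pays off: if instances fail independently and each one is up a fraction a of the time, the chance that all n are down together is (1 - a)^n. The sketch below computes that. Independence is a big assumption (shared power, a shared network, or one bad deploy breaks it), and the 99% figure is just an example input.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of a group that works as long as at least one instance is up.

    Assumes failures are independent, which real systems only approximate.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("need at least one instance")
    all_down = (1.0 - instance_availability) ** instances
    return 1.0 - all_down


for n in (1, 2, 3):
    print(n, round(combined_availability(0.99, n), 6))
```

Each extra instance multiplies the "everything is down" probability by another factor of (1 - a), which is why going from one instance to two helps far more than going from five to six.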

RPO and RTO

These two metrics define our disaster recovery requirements. They’re the first questions any interviewer will ask about DR.

RPO and RTO on a Timeline
Last Backup ←── RPO ──→ DISASTER ←── RTO ──→ System Restored
RPO: how much data we lose
RTO: how long until we're back

RPO (Recovery Point Objective): how much data can we afford to lose? If our RPO is 1 hour, we need backups at least every hour. RPO = 0 means we can’t lose any data, which requires synchronous replication (asynchronous replication can still lose the last few writes).

RTO (Recovery Time Objective): how fast must we recover? If our RTO is 30 minutes, we need to be back online within 30 minutes of a failure.

Lower RPO and RTO = more expensive. A bank might need RPO = 0 and RTO = 5 minutes. A personal blog? RPO = 24 hours and RTO = “whenever we get around to it” is fine.
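As a quick sanity check, we can turn these definitions into arithmetic. The sketch below is a simplification under stated assumptions: it treats the backup interval as the worst-case data loss (ignoring how long a backup takes to run), and the recovery-time estimate is a number we would have to measure in a restore drill. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # If disaster strikes just before the next backup, we lose a full interval of data.
    return backup_interval


def meets_objectives(
    backup_interval: timedelta,
    estimated_recovery_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """True when the backup schedule satisfies the RPO and the recovery estimate satisfies the RTO."""
    return (
        worst_case_data_loss(backup_interval) <= rpo
        and estimated_recovery_time <= rto
    )


# Target from the text: RPO of 1 hour, RTO of 30 minutes.
hourly = meets_objectives(
    backup_interval=timedelta(hours=1),
    estimated_recovery_time=timedelta(minutes=20),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=30),
)
nightly = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_recovery_time=timedelta(minutes=20),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=30),
)
print(hourly, nightly)
```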

Backup Strategies

  • Full backup: copy everything. Simple but slow and storage-heavy. Usually done weekly.
  • Incremental backup: only copy what changed since the last backup of any kind. Fast and small, but recovery needs the last full backup plus every increment since, replayed in order.
  • Differential backup: copy what changed since the last full backup. Bigger than incremental but simpler to restore, because we only need the last full backup plus the latest differential.

The golden rule: test our backups regularly. An untested backup is not a backup. We should be doing restore drills, not just hoping the backup works when disaster strikes.
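The difference between the three strategies shows up at restore time. The sketch below works out which backup files a restore needs, given a chronological list of backups. The `restore_chain` function and the file names are hypothetical; real backup tools track this chain for us.

```python
from typing import List, Tuple

Backup = Tuple[str, str]  # (name, kind) where kind is "full", "incremental", or "differential"


def restore_chain(backups: List[Backup]) -> List[str]:
    """Return the backups needed to restore to the latest one, in replay order."""
    chain: List[str] = []
    i = len(backups) - 1
    while i >= 0:
        name, kind = backups[i]
        chain.append(name)
        if kind == "full":
            return list(reversed(chain))
        if kind == "differential":
            # A differential only depends on the most recent full backup before it.
            i -= 1
            while i >= 0 and backups[i][1] != "full":
                i -= 1
        elif kind == "incremental":
            # An incremental depends on whatever backup came immediately before it.
            i -= 1
        else:
            raise ValueError(f"unknown backup kind: {kind}")
    raise ValueError("no full backup found to start the restore from")


incremental_week = [("sun-full", "full"), ("mon-inc", "incremental"), ("tue-inc", "incremental")]
differential_week = [("sun-full", "full"), ("mon-diff", "differential"), ("tue-diff", "differential")]
print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

The incremental week needs all three files, while the differential week needs only the full backup and the latest differential. That is the storage-versus-restore-simplicity trade-off in concrete form.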

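# PostgreSQL backup example (pg_dump produces a logical dump of one database)
# Full backup to a plain SQL file
pg_dump -h localhost -U admin mydb > backup_full_2024-03-15.sql

# Daily backup with compression (schedule it with cron or a systemd timer to automate it)
pg_dump -h localhost -U admin mydb | gzip > backup_$(date +%Y%m%d).sql.gz

# Restore from backup (the target database must already exist)
gunzip -c backup_20240315.sql.gz | psql -h localhost -U admin mydb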
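A cheap first check we can automate right after each compressed dump is confirming that the archive decompresses cleanly and is not empty. The Python sketch below does that with the standard library. It is not a substitute for a real restore drill, and the `gzip_is_readable` helper and the demo file names are assumptions for illustration.

```python
import gzip
import os
import tempfile
import zlib


def gzip_is_readable(path: str, chunk_size: int = 1024 * 1024) -> bool:
    """Return True if the gzip file decompresses cleanly and contains some data."""
    total = 0
    try:
        with gzip.open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError, zlib.error):
        # Missing, truncated, or corrupted archives raise one of these.
        return False
    return total > 0


# Demo with throwaway files: one good archive and one cut off halfway.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "backup_good.sql.gz")
    with gzip.open(good, "wb") as handle:
        handle.write(b"CREATE TABLE demo (id int);\n")
    truncated = os.path.join(tmp, "backup_truncated.sql.gz")
    with open(good, "rb") as src, open(truncated, "wb") as dst:
        data = src.read()
        dst.write(data[: len(data) // 2])  # simulate an interrupted upload
    good_ok = gzip_is_readable(good)
    truncated_ok = gzip_is_readable(truncated)

print(good_ok, truncated_ok)
```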

Multi-Region Architecture

For serious HA, we run our system across multiple regions (e.g., Mumbai + Singapore). If an entire region goes down, the other takes over. This involves:

  • Database replication across regions (async is common, sync is slow over distance)
  • DNS failover — Route 53 health checks or similar, redirecting traffic to the healthy region
  • Data consistency trade-offs — with async replication, a failover might lose the last few seconds of writes (this ties back to RPO)
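The decision a DNS failover policy makes can be sketched as "send traffic to the first healthy region in priority order". The snippet below is a toy model of that rule, not how Route 53 is configured; the `pick_region` function and the region labels are illustrative.

```python
from typing import Dict, List, Optional


def pick_region(priority: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in priority:
        # A region with no health report is treated as unhealthy.
        if healthy.get(region, False):
            return region
    return None


regions = ["mumbai", "singapore"]  # primary first, then the standby
print(pick_region(regions, {"mumbai": True, "singapore": True}))
print(pick_region(regions, {"mumbai": False, "singapore": True}))
print(pick_region(regions, {"mumbai": False, "singapore": False}))
```

In practice DNS caching delays the switch, so how quickly clients actually move to the standby depends on record TTLs as well as the health check interval.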

Health Checks and Failover

Load balancers and orchestrators use health checks to detect failures:

  • Liveness check — “is the process alive?” If not, restart it.
  • Readiness check — “can it handle traffic?” A server might be alive but still warming up its cache.

When a health check fails, traffic is automatically routed away from the unhealthy instance. In Kubernetes, a failed liveness probe makes the kubelet restart the container. A failed readiness probe removes the pod from the Service’s endpoints, so it stops receiving traffic but is not restarted.
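The split between the two checks is easier to see in code. The sketch below models what an app's probe endpoints might return; the `/healthz` and `/readyz` paths are a common convention rather than a requirement, and `AppState` and `probe` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    alive: bool = True        # the process is running and not deadlocked
    cache_warm: bool = False  # startup work that must finish before serving traffic


def probe(path: str, state: AppState) -> int:
    """Return the HTTP status code a probe endpoint would answer with."""
    if path == "/healthz":
        # Liveness: a failure here asks the orchestrator to restart us.
        return 200 if state.alive else 503
    if path == "/readyz":
        # Readiness: a failure here only takes us out of rotation.
        return 200 if state.alive and state.cache_warm else 503
    return 404


state = AppState()
print(probe("/healthz", state), probe("/readyz", state))  # alive, but still warming up
state.cache_warm = True
print(probe("/healthz", state), probe("/readyz", state))  # alive and ready for traffic
```

Keeping readiness stricter than liveness is the point: a warming-up server should be left alone to finish, not restarted in a loop.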

Chaos Engineering

How do we know our HA setup actually works? We intentionally break things in a controlled way and see if the system recovers.

Netflix pioneered this with Chaos Monkey — a tool that randomly kills production instances during business hours. If the system handles it gracefully, great. If not, we found a weakness before our users did.

Other chaos experiments:

  • Kill a database primary and see if failover works
  • Inject network latency between services
  • Fill up a disk to 100%
  • Simulate an entire availability zone failure

The point isn’t to cause outages — it’s to build confidence that our systems can handle real failures. Start small (kill one pod), build up to bigger experiments (simulate a region failure) as confidence grows.
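The shape of a small chaos experiment can be sketched as a simulation: pick a random running instance, take it down, and check the steady-state condition still holds. Everything here is illustrative (the instance names, the `kill_random_instance` helper, and the "at least one instance running" rule); a real experiment would act on actual infrastructure and measure real user-facing metrics.

```python
import random
from typing import Dict


def kill_random_instance(instances: Dict[str, bool], rng: random.Random) -> str:
    """Mark one randomly chosen running instance as down and return its name."""
    running = sorted(name for name, up in instances.items() if up)
    if not running:
        raise RuntimeError("nothing left to kill")
    victim = rng.choice(running)
    instances[victim] = False
    return victim


def service_is_up(instances: Dict[str, bool], minimum_running: int = 1) -> bool:
    # Steady-state check: enough instances remain to serve traffic.
    return sum(instances.values()) >= minimum_running


pool = {"web-1": True, "web-2": True, "web-3": True}
rng = random.Random(42)  # fixed seed so the experiment is repeatable
victim = kill_random_instance(pool, rng)
print(f"killed {victim}; service up: {service_is_up(pool)}")
```

Defining the steady-state check before breaking anything is the important habit; without it we cannot tell whether the system really survived.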

In simple language, HA is about redundancy (so there’s always a backup), RPO/RTO define how much failure we can tolerate, backups save us from data loss (but only if we test them), and chaos engineering proves it all actually works before a real disaster does.